The Social Data Revolution

MS&E237

Spring 2010

Stanford University

Andreas S. Weigend, Ph.D.

The Social Data Revolution:

Data Mining and Electronic Business

Andreas Weigend (

April 6, 2010

Class 3: Course Content, Wikimedia, SKOUT

This transcript:

Corresponding audio file:

To see the whole series: Containing folder:

Course Wiki:

Andreas:Welcome to class number three of the Social Data Revolution. I looked through all of the non-binding initial proposals for your projects, your ideas, and I identified one very serious problem: you think it’s very easy to get data. It’s actually not easy.

For instance, Intuit was not able to get data to us because their legal department is worried. If you think about it, someone could make up a story that during tax season Intuit is actually sharing some data that could be personally identifiable with a class, and that might make their stock drop by 1%. That is certainly not a risk they were willing to take.

I want to start today’s class with one data source we can happily use, which is the data by the Wikimedia Foundation. You all know Wikipedia. Who of you has edited an entry in Wikipedia? That’s a small number. Why did the rest of you never change or edit anything? Was it too much work, too unclear, too hard?

Student:The level of organization needed to reformat a page would be too much, so I wouldn’t bother. Most of the time I wouldn’t find too many errors, I guess.

Andreas:We have Eugene Kim and [0:01:47 Haway Fung]. [Haway] went to Stanford, and Eugene went to Harvard. They went to good schools. I hear Harvard is somewhere on the east coast.

Eugene:The Stanford of the east coast.

Andreas:They do work with Wikimedia. I thought I’d take the first 15 minutes today, giving them the platform, giving us some ideas of what we could be getting in there. In the spirit of social data, Wikimedia/Wikipedia is a beautiful example of people knowingly and willingly sharing data. At the very end of class today we have a startup I work with on the board, SKOUT, which is another example of a company where you can get some ideas about what you could do with data, which people are happy to share. I will turn it over to you.

Eugene:It’s a pleasure to be here and to talk with you all about Wikimedia. I want to do a few more interactive things, just to get … raise your hands. I’m going to ask some questions and if you qualify, if your answer is yes, I want you to actually stand up. The question is how many of you have ever accessed Wikipedia? If you’ve ever accessed Wikipedia, stand up. Very good. Are you not standing because of the laptop?

100% here. If you accessed Wikipedia today, remain standing. Okay, and then we have the number of people who said they had edited. If you have ever edited Wikipedia, remain standing. If you have ever edited Wikipedia more than once, remain standing. 3 people, very nice. Have you been editing Wikipedia for more than a month? Remain standing. 2 people left. How long have you been editing Wikipedia?

Student:I haven’t done it in the last month but I’ve been doing it for several years.

0:04:18

Eugene:What are some articles you’ve edited?

Student:Mainly around different types of web development and geography specific stuff that has to do with what I do for my web -

Eugene:Stuff that you’re personally interested in. How about you?

Student:The same, maybe 3 or 4 years. Not to make a point out of it, but when I’m reading an article that I got to and am interested in, and I see something missing or wrong, I’ll correct it or add it. Frequently it’s around politics, music, or web development, things I’m interested in.

Eugene:Very good, thank you. Everyone here has accessed Wikipedia before. Everyone here knows you have accessed Wikipedia before. I presume everyone knows what Wikipedia is. This is your professor’s Wikipedia page. Wikipedia is basically an encyclopedia that anyone can edit. You can see that edit tab. This is the new Wikipedia interface. Probably most of you haven’t seen this yet. It will roll out in the next couple of weeks. This will be what you see.

One of the things you’ll notice about Wikipedia is that it’s edited by anyone. If you look at the revision history of this particular page, you can see that it was last edited in July of 2009, by JamesAM. Do you know who that person is? Do you recognize any of these names?

Andreas:Toby

Eugene:Toby, who is not on it. He’s probably further down this list. Basically, if you have a Wikipedia page or if there are things about you, things you’re interested in, it’s quite possible that completely random people are contributing to pages about you. Somehow it works, and it’s amazing. On the English Wikipedia there are about 3 million articles. All of the Wikimedia sites combined essentially make up the 5th most accessed website in the world today. Approximately 400 million people access Wikipedia every month, which is pretty cool.

What people don’t necessarily know is there is actually a vision underlying Wikimedia. This vision has actually been in place for a long time. The way it’s articulated best: imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment. It’s a lofty goal.

0:06:43

What I’ve been working on with the Wikimedia Foundation, and the question we’re grappling with, is how are we doing towards that goal? We know we have big projects now, and we know there are a lot of people who know about it, a lot of people access it. How close are we to getting to the point where everyone in the world has access to the sum of all knowledge?

These are questions we’re asking ourselves. What I was tasked to do is to basically lead an open strategic planning process where we develop a 5-year strategic plan, meaning we basically decide what the priorities are, what we should focus on over the next 5 years. If we start with this vision statement, there are a lot of things we can start pulling out of this. One of them is “every single human being” so that’s essentially a question of reach. Are we reaching every single human being right now?

This is sort of a theory of change. What it’s doing is articulating all the different factors that are going to contribute to the goal and their relationships to each other. You don’t have to worry about the level of detail right here. The main thing to point out is there is a virtuous circle between reach, quality, and participation. All three of those things are heavily related to each other. If we improve the reach of Wikipedia or the other Wikimedia projects, that is going to increase quality and increase participation; all of these things are going to reinforce each other. That’s the theory we’re operating on right now.

In terms of reach, we started by asking a couple of questions. We know that 400 million people are accessing Wikipedia every month. Based on the world population, that’s about 15%. The question is what is the actual country break down, where are people accessing Wikipedia from? Based on that country break down, should we actually be prioritizing certain regions of the world?

There are questions about what kind of content people actually want. If people are not accessing Wikipedia, or any of the other Wikimedia content projects, is it because the content they’re actually looking for is not there now? Then there is the question - I mentioned the virtuous cycle before about how reach is related to participation. How do we convert readers to participants?

What we saw in the room just now was everyone is a reader, and maybe 2% or so are actually contributors. Is there a way to boost that number? That number is relatively consistent with what we know about Wikipedia right now.

This is the worldwide state of Wikipedia right now, in terms of reach. You’ll notice the dark blue is where we have the best penetration. Are there any Canadians in the room here? You guys are the most active Wikipedia readers right now, 40%. If you can believe it, in the United States, only about 35% of people who are online access Wikipedia. That’s knowingly or unknowingly.

When we think about reach, there is actually a significant population of people who are online in the United States who are not actually reading Wikipedia at all. Then if you look at the developing world, look at Africa. Africa is entirely under 30% in terms of access. If you look at China, India, Brazil, all of those places are clearly possible places where we can target. This next slide shows Internet growth.

0:10:15

Student:Do you have any idea who the … who don’t access…

Eugene:No, we have some clues and this is one of the reasons we’re here today. There are a lot of questions we have. There are some answers but a lot more we would like to know, so there is an opportunity to take our data, which is basically all publicly and freely available (that’s one of the ethics of the Wikimedia project), and to do really interesting analysis based on that.

This is showing the growth numbers. As you saw before, we’re underrepresented right now in Africa, China, and India, and yet those are sort of the fastest growing regions in terms of Internet access right now. These are actually Internet numbers. If you add mobile as well, mobile boosts those numbers but it doesn’t shift where these regions are growing. For example, Internet growth is very high in China right now; mobile growth is very high in China right now.

As you can see from this picture, in terms of prioritization, there are some clear opportunity spaces we should be looking at over the next 5 years, and one of the things we’re trying to figure out is how to target those areas.

I want to talk about participation and some of the questions we have. While all of this content is really great and I think we have enough stuff to start targeting different areas, and there are a lot of opportunities there, participation is the lifeblood of Wikipedia. It is what makes everything work. At the end of the day, if we don’t have people like the people who stood up here as contributors, there is no content for everyone in the world to access.

One of the priorities that has emerged over the last 5 years centers around this participation question. Some of the questions we have are: how do we encourage more participation, and what are the different types of participation? We talked a bit about why people are editing Wikipedia or how people are editing Wikipedia, and there are actually very different levels of activity.

There are people who are just correcting typos every once in a while. There are also people who are literally spending several hours a day tenderly editing content, adding content, looking at other people’s pages, fixing grammar, adding new comments, double checking sources, really involved in the community and participating on the meta level as well: talking to other Wikipedians online, forging relationships, and all those other things.

0:12:36

How do we encourage new editors to become active editors? A big question for us right now is around community health. We believe very strongly that there is a strong correlation between the social health of the community and the quality of the content and the quality of participation that emerge. The question is how do we measure that? As I pointed out, there is a relationship between participation and quality, so what is that specific relationship?

Here are some numbers to give you a picture of what we understand about participation right now. The metric we use to differentiate between people who have just edited Wikipedia and people who are considered active is this: if you make 5 or more edits per month, you’re considered an active Wikipedian. That’s an arbitrary number. We needed to pick something so we picked 5 and we’ve been using that number ever since. This is about a 4 or 5 year old metric.
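
As a rough illustration of the active-contributor metric described above, the following minimal sketch counts, per month, the editors who made 5 or more edits. The log format (one `(editor, month)` pair per edit), the function name, and the sample data are all assumptions made for this example; they are not Wikimedia's actual data format or tooling.

```python
from collections import Counter

ACTIVE_THRESHOLD = 5  # edits per month, the cutoff quoted in the talk


def active_editors_per_month(edits, threshold=ACTIVE_THRESHOLD):
    """Count active editors per month.

    edits: iterable of (editor, month) pairs, one pair per edit,
           with month written as "YYYY-MM" (a hypothetical log format).
    Returns {month: number of editors with at least `threshold` edits}.
    Months in which nobody reaches the threshold are omitted.
    """
    edits_per_editor_month = Counter(edits)  # (editor, month) -> edit count
    active = Counter()
    for (_editor, month), count in edits_per_editor_month.items():
        if count >= threshold:
            active[month] += 1
    return dict(active)


# Made-up sample log, for illustration only.
sample_edits = (
    [("alice", "2007-01")] * 6    # active in January
    + [("bob", "2007-01")] * 2    # below the threshold
    + [("carol", "2007-02")] * 5  # exactly at the threshold, so active
    + [("alice", "2007-02")] * 1  # below the threshold in February
)

print(active_editors_per_month(sample_edits))
```

Because the cutoff is a parameter, the same function can be rerun with a different threshold to see how sensitive the active-contributor count is to the choice of 5.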

This chart measures a couple of different things. Number one, the area of each of the circles represents the size of the different projects. One of the things I didn’t mention before is that Wikipedia actually consists of 250 different projects. Each language gets its own Wikipedia. Each of those language versions is essentially its own community. Each has its own contributors and its own governance rules. Occasionally there is some translation between the Wikipedias but for the most part it’s all original content.

English Wikipedia, which was the first one that started, that big circle to the left, that’s the biggest Wikipedia right now. There are over 3 million articles. After that, you can see Germany is close behind and then France, Japan, and the Spanish Wikipedia. You can see on the y axis, that’s measuring the number of active contributors. Not surprisingly, the biggest Wikipedias also have the highest number of active contributors. The x axis is measuring contributor growth.

These larger Wikipedias are actually not growing at all, or growing slightly. Then we have Russia which is sort of a mid-sized Wikipedia and it’s growing quite rapidly. It’s an outlier in this stuff. Then we have the smaller Wikipedias here, and in some cases they’re growing and in some cases they’re kind of dying. They’re kind of [in stasis] right now. This is sort of a picture of where all the Wikipedias are right now.

One of the things some of you might have read in the news recently is that editors are leaving Wikipedia in droves. It’s a typical interesting media spin on things, but one of the things that’s absolutely true is that participation across the different Wikimedia projects has actually tailed off. This is a picture of the number of active contributors per month, starting in January 2001, which is when Wikipedia started, all the way to January 2009.

You can see we basically peaked in January 2007, and all of a sudden we’ve got this weird behavior going on. Something happened in 2007 and we don’t know what it was to cause this kind of behavior. There is speculation about maybe there was something policy wise or some technological change that happened, or maybe the community has just gotten too big and is starting to get unfriendly, so you’ve hit a natural limit.
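
One simple way to locate a peak like the one described above is to take a monthly series of active-contributor counts, find its maximum, and look at month-over-month growth around it. The sketch below assumes a chronological list of `(month, count)` pairs; the function names and the numbers are made up for illustration and are not the figures from the chart.

```python
def peak_month(series):
    """Return the (month, count) pair with the highest count.

    series: chronological list of (month, active_contributors) pairs.
    If several months tie, the earliest one is returned.
    """
    return max(series, key=lambda item: item[1])


def month_over_month_growth(series):
    """Return [(month, relative change versus the previous month), ...]."""
    growth = []
    for (_prev_month, prev), (month, current) in zip(series, series[1:]):
        growth.append((month, (current - prev) / prev))
    return growth


# Made-up numbers, for illustration only.
sample_series = [
    ("2006-11", 800),
    ("2006-12", 900),
    ("2007-01", 1000),
    ("2007-02", 950),
    ("2007-03", 950),
]

print(peak_month(sample_series))
print(month_over_month_growth(sample_series))
```

Growth that is positive up to the peak and then hovers around zero or below afterwards is the plateau pattern the talk describes.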

0:16:12

We’ve found, and this is all research that Ed Chi’s group did at Xerox PARC, that when you look at all of the different projects, around January 2007 you have the same plateau effect, even though all of the projects are self-sustained, they’re all different sizes, and they all have their unique community. Something happened in January 2007 that was probably a worldwide phenomenon that basically affected all of Wikipedia growth. We don’t know what that is. We have some guesses and it will be interesting to explore that. That’s also a potential project.

This is showing the same kinds of effects for some of the smaller Wikipedia projects.

This is a chart that Ed Chi’s research group also did, and it measures reversion rates. One of the things you can do in Wikipedia is if you see someone has edited a Wikipedia page and you think it’s a bad edit, you can revert it so it goes back to what it was previously. These different colored lines measure the activity of the user. These purple and blue lines below are people who are very experienced editors, people who have edited between 100 and 1,000 times or more. In this red line up top, you see people who have made one edit, and the line below it is people who have made fewer than 10 edits.

On the left, we’re measuring the rate at which people’s edits are reverted. I come in and edit a Wikipedia page for the first time. If that edit sticks around then that’s great. If the edit gets reverted, that counts in terms of the reversion rate. What we’re finding is there is actually a big class difference happening now on Wikipedia between new editors and experienced editors. New editors tend to get punished, or not, because of the reversion rate. Experienced editors seem to sort of get away with that.
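
The reversion rate described above is the share of edits that were later reverted, computed separately for each editor-experience group. Here is a minimal sketch under stated assumptions: each edit is recorded as a pair of the editor's total edit count and whether the edit was reverted, and the bucket boundaries are illustrative guesses based on the groups mentioned in the talk, not the exact bins used in the chart.

```python
from collections import defaultdict


def experience_bucket(total_edits):
    """Map an editor's lifetime edit count to a bucket (hypothetical bins)."""
    if total_edits <= 1:
        return "1 edit"
    if total_edits < 10:
        return "2-9 edits"
    if total_edits < 100:
        return "10-99 edits"
    if total_edits < 1000:
        return "100-999 edits"
    return "1000+ edits"


def reversion_rate_by_bucket(edits):
    """Compute reverted edits / all edits for each experience bucket.

    edits: iterable of (editor_total_edits, was_reverted) pairs,
           one pair per edit (an assumed input format).
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for total_edits, was_reverted in edits:
        bucket = experience_bucket(total_edits)
        totals[bucket] += 1
        if was_reverted:
            reverted[bucket] += 1
    return {bucket: reverted[bucket] / totals[bucket] for bucket in totals}


# Made-up sample, for illustration only.
sample_reverts = [
    (1, True), (1, False),                           # first-time editors
    (5, True), (5, False), (5, False), (5, False),   # fewer than 10 edits
    (2000, False), (2000, False), (2000, False), (2000, False),  # veterans
]

print(reversion_rate_by_bucket(sample_reverts))
```

Tracking these per-bucket rates over time, rather than as a single snapshot, is what would show whether the gap between new and experienced editors is widening.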

There is a question about whether or not reversion rate is an indication of community health. Perhaps it’s actually a sign of increasing quality on Wikipedia, the fact that you’re getting a lot of random people who are coming on board, and in fact there is a lot more noise. You have to revert more frequently. Or, maybe there are other things that it’s indicating. These are questions we’re trying to figure out.

That leads to the last thing I wanted to mention, which was around quality. I don’t have any charts to show you on that right now, but I can cite different studies around this. The main questions we have right now are how do we measure quality of content, and how do we measure the quality of experience? Quality is not just about content, but people’s experience on the site itself. How do we indicate the quality of content?