Data Mining and E-Business: The Social Data Revolution

Transcript of Andreas Weigend

Stanford University, Dept. of Statistics

Andreas Weigend (www.weigend.com)

Data Mining and Electronic Business: The Social Data Revolution

STATS 252

May 18, 2009

Class 7 Recommend: (Part 3 of 3)

This transcript:

http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_7dinner_2009.05.18.doc

Corresponding audio file:

http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_7dinner_2009.05.18.mp3

Previous Transcript: (Part 2 of 3):

http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_7recommend-2_2009.05.18.doc

To see the whole series: Containing folder:

http://weigend.com/files/teaching/stanford/2009/recordings/audio/

Ron: I’m Ron Chung, and I’m here with Nick Kallen, Yu-Shan Fung, and Nova Spivack, along with Andreas Weigend, who wants to talk with us all about the social data revolution.

Andreas: In class today, we were talking about real time search. That means that we have a time scale which is no longer the time scale of indexing and re-indexing the web, which we are used to from Google. It is a time scale of minutes, of really seeing the most recent stuff. The question that came out was where does that matter? Where does it matter about what’s happening in the world right now, as opposed to the integrated view, which has a more balanced view, going back into the past? Who wants to take this one?

Nick: I think I would start by asking a more basic question, which is what is the difference between real time data and non-real time data? Is this actually a new phenomenon? That is the first thing; we should have some skepticism. Is the claim that there is this new type of data and it needs to be indexed in a different way?

There are real time search engines, like Twitter Search; we claim that Google is not a real time search engine. What is the difference? Why can’t Google be a real time search engine right now?

Andreas: The question about real time and time scales has come up on Wall Street over the years. It turns out that the key differentiator now is where your box sits in the colocation facility; for Wall Street trading, the speed of light makes a difference. Making decisions by front-running somebody else is very different from being the one who comes in second, when the price has already shifted.

On the question we are talking about, Nova, you gave some good examples in your presentation of where you think real time actually makes a difference. They were not uncontroversial ones, but I would be interested if you could share those with the audience.

Nova: I think having real time access and searching the current moment is really important when there is some kind of live event or ongoing event that you want to keep track of and you want to participate in it when you’re not there. There is this need to see what’s happening, right when it’s happening, and not afterwards, when you want to see the perspective of a lot of people, rather than reading one person’s perspective later on, whether it’s a movie and you want to see what people think about it right when it opens, and see a lot of perspectives, and not just one review, or it’s something like Burning Man and you want to see what’s going on when you’re not there.

0:02:40.6

Andreas: I would argue that there is a reason I don’t want to see ten thousand tweets about the plane landing in the Hudson River. It’s a really high opportunity cost; I could be doing other things in that time. I’m not happy if I see random stuff because I believe we can do better than random. I’m actually surprised because, let’s say, at Amazon.com we don’t see whichever item happens to have most recently arrived at the warehouse. At Google, we don’t see whichever page happens to have changed most recently. But when it comes to Twitter, to the live web, to real time search, suddenly we throw away everything we have learned about the cost of interruption, the cost of search, and attention costs; we throw it away and succumb to an ordering which is pretty much random, namely whichever came last. How is it possible that people give up their hope of having someone or some group edit or curate for them, which actually helps increase the signal-to-noise ratio, and instead just go for random stuff?

Nick: I beg to differ on whether it’s random. We’ve experimented a bit at Twitter with presenting information in different ways, using traditional relevancy algorithms to present tweets, and honestly, we’ve found that presenting information in chronological order is the most relevant way to present it to the end user.

That’s a function of the type of questions that people are asking of Twitter Search. When you ask a question of Twitter Search, it tends to be the case that the information is less accurate, the older the information is. The kind of relevancy, the contemporaneous nature of the tweets that are being produced, as you’re issuing the query, they’re more relevant because they’re of a temporal character. They’re about events that are unfolding. I’m at a conference and I want to know which talks are interesting. I’m going to see a movie tonight and I want to see where the crowds are or I want to get tacos at the Korean Taco Truck; where is it right now. Something crazy is happening; there is smoke on Geary Street; what’s going on? Really, the accuracy of the data is as much about the recency of the data as it is about anything else.

Yu-Shan: I would like to challenge that: is that really the right dichotomy to be thinking about? Does it have to be chronological versus relevance? If you think about search, the Holy Grail of search is to return only one result. In that case, the presentation question of whether it’s chronological or relevant doesn’t matter; you want to find the one thing you want to know at that point. In that case, shouldn’t you really want to factor in the temporal nature of the tweets, take into account how long ago something was said and so forth, and really come up with what is going to be most relevant for that given moment?
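
One way to picture the idea of factoring in both relevance and how long ago something was said is a relevance score discounted by exponential time decay. This is only an illustrative sketch and not something the speakers describe: the function name `blended_score`, the 30-minute half-life, and the sample relevance values are all assumptions.

```python
def blended_score(relevance: float, age_minutes: float,
                  half_life_minutes: float = 30.0) -> float:
    """Hypothetical score: relevance discounted by exponential time decay.

    The score halves every `half_life_minutes` (an assumed parameter).
    """
    decay = 0.5 ** (age_minutes / half_life_minutes)
    return relevance * decay


# Hypothetical tweets: (label, relevance in [0, 1], age in minutes)
tweets = [
    ("old but highly relevant", 0.9, 120.0),
    ("recent and fairly relevant", 0.6, 5.0),
    ("brand new but off-topic", 0.1, 1.0),
]

# Rank by the blended score, highest first.
ranked = sorted(tweets, key=lambda t: blended_score(t[1], t[2]), reverse=True)
for label, relevance, age in ranked:
    print(f"{blended_score(relevance, age):.3f}  {label}")
```

With a short half-life the ordering approaches pure chronology; with a very long one it approaches pure relevance, so the two positions in the discussion sit at opposite ends of a single parameter.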

Ron: Are you talking about in the context of Mr. Tweet?

Yu-Shan: Not necessarily, we’re talking about real time search. I don’t really see that it has to be one way or the other.

0:06:19.6

Nick: I agree that these are not mutually exclusive, although I would take issue with the comment you just made that the Holy Grail of search is presenting one result. We, as people who are thinking about these issues, should have mental models of what exactly is the purpose of a search query. Are people asking questions that they’re hoping to get a factual answer to, like who killed John F. Kennedy, or when was so-and-so born? If so, then there is a single document that answers such a question. If that’s not the class of queries that people are asking, then maybe it’s not the Holy Grail, one single result.

Nova: Even with factual questions, there could be different answers. You could ask a question about geo-political boundaries and the answer differs with time. In 1960, a country might be inside another country and then afterwards it’s independent. Similarly, even in the present, different parties might disagree about how to represent the world.

I think you need to provide multiple perspectives; even when you think you know the answer, you should still show alternatives. I think when we look at the real time domain, its purpose is really to see what’s going on in the present. From that perspective, there are many things going on in the present, and there is no one right answer.

Andreas: My perspective is that dialog is the mode of interactivity, as in real life: we don’t give speeches when we talk with our friends. Those people who do tend not to fare all that well with their friends. We like to throw something out and see what comes back.

What I don’t understand is why that dialogical mode is not as well supported as it should be by search engines. We saw Twine today, and what I really enjoyed there was that you seem to be supportive of it, saying, “Here are a number of categories that are coming back; which one are you interested in?”

Nova: Where Twine is heading is increasingly towards helping users filter the web and track their interests. To do that, we have to understand what they really want. It’s really hard to ask a user to tell us that when they do their query because often they don’t even know.

What we can do is, based on what they tell us, we can then say, “Here are some other things we think you might be interested in,” to keep narrowing it down. It’s kind of a dialog interaction where the user gives us a little information; that enables us to give them some possible next steps and we keep going like that.

That is missing in a lot of search engines today, and even in websites. I think this conversational interface is coming back, and certainly, with the rise of real time and the stream and services like Twitter, conversation is actually becoming more and more a part of the web. I think it will also become part of the user interface as people get more and more used to being able to converse with search engines and other applications.

0:09:41.9

Andreas: Throughout history, we have realized that a new technology which has been deployed tends to imitate the old technology. Then we found out that television really wasn’t just a better radio. We found out that the web definitely wasn’t just a better television. I think what we are going to see soon is that Twitter is not just a better messaging service. I am personally very excited to see what people are going to use these things for, and to support new uses, and then to see what works and what doesn’t.

Ron: I would like to thank everyone for coming and chatting a little bit about real time messaging systems: Twitter, Twine and its search engine-like categorization system, and Mr. Tweet with its relevancy on Twitter users; and lastly, Andreas Weigend for really driving the social data revolution.

Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/