Accuracy and Fluctuations in Modeling User Interests for News Filtering

Context and Interest Fluctuations in User Profiles for News Filtering[*]

Kenrick J. Mock

Intel Architecture Labs

Intel Corporation

Abstract

This work examined the issue of accuracy in modeling users within the task of browsing Usenet newsgroups. Two experiments were conducted. In the first experiment, subjects were presented with a series of news articles and asked to browse the articles and read those that appeared interesting. After reading an article, subjects classified it as being of interest, of disinterest, or ambivalent. Upon completion of the browsing phase, subjects then read and classified all articles. In the second experiment, the same task was performed except that subjects were split into two groups. Articles were displayed to the first group as in the first experiment, but articles were displayed to the second group with additional context in the form of a line of text from the body of the article. The experimental results indicate that with the current system of browsing, users leave unread many messages that they would be interested in reading. Furthermore, the addition of a line of text for context helped users identify articles of disinterest better than when the text was not provided. Finally, the results indicate that users often change their minds about whether they like or dislike a particular article. These results suggest that a news filter would be a great aid in finding articles likely to be of interest that are normally missed. However, the accuracy of such a filter will be limited by human inconsistencies. Complementary techniques such as data visualization should also be explored to better facilitate filtering by humans.

Keywords: news filtering, user profile, user modeling, model accuracy, user behavior

1. News Filtering and User Behavior

Due to the enormous size of the Internet, users have been forced to rely upon technology to ease the burden of search. Recently, several systems have been created that address the issue of information overload from the perspective of filtering (Mock, 1996; Lang, 1995; Lashkari et al., 1994; Sheth, 1994; Mauldin, 1991). Although relevant to other areas, the work reported in this paper focuses specifically upon the problem of filtering Usenet news.

The goal in filtering news or other articles is to select, out of a large pool of data, only those articles that are likely to be of interest to a particular user. This goal is typically achieved by implementing an individualized user model that is either specified by the user, learned by observing user actions, or learned through direct user feedback (Stevens, 1992). Most existing systems require users to rate articles once they are read according to a classification scale ranging from “No Interest” to “High Interest” (Mock, 1996; Lang, 1995). These data are then used to construct the user model or user profile. Once the profile has been constructed, it can be applied to new articles. Articles that match the user’s profile will pass through the filter, and articles in conflict with the user’s profile will be discarded.

Although a number of systems, algorithms, and machine learning techniques have been created to filter Usenet news, an important question has not been examined in great detail: Is a filter even necessary in this domain? No studies have established whether existing methods of browsing and newsreading already give adequate performance. How many articles are users currently reading that they would prefer not to read? Conversely, how many articles are users not reading that they would like to read? To explore these issues, two experiments were conducted. These studies examined only the issue of browsing through articles, not the issue of searching for a particular article or subject.

2. Gauging the Need for Information Filtering

In the first experiment, the classifications users made while browsing articles with a conventional news reader were compared with the classifications they made when forced to read all articles. In a conventional news reader, users are given a list of articles in which the author and subject are displayed, as shown in Figure 1. Almost all newsreaders use this format. The articles are typically sorted by threads, where each thread tracks a post and its replies.

a  A Leonard          1  >Police Action
b  Khristy Parker     4  >The fires in California
   Bereli Goobis
   Antonio Poe
c  Marvin Ottolini    2  >Make Money Fast!

Figure 1: Sample screen from STRN newsreader, messages sorted by threads

One of the shortcomings of this format is that the subject of a message does not always accurately describe the content of a message. This is often the case in long threads when the topic of conversation has drifted away from the original subject. However, the change of content is usually not denoted with a new subject. Nevertheless, this format is popular today and has proven useful in supplying users with some indication of what articles are about and how the space of articles may be visualized. This experiment investigated whether or not this format for displaying messages provides sufficient information for users to accurately pick messages of interest.

2.1 Examining User Behavior - Experimental Method

The newsgroup selected for this study was the ucd.life newsgroup, a local newsgroup for the University of California, Davis. This newsgroup was selected because all of the subjects in the study were Davis students and the newsgroup covers a variety of topics likely to be of interest to the general community. Furthermore, this newsgroup received approximately 50 messages a day, a volume at which filtering may be useful. The subject matter of this newsgroup varied from Want-Ad postings to crime discussions. A selection of topics from the articles used for this experiment follows:

Any bad experience with J str. Apts?

Art 111 home page

Bicycle geeks

Bunbun’s reality revisions

Furniture for sale

Unabomber in Davis

Davis Police Department

A total of 144 sequentially posted messages from the newsgroup were selected for the study. These messages were sorted into threads and displayed to the user in the standard news reader fashion, giving the author and subject, as shown in Figure 1. Users were first instructed to browse the articles as they normally would, and read only those articles that looked interesting. After a user read an article, the system asked the user to classify the article as accepted if she found it of interest, rejected if she really did not want to read the article, or ambivalent if she was unsure. In this manner, all of the articles the user decided to read during browsing were assigned a classification of accepted, rejected, or ambivalent.

After the browsing phase was complete and the subjects were satisfied that they had read all the messages they felt to be of interest, the subjects were instructed to read all 144 messages. For each message, users provided the same classification as before. If the existing methods for displaying articles are sufficient and no filtering is necessary, then the message classifications during the browsing phase should closely match the message classifications from when all messages are read.

A total of 14 unpaid volunteer university student subjects participated in this study. All subjects were familiar with existing news readers and had read the ucd.life newsgroup in the past. The subjects were naive regarding the purposes of the experiment.

2.2 Examining User Behavior - Results

Figure 2 shows the combined classification results for all test subjects. Articles that were not read during the browsing phase do not appear in the figure. In addition to tallying the totals, the flip-flops were also counted. A flip-flop is a message that was read during both the browsing phase and the all-messages phase but was classified differently in each: the user switched the classification from reject to accept or from accept to reject. Switches to or from ambivalent classifications were not included in this value.
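The flip-flop count described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `count_flip_flops`, the label strings, and the dictionary layout (article id mapped to label) are assumptions made for the example.

```python
ACCEPT, REJECT, AMBIVALENT = "accepted", "rejected", "ambivalent"

def count_flip_flops(browse, read_all):
    """Count articles whose label switched between accept and reject.

    browse and read_all map an article id to one of the three labels
    (a hypothetical layout). Only articles read in both phases are
    compared, and any pair involving an ambivalent label is ignored.
    """
    decisive = {ACCEPT, REJECT}
    flips = 0
    for article_id, first in browse.items():
        second = read_all.get(article_id)
        if first in decisive and second in decisive and first != second:
            flips += 1
    return flips

# Invented labels for one subject, for illustration only.
browse = {1: ACCEPT, 2: REJECT, 3: ACCEPT, 4: AMBIVALENT}
read_all = {1: REJECT, 2: REJECT, 3: AMBIVALENT, 4: ACCEPT, 5: ACCEPT}
print(count_flip_flops(browse, read_all))  # prints 1
```

In the invented data only article 1 counts: article 2 keeps its label, and articles 3 and 4 involve an ambivalent label.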

Figure 2. Classification results for subjects browsing messages and reading all messages.

A total of 321 messages were classified as accepted or rejected by the collective subjects during the browsing phase. Of these 321 messages, 104 were classified differently when the subjects were forced to read all messages, resulting in a flip-flop rate of 32%. The percentage of messages accepted during browsing is 53%, while the percentage of messages rejected during browsing is 27% and the percentage of unknown messages is 19%. The unknown percentage would be higher if it included the unread articles. In the read-all phase, 2016 messages were read. Of these messages, 34% were marked as accepted, 37% as rejected, and 29% as ambivalent.
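Two of these summary figures can be re-derived from the reported counts. The snippet below is only an arithmetic check, assuming that each of the 14 subjects read all 144 messages in the read-all phase; the variable names are illustrative.

```python
subjects = 14
messages_per_subject = 144
total_read_all = subjects * messages_per_subject  # messages read in the read-all phase

decisive_browse = 321  # accepted or rejected while browsing
flip_flops = 104       # switched between accept and reject

flip_flop_rate = flip_flops / decisive_browse
print(total_read_all)           # prints 2016
print(f"{flip_flop_rate:.1%}")  # prints 32.4%
```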

2.3 Discussion - Finding Messages of Interest

During browsing, the subjects indicated that they were interested in only 212 messages (53% of the articles read). Only 109 messages were marked as rejected (27% of the articles read), indicating that the method of display did provide sufficient information for users to select more articles of interest than articles of disinterest. Many more messages went unexamined. In contrast, when the subjects read all messages, they indicated that they were actually interested in many more messages. A total of 697 messages were accepted, 329% of the number accepted during browsing.

One explanation for these results is that displaying messages by author and subject alone does not provide enough information to allow users to pick accurately the messages they would like to read. In interviews conducted after the experiment, 12 subjects indicated that they found most of the articles to have a different content than they had expected from reading the subject header alone. Another explanation for the higher acceptance count is that increased reading resulted in increased interest. Four of the subjects reported that they “started to get into it” after they started to read more messages. In other words, reading some messages generated additional interest in other messages, resulting in more messages classified as accepted.

To increase the chances that “missed” messages of interest are read, two direct approaches may be used. First, an intelligent filter could identify those messages likely to be of interest and alert the user. This works only as long as the user trusts the system and what messages the filtering system recommends. Second, the interface used for browsing could be improved to include content from the body of each message to give the user a better indication of what the message is about. This should allow readers to make a more informed choice of which message to read.

The counterpart to finding articles of interest is rejecting articles not of interest. At 37% of the messages read in the read-all phase, rejected is the largest of the three classifications, although not a majority. Additionally, users did a fairly good job of not reading articles they did not want to read: the number of rejected articles when subjects read all messages was 676% of the number rejected during browsing. This volume of rejected messages indicates that the capability to recognize these articles will certainly aid the reader in selecting relevant articles. It also indicates that this newsgroup contained many articles that users did not care to read.

2.4 Discussion - Inconsistency of User Interests

One of the unexpected results of this study was the high number of flip-flops: cases in which a subject classified a message one way during the browse phase, then classified the same message differently when all messages were read. Out of the 321 messages classified during browsing, 32% (104 messages) were changed during the read-all phase. One reason for this change may be the increased user interest that resulted from reading more articles. Furthermore, subjects typically read very few messages during the browsing phase. This limited exposure to articles is not enough to gauge accurately what threads of conversation are about. Additionally, some subjects had a narrow threshold between acceptance and rejection. Depending upon the context, or even the mood of the reader, articles could be classified either way.

These flip-flops raise an issue about the maximum possible performance a filtering system can achieve. With 32% of the classifications changing, a very large error will result due to the fickleness of the readers. Many of these flip-flops stem from the limited number of articles initially read by the user. In an ideal setting, a user would only browse the messages she is interested in reading, and based upon these the system will filter future articles. However, this study has shown that users do not read enough browsed articles to build up a model of user interests accurately. Some of this error can be reduced by forcing users to read more messages to provide additional context for the articles, or to simply wait until enough data about the user has been collected. Nevertheless, in the end, any filtering system is subject to the whims and inconsistencies of the human user, making 100% accuracy virtually impossible to achieve.

3. Impact of Additional Context

The data presented in the first experiment suggest that users do miss many articles they find interesting and that an intelligent filter system could be useful to help find these articles. In support of this claim, many systems report improved accuracy as a result of filtering algorithms (Edwards et al., 1996; Lang, 1995; Sheth, 1994). A complementary approach to filtering is information presentation and visualization. As described previously, subject headers are not always an accurate indicator of the content of an article, especially in long threads when the conversation has evolved but the subject line has not. A better presentation of the data that includes additional content from the articles should help the user select articles of interest more accurately, regardless of whether any filtering is performed. However, the cognitive overhead to process the additional content must be kept to a minimum.

To investigate the impact of adding content to the interface display, a second experiment was conducted. The expectation was that the additional content would allow users to filter articles more easily on their own. Consequently, users provided with the additional context should select more articles of interest than users without the context. Conversely, the users provided with the additional context should also select fewer articles of disinterest than the users without the context.

3.1 Additional Context - Experimental Method

In this experiment, subjects were separated into two groups. Both groups browsed and classified articles they selected to be interesting, then read and classified every article in a manner identical to the first experiment. The first group had articles displayed in the same manner as conventional newsreaders, which was also the same as the first experiment (Figure 1). However, the second group had articles displayed with a small amount of additional context in the form of the first line of text from the posted article. Quoted material was removed so that the first line of text displayed was new text written by the author of the message, not text written by a previous author.
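The extraction of the context line can be sketched as follows. This is a minimal sketch, not the procedure actually used in the study: it assumes that quoted material is marked with a leading ">" (the common Usenet convention), it does not handle attribution lines, and the function name `first_new_line` and the sample article are invented for illustration.

```python
def first_new_line(body, quote_prefix=">"):
    """Return the first non-blank line that is not quoted material.

    A line counts as quoted when, after surrounding whitespace is
    stripped, it starts with quote_prefix (an assumed convention).
    Returns an empty string when the body holds only quotes and blanks.
    """
    for line in body.splitlines():
        text = line.strip()
        if text and not text.startswith(quote_prefix):
            return text
    return ""

# Invented sample article body, for illustration only.
article = """> Is the meeting still on Friday?

Yes, same room as last week.
Bring the sign-up sheet."""
print(first_new_line(article))  # prints: Yes, same room as last week.
```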

Figure 3 shows a sample of the new article list displayed to the subjects in the second group. As in Figure 1, articles are sorted and displayed by thread. However, each article has one additional line containing the line of text. While the line of text may provide additional context, this context comes at the price of taking up valuable space on the screen and increasing the amount of cognitive processing required of the reader.

1. From: (Wilson)

Subject: Re: Incompetence in the White House? Urge Quayle to Run!

TextLine: We know Dan Quale. Dan Quale is a friend of ours. Gore is no

From: (RH)

Subject: Re: Incompetence in the White House? Urge Quayle to Run!

TextLine: Dan Quayle:

2. From: (Farkward P. Parkenfarker)

Subject: Re: More about the OZONE fraud

TextLine: You mean, you copied it so you could read it later? I suggest

From: (Michael Fern)

Subject: Re: More about the OZONE fraud

TextLine: No, fool.

From: (Paul Farrar)

Subject: Re: More about the OZONE fraud

TextLine: Hey Michael, here's another HALOE data paper that supports your

3. From: (Michael Zarlenga)

Subject: Re: US ATTACKS IRAQ

TextLine: How do we subsidize oil?

Figure 3. News display with addition of first line of text from the article.

The entire study was conducted over the World Wide Web. The articles were placed online, linked from the UC Davis AI Lab web page, and a form was created to allow subjects to read articles and enter their classifications. Subjects entering the website were randomly placed into group 1 or group 2 with equal probability. The subjects were therefore not selected; they were individual volunteers nationwide who happened to be browsing the web, came across the site, and elected to participate in the study. Since the content of the news articles was time-sensitive, the study was online for only three weeks. During this period, a total of 26 subjects volunteered for the study, 15 in the first group and 11 in the second group. Since the subjects were random web browsers, no information was available regarding their age, education, location, etc.
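The group assignment amounts to an independent fair coin flip for each arriving subject. The sketch below illustrates that idea only; the function name `assign_group` and the fixed seed are assumptions, and nothing here reflects how the original web form was implemented.

```python
import random

def assign_group(rng):
    """Place one arriving subject into group 1 or group 2, each with probability 1/2."""
    return rng.choice((1, 2))

rng = random.Random(0)  # fixed seed so the sketch is repeatable (an assumption)
groups = [assign_group(rng) for _ in range(26)]  # 26 subjects, as in the study
print(groups.count(1), groups.count(2))
```

Because each subject is assigned independently, the two groups need not end up the same size, which is consistent with an uneven split such as 15 and 11.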