Matt Ginzton
CS224N, Spring 2001
Final Project
6/11/2001
Project: Usenet (or email) message thread analyzer
- Description of problem
I wanted to work on live, real-world data, so for my domain I chose threads of Usenet articles. The idea is to examine the parent/child relationships between messages and analyze the degree to which child messages really respond to topics raised in their parent message(s); an approach based on Latent Semantic Analysis (LSA) seemed most promising, especially, one based on Pocklington and Best’s paper which used a similar idea to track the “replicatory success” of messages versus the concepts (found via LSA) that they contained.
Related attempts have been made by Berry and Dumais (with their work on LSA in general), and by Pocklington and Best (who actually worked on Usenet articles themselves). My project performs computations similar to those of Pocklington and Best, but uses the results to examine possible conceptual relations between messages (where they used the results to track the “replication” of messages containing conceptual “genes”).
- Algorithms/methods
The core algorithm I use is LSA, Latent Semantic Analysis. I create a term/document matrix where each column corresponds to a term in the whole corpus (several threads from one newsgroup), and each row corresponds to a document (an individual message from the threads). Each entry is a weight derived from the frequency of that term in the current document, and the number of documents containing that term relative to the entire corpus (Inverse Document Frequency, or IDF). This matrix is then decomposed via SVD (Singular Value Decomposition), yielding an orthonormal basis set of eigenvectors – the right orthonormal matrix is taken as the semantic subspace matrix, an indication of semantic concepts distilled from the term co-ocurrence information via SVD. Projecting the term/document matrix on to the semantic subspace matrix yields the term-subspace/document matrix, showing to what degree each document expresses each semantic concept. (This far, the algorithms are derived purely from Pocklington and Best’s LSA implementation as described in their paper, except for the IDF which I calculate slightly differently.)
I then take the entries from the term-subspace/document matrix (henceforth TSD), treating each thread individually, and for each message, look for the “most related” documents both older (taken to be possible parents) and newer (taken to be possible children). “Most related” could be calculated in several senses of similarity; I try taking messages whose TSD vector dot product is maximal (corresponding to the smallest angle between their vectors).
- Running my program
You invoke the thread analyzer as “thread.py”, which takes one optional parameter (the number of dimensions to track during SVD) and one mandatory parameter (the filesystem directory containing the thread data).
Note: you will need the optional “Numeric” package for Python to run this; unfortunately, it doesn’t seem to be installed on the Leland workstations. It’s necessary for the SVD algorithm which I didn’t want to reimplement, though. You can find information and downloadables for Numeric at
Datasets in “pseudothreaded” order (complete threads, ordered chronologically but with no parent/child information between messages) were obtained via cut-and-paste from Google’s Usenet archive ( I grabbed 3 threads each from the groups sci.skeptic and comp.sys.be.advocacy. The sample datasets are in directories named after the group, each containing one text file per thread.
Example command line: thread.py –k=100 sci.skeptic
This will read all the messages, perform LSA, and create a number of .html files in the data directory (sci.skeptic in this case), with various results of the calculations. Weights.html shows the term/document matrix with weights (after including the IDF), LSI.html shows the U matrix (right orthogonal matrix) output from SVD (each of the k rows here corresponds to one semantic “concept” found by LSA; for each concept, you can see the weight that each possible term contributes, as well as a sorted list of the terms in order of decreasing contribution), and TSD.html shows the term-subspace/document matrix, giving for each message (by thread) the degree to which it expresses each semantic concept. Warning: these files (especially weights.html and LSA.html) can be huge, and can cripple many HTML browsers!
Finally, one .html file is output for each thread (with the same base name as the .thread file) detailing an estimated reconstruction for the thread; each message is shown in complete text along with links to possible (deemed to be related) parent and child topics.
- Testing and results
- Discussion
This was a hard test domain because the language is real and unprocessed (misspellings, for example, are common) and infested with garbage data (such as .sig files at the end of each posting that will throw off frequency counts). Worse, the individual data samples are quite small (a post could be as short as a few words, and is rarely as long as one page).
Beyond that, it’s even hard to find threads that continue long enough to be worth analysis without degenerating into a flame war (Pocklington and Best noted this phenomenon as well).
A strong stemming implementation would also increase the quality of the results, by increasing true collisions between semantically identical words. I tried various naïve implementations of a stemmer but found they didn’t help, and a sophisticated approach would have been too complicated for the scope of this project; I do however throw out several stop words (as well as all words 3 letters or shorter, and all words containing numeric digits) which helped greatly.
To really track the flow of ideas, it would be interesting to treat each message as a conjunction of possibly separate remarks (perhaps, separated by quoted text, so that each unbroken response to a chunk of quoted text is seen individually), instead of lumping all the text of each response into the same document space. However, this would further exacerbate the problem of small data samples and would likely not be helpful in practice.
Overall, this was an interesting project but the problem domain was hard enough that good results are hard to see through the noise resulting from the Usenet data sets.
- References
Berry, Michael W. and Susan T. Dumais. “Using linear algebra for intelligent information retrieval”.
Manning, Christopher, and Schütze, Hinrich. Foundations of Statistical Natural Language Processing.
Pocklington, R., and Michael L. Best. “Cultural Evolution and Units of Selection in Replicating Text”, Journal of Theoretical Biology, v 188, n 1, September 7, 1997, p79-87.