Alex Arkhipov 11/28/2006

21W.784 Final Project Report Beth Coleman

Automated Calculation of Word Correlations from Text

A Story (Significance of Word Correlation)

Imagine that someone has given me an enormous tome with text written in an unfamiliar language. I have no idea what the text is meant to be – a story, a recipe, or maybe an encyclopedia – or who wrote it, when, and why. The language has no recognizable punctuation or capitalization. I can see that the language has words made up of individual letters, and I can find words made of similar sequences of letters, but I can’t tell whether these are different forms of a word (“compute” and “computer”) or completely unrelated words (“desert” and “dessert”). Moreover, I have no idea how tenses, plurals, suffixes, and prefixes work in the language, so decomposing each word into parts is fruitless. So, I treat words as the basic unit, as if the language were made of hieroglyphics rather than letters, and the best plan of attack is to look for the same word appearing multiple times.

Figure 0: Repeated words in an unknown Chinese text

Looking through the text, I notice not only that two words (call them A and B) each appear reasonably often, but that where word A is found, word B tends to appear nearby. Looking at where A and B appear, I find only three other words, C, D, and E, that consistently appear together with A and B but rarely elsewhere – all five words cluster together within the text, and I conclude that they form a conceptual cluster.

As I look through the text at clusters of words, I find that every instance of the words appearing together is marked by word B occurring, but not necessarily any of the other words. I conclude that B is the central hub of the cluster. For example, word B may be “baseball”, and A, C, D, and E may be “bat”, “strike”, “pitcher”, and “run”. Or perhaps the text is a story in which B is a character, and A, C, D, and E are relatives of that character.

I continue through the text the same way. Some words are spread randomly and haphazardly within the text; these may be articles or prepositions that carry no inherent meaning. Other words group into clusters by proximity, some very distinct and some vague. I assume each of the clusters represents a set of closely associated concepts. I draw each word as a point on an enormous sheet of paper and connect related words with a line, so that concept clusters form a tight net of lines with a visible central hub. Doing so for more and more words, I find many concept clusters. Some clusters overlap, while other clusters bridge pairs of distant concepts. Moreover, the concept clusters are themselves grouped into clusters, due to higher-than-average connections between the words within them, and these clusters form even higher-level clusters, and so on.

The nesting is rich and detailed, and using the large amount of text at my disposal, I am able to draw a complex and intricate web that shows all these relationships with great accuracy and detail. Working for many years, I complete Network I. Network I is a web so large that I have to zoom out until the points that indicate individual words are too small to be seen; tightly bound concept clusters look like pinpricks from this view, and clusters of concepts look like tiny globs.

The network gives me a higher-level understanding of the language, even though I don’t know what any word or concept means. Then, working for many more years, I use my prior knowledge to draw a similar association network of the entire English language, called Network II. Looking at the two networks, I realize that they look alike and have similar “texture” – the number of links between clusters, the tightness of the average cluster, and so on, are about the same. Perhaps both networks are one and the same – representations of the human conceptual network, something that is independent of the language in which it is expressed.

Next, I study Network I and Network II in detail and realize that they have similar structures. The shapes of the highest-level central hubs match up, and I can see which clusters map onto each other. Actual words, however, fail to line up in the networks, but this does not matter, because individual words are barely visible. I now take the original text and give each word a vague translation based on which cluster its English counterpart is in. Although I don’t have an exact idea of the words or sentences, I can see what the text is about and the path through which it visits various ideas. Or perhaps I have stumbled on a parallel text caused by my mismatching Networks I and II, giving a text that is plausible, but not what is actually written?

Background Philosophy

While the story above may seem far-fetched (and it is), I believe that it is an effective allegory for what a language is. The ideas are largely influenced by the philosophy on minds and networks presented in the book Gödel, Escher, Bach. The story also parallels the Chinese room argument, in which a man with no knowledge of Chinese can converse in Chinese by following precise coded instructions. This allegorical thought experiment is meant to illustrate that a machine could pass the Turing Test without actually understanding language.

The actual counterpart to my role in the story is an algorithm to process language. For such a program, all language is foreign and devoid of context and meaning, and the only clues are where words appear in relation to each other. I am interested in what information can be algorithmically extracted from a large corpus of text with no human intervention or prior knowledge.

Mining linguistic information and searching for patterns in a large, crystallized body of text is the core of corpus linguistics, a branch of computational linguistics. Such an approach might take a collection of text written in the early nineteenth century and check for trends in the use of dependent clauses (to give a hypothetical example). This requires human intervention in tagging words with parts of speech and pre-programmed heuristics to distinguish dependent and independent clauses. Although computational linguistics was once seen as promising in codifying language through machine-understandable rules, it fell into disfavor as people realized that rules and algorithms can only go so far in understanding language.

When I started the project, I was perhaps also naïve in hoping to automate the creation of a semantic network for a set of words. I planned to use only correlation data, which can be mined from text in a manner similar to that described in the story. The main problem is that correlation gives only vague relational data (A is associated with B), while semantic data is more precise about the relationship (A is part of B, A is a type of B, A is the same as B, A is the opposite of B), and their converses. For example, WordNet, a dictionary that organizes words as a semantic network, primarily uses the relationships A is part of B (called meronymy) and A is a type of B (called hyponymy). However, WordNet was created using human understanding of relationships between words, rather than computer analysis.

The main difficulty is that there is no apparent way to see how two words that appear together are related, unless one searches for linking phrases such as “is a” or “is like a” or “does not” and uses prior knowledge of these phrases to judge the relationship.

Project Outline

Corpus linguistics typically uses human intervention in preprocessing text, such as identifying root words or annotating each word’s part of speech. However, I wanted to see what I could do using only computer algorithms and no prior knowledge. Only then could my methods have broad applicability, such as to the hypothetical language in the story, in which not only the meanings but also the parts of speech of the words are unknown. Because I realized that semantic relationships cannot be directly extracted algorithmically, I limited the scope of my project to mining correlation data, rather than semantic data.

My main premise, apparent in the story, is that words appear together if and only if there is some association between them, even if the semantic connection cannot be determined. Given any two words, an algorithm can estimate their correlation, a numerical estimate of the degree of association between the words. To do so, I compare how often both words appear together to how often they would appear together due to random chance alone, since even unrelated words occasionally appear in close proximity. Then, I isolate clusters of words in a set by finding the correlation of each pair of words and looking for subsets of words that correlate highly with each other. I am limited by computing power to detecting small subsets with a clear relationship, such as distinguishing a set of fruits from a set of tools. However, I believe that my methods can be extended to searching for clusters of words linked by more subtle, conceptual relationships from a pool of randomly chosen words, as opposed to a set pre-made for the experiment.
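The comparison of observed to chance co-occurrence can be sketched in a few lines of code. The sketch below is only an illustration, not the program used for this project: it assumes an already-tokenized corpus and treats two words as "appearing together" whenever they fall within the same fixed-size window of tokens.

```python
from collections import Counter

def correlation(tokens, word_a, word_b, window=10):
    """Estimate word correlation: the rate at which two words are
    observed in the same window of tokens, divided by the rate
    expected if their appearances were independent."""
    n = len(tokens)
    counts = Counter(tokens)
    p_a = counts[word_a] / n  # probability a random token is word_a
    p_b = counts[word_b] / n
    # Observed: fraction of windows containing both words.
    num_windows = n - window + 1
    both = sum(
        1 for i in range(num_windows)
        if word_a in tokens[i:i + window] and word_b in tokens[i:i + window]
    )
    p_both = both / num_windows
    # Expected by chance: probability that each word independently
    # appears at least once among the `window` tokens.
    expected = (1 - (1 - p_a) ** window) * (1 - (1 - p_b) ** window)
    return p_both / expected if expected > 0 else 0.0
```

A result well above 1 suggests the words are associated; a value near 1 suggests they co-occur only as often as chance predicts.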

Search Engine Approach

I originally used Google, the most comprehensive and popular Internet search engine. When one searches for a keyword on Google, in addition to returning relevant web pages, Google lists the number of pages on the Internet that contain this keyword, or the hit count. It is this number that I concentrate on in my calculations. Google allows searches using Boolean operators; searching for two keywords connected by “AND”, for example, gives the set of pages containing both keywords, as well as the number of such web pages. Because Google indexes pages that contain human-readable information, I expect that the pages indexed on Google will be representative of the English language (I only searched English-language pages), so words that denote related concepts tend to appear together on web pages.

The main advantage of using Google to mine the web corpus is the sheer size of the corpus, which provides large amounts of varied text data and thus drastically lowers the randomness inherent in using text as a sample of the English language. Other advantages are Google’s pre-built web page database, massively parallel computing grid, and streamlined search algorithms.

The main drawback was that Google (and search engines in general) gives only limited and skewed data. I will discuss these difficulties in more detail in the results section.

Search Engine Algorithm

(Note: Capital letters represent words. Also, when giving Google search queries, I enclose them in quotes, which are not part of the query.) For any two terms, X and Y, I search for each of X and Y, take the number of results, and divide by the total number of pages indexed by Google, a value that can be obtained using a search query that returns every page, such as “Z OR NOT Z” for any term Z (Figure 1). This division gives p(X) and p(Y), the fractions of web pages containing each of the terms, or equivalently the probabilities that a randomly chosen English-language page contains each term. If the two events are independent, then the probability that a given web page contains both terms equals the product p(X) * p(Y). This is the expected fraction of pages on which both words appear.

Search Query            Index Size
pickle OR –pickle       2.527E+10
the OR –the             2.527E+10
aardvark OR –aardvark   2.527E+10
inurl:http              2.527E+10

Figure 1: Determining the index size (done prior to index size change). The minus signs represent “NOT” operators.

Then, I compute the actual fraction of pages containing both X and Y by dividing the number of hits for “X AND Y” by the size of Google’s index. The actual fraction may differ from the expected fraction because the two events are not independent. If the words X and Y are correlated, then the appearance of X makes the appearance of Y more likely, so the events are not independent but positively related, and p(X AND Y) is greater than p(X) * p(Y). So, the ratio p(X AND Y) / (p(X) * p(Y)) quantifies the correlation of words X and Y. Note that the correlation between X and Y is (or at least should be) scale-free, meaning that it is not biased toward either frequently occurring or rarely occurring words.
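Once the hit counts are in hand, the whole calculation reduces to a one-line formula. The sketch below illustrates it using the “cucumber” and “pickle” hit counts from Figure 2 and the index size from Figure 1.

```python
def search_correlation(hits_x, hits_y, hits_xy, index_size):
    """Correlation = p(X AND Y) / (p(X) * p(Y)), computed from
    search engine hit counts and the total index size."""
    p_x = hits_x / index_size    # fraction of pages containing X
    p_y = hits_y / index_size    # fraction of pages containing Y
    p_xy = hits_xy / index_size  # fraction of pages containing both
    return p_xy / (p_x * p_y)

# Hit counts for "cucumber" and "pickle" from Figure 2:
corr = search_correlation(1.050e7, 1.060e7, 4.720e5, 2.527e10)
print(round(corr, 3))  # about 107, matching Figure 2
```

Note that the index size largely cancels out of the ratio: the correlation equals hits(X AND Y) * index_size / (hits(X) * hits(Y)), so only one factor of the index size actually matters.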

Search Engine Results

A few basic tests of the correlation property gave promising results (Figure 2).

Words                   Hits        Probability (=Hits/Index Size)   Correlation
pickle                  1.060E+07   4.195E-04
cucumber                1.050E+07   4.155E-04
cucumber AND pickle     4.720E+05   1.868E-05                        107.165
katamari                3.240E+06   1.282E-04
damacy                  1.220E+06   4.828E-05
katamari AND damacy     1.200E+06   4.749E-05                        7671.524
aardvark                7.640E+06   3.023E-04
raincoat                2.580E+06   1.021E-04
aardvark AND raincoat   8.210E+02   3.249E-08                        1.053
rock                    5.780E+07   2.287E-03
align                   8.510E+07   3.368E-03
rock AND align          5.090E+06   2.014E-04                        26.150
capacitor               1.410E+07   5.580E-04
granary                 1.680E+06   6.648E-05
capacitor AND granary   7.470E+02   2.956E-08                        0.797
apple                   5.640E+08   2.232E-02
banana                  6.010E+07   2.378E-03
apple AND banana        6.530E+06   2.584E-04                        4.868

Figure 2: Selected pairs of words with computed correlation

The clearly related words “cucumber” and “pickle” gave a correlation of 107, which means that the words appeared together about 100 times more often than expected by chance. In contrast, the random terms “aardvark” and “raincoat” have a correlation of 1.05, or nearly one, meaning that they appear together only with chance probability. The terms “katamari” and “damacy”, Japanese words that form the name of a popular video game, give an enormous correlation of 7671, which is expected since the words rarely appear outside the game’s name on English-language pages. In contrast, the highly dissimilar terms “capacitor” and “granary” have a correlation of 0.797, meaning that the appearance of one word actually makes the other about 20% less likely to appear. Surprisingly, the correlation between the words “apple” and “computer” is significantly greater than that between “apple” and “banana”. However, further testing found unreasonably high correlations between unrelated words: 14.31 between “chartreuse” and “rudder” and 26.150 between “rock” and “align”.