Text Classification and Multilinguism:
Getting at Words via N-grams of Characters
Ismaïl Biskri & Sylvain Delisle
Université du Québec à Trois Rivières
Département de mathématiques et d’informatique
Trois-Rivières, Québec, Canada, G9A 5H7
ABSTRACT
Genuine numerical multilingual text classification is almost impossible if only words are treated as the privileged unit of information. Although text tokenization (in which words are considered as tokens) is relatively easy in English or French, it is much more difficult for other languages such as German or Arabic. Moreover, stemming, typically used to normalize and reduce the size of the lexicon, constitutes another challenge. The notion of N-grams of words (i.e. sequences of N words, with N typically equal to 2, 3 or 4), which for the last ten years seems to have produced good results both in language identification and speech analysis, has recently become a privileged research axis in several areas of knowledge extraction from text. In this paper, we present text classification software based on N-grams of characters (not words), evaluate its results on documents containing text written in English and French, and compare these results with those obtained from a different classification tool based exclusively on the processing of words. An interesting feature of our software is that it does not need to perform any language-specific processing and is thus appropriate for multilingual text classification.
Keywords: Numerical text classification, N-grams, (multilingual) natural language processing, knowledge extraction, text databases.
1. TOKENS, WORDS AND CHARACTERS
When processing a large corpus with a statistical tool (see, amongst others, [2] and [16]), the first phase typically consists of subdividing the text into information units called tokens. These tokens usually correspond to words, at least for the most part: “non-word” tokens could be pictures, numbers, special characters or symbols. This tokenization process may appear to be quite simple, not to say trivial (tokenization, morphological analysis, and lexicons are discussed in [17] in the context of corpus-based processing). However, from an automated processing point of view, the implementation of this process constitutes a challenge. Indeed, how can words be reliably recognized? What are the unambiguous formal surface markers that delineate words, i.e. their boundaries?
These questions are relatively easy to answer for languages such as French or English: basically, any string of characters delimited by a beginning space and an ending space is a simple word. But for many other languages, such as German or Arabic, the answer is much more complicated. For instance, the long German token lebensversicherungsgesellschaftsangestellter is a term corresponding to “life insurance company employee”: a single compound noun, not a simple word. In Arabic, subject pronouns and complements are sometimes attached (concatenated) to the verb, so that a token like kathabthouhou corresponds in fact to a whole sentence (here, “I wrote it” or “I’ve written it”). Obviously, the simple notion of tokens defined as strings of characters separated by spaces is an oversimplification that is highly inadequate for many situations and languages (see [13]).
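To make the problem concrete, the following minimal sketch (hypothetical code, not taken from any of the systems cited here) shows whitespace tokenization in Python: it behaves reasonably for English or French, but returns the German compound above as a single opaque token.

```python
# A naive whitespace tokenizer: every string delimited by spaces is one token.
def naive_tokenize(text: str) -> list[str]:
    return text.split()

print(naive_tokenize("the life insurance company employee"))
# ['the', 'life', 'insurance', 'company', 'employee']  -> five separate tokens
print(naive_tokenize("lebensversicherungsgesellschaftsangestellter"))
# ['lebensversicherungsgesellschaftsangestellter']     -> one opaque token
```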
Considering the above, what then could constitute a reasonable atomic unit of information for the segmentation of text, independently of the specific language it is written in? Balpe et al. [1] suggest that this unit should be defined according to the goal we set ourselves when reading or processing a text. More precisely, from a numerical classification-based knowledge extraction viewpoint, the definition of the basic unit of information to be considered depends on the following:
- The unit of information must be a portion of the input text submitted to the numerical analysis processor (a numerical classifier as far as we are concerned in this paper);
- From an automated processing point of view, it should be easy to recognize these units of information;
- The definition of the unit of information should be independent of the specific language the text is written in;
- This definition should allow the numerical analysis processor to handle a large variety of languages, possibly with minor adjustments to the processor;
- The units of information must be statistically meaningful when evaluated or compared between themselves. It should be easy to compute their frequencies in various parts of the input text, as well as to estimate their distribution and the regularity with which some units cooccur in certain portions (segments) of the text.
A consequence of the above requirements is that the chosen unit of information could result in the text being separated into units that do not necessarily correspond to meaningful linguistic elements. Taken individually, some units may even appear to be completely out of context. From a classification perspective, this is not a hindrance. However, this can lead to a problematic situation when results are to be presented to the user, especially when her interpretation depends on how she perceives and understands the units appearing in the results.
Most statistical text analyzers based on cooccurrence frequency computations use the word as the basic unit of information, even if such a choice does not meet all requirements listed above. The usual justification for this choice is that words are easy to recognize and understand for humans, even in situations where text written in various languages must be analyzed. With the continuous development of the Web and the availability of texts in a wide variety of languages, the need for processors able to handle inputs containing multilingual texts is now a priority. Thus, models for text (or corpus) analysis, be they numerical or linguistic, must take into account the multilingual nature of texts, especially those collected from the Web. This is our main concern in this paper.
2. N-GRAMS OF CHARACTERS
The notion of N-grams of characters (or phones and syllables, see [11]) has been in use for many years, mainly in the field of speech processing. Fairly recently, this notion has attracted even more interest in other fields of natural language processing, as illustrated by the work of Greffenstette [9] on language identification and that of Damashek [7] on the processing of written text. Amongst other things, these researchers have shown that the use of N-grams instead of words as the basic unit of information does not lead to information loss. Examples of recent applications of N-grams include the work of Mayfield & McNamee [14] on indexation, the work of Halleb & Lelu [10] on automated multilingual hypertextualization (in which they construct hypertextual navigational interfaces based on a language-independent text classification method), and the work of Lelu et al. [12] on multidimensional exploratory analysis oriented towards information retrieval in texts.
Now, what exactly is an N-gram of characters? Quite simply, we define an N-gram of characters as a sequence of N characters. Sequences of two characters (N=2) are called bigrams, sequences of three characters (N=3) are called trigrams, and sequences of four characters (N=4) are called quadrigrams. Notice that the definition of N-grams of characters does not explicitly or implicitly require the specification of a separator, as is necessary for words. Consequently, analyzing a text in terms of N-grams of characters, whatever the value of N might be, constitutes a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator. Clearly, this is a significant advantage over the problematic notion of what a word is.
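As an illustration of this definition, the following sketch (ours, not taken from GRAMEXCO) enumerates the character N-grams of a string with a simple sliding window of width N; no separator of any kind is required.

```python
# All overlapping sequences of n characters in the input string.
def char_ngrams(text: str, n: int) -> list[str]:
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("school", 3))   # ['sch', 'cho', 'hoo', 'ool']
print(char_ngrams("école", 2))    # ['éc', 'co', 'ol', 'le']
```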
The use of N-grams of characters instead of words offers another important advantage: it provides a means by which to control the size of the lexicon used by the processor. Until recently, the size of the lexicon has been a controversial issue, often considered an intrinsic limit of processing techniques based on the comparison of character strings. Indeed, splitting up a text into words normally implies that the larger the text, the larger the lexicon. This constraint persists even if special processing is performed for function words and hapaxes, and even if morphological and lexical analysis is performed on words.
However, an interesting property of N-grams of characters is the following: the lexicon obtained from the analysis of a text in terms of N-grams of characters cannot grow larger than the size of the alphabet to the power of N. Thus, analyzing a text in terms of quadrigrams (N=4) for a language with an alphabet of 26 characters would produce a lexicon containing a maximum of 456 976 (i.e. 26⁴) potential quadrigrams. If “non-sense” combinations (e.g. AAAA, ABBB, BBBA, etc.) are eliminated, this figure is drastically reduced. For instance, Lelu et al. [12] managed to reduce the size of the lexicon to 13 087 quadrigrams for a text containing 173 000 characters.
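To make the bound concrete, here is a small illustration (the 456 976 figure comes from the text above; the sample string and function names are ours): the theoretical lexicon size is the alphabet size raised to the power N, whereas the lexicon actually observed in a given text is only the set of distinct N-grams occurring in it, which is typically far smaller.

```python
# Upper bound on the N-gram lexicon: |alphabet| ** N.
def theoretical_bound(alphabet_size: int, n: int) -> int:
    return alphabet_size ** n

# Lexicon actually observed in a text: the set of distinct N-grams.
def observed_lexicon(text: str, n: int) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(theoretical_bound(26, 4))                 # 456976 potential quadrigrams
sample = "introduire l'informatique à l'école"
print(len(observed_lexicon(sample, 4)))         # far fewer actually occur
```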
If N-grams of characters are used as the basic unit of information, instead of words, there is simply no need for morphological and lexical analysis. Not only can these types of processing be computationally demanding but, above all, they are specific to each individual language. Thus, when using the word as the basic unit of information, language-specific processors must be developed for every language of interest. This is a potentially very costly constraint, both in terms of development time and in terms of required expertise for each language, not to mention the problems that texts written in unforeseen languages might cause at processing time.
But even when lexical analyzers are available, many of them have trouble correctly handling words and their derivations. For instance, the French words informatisation, informatique and informatiser all refer to the concept of informatique (informatics). So if, in a given corpus, we have these segments, which have similar informational contents: “l’informatisation de l’école”, “informatiser l’école” and “introduire l’informatique à l’école”, many word-based processors will not be able to reliably detect such similarities. However, the analysis of the above three short segments in terms of trigrams (N=3) of characters is sufficient to classify these segments in the same class. Indeed, not only does ‘école’ (school) appear in all three, but the trigrams inf, nfo, for, orm, rma, mat and ati allow the computation of a similarity measure supporting the conclusion that informatique is the common topic in these three segments. Of course, since these same trigrams also appear in information and informationnel, this could appropriately be considered as noise unless a higher-level interpretation is invoked, such as informatique being a subfield of information.
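The following sketch shows why trigram profiles bring the three segments together; the Dice coefficient used here is our own choice of overlap measure (a hedged illustration, not the similarity function of any particular classifier discussed in this paper).

```python
# Set of lowercase character trigrams of a segment.
def trigrams(text: str) -> set[str]:
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Dice coefficient: 2|A & B| / (|A| + |B|), from 0 (disjoint) to 1 (identical).
def dice(a: set[str], b: set[str]) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

s1 = "l'informatisation de l'école"
s2 = "informatiser l'école"
s3 = "introduire l'informatique à l'école"

print(sorted(trigrams(s1) & trigrams(s2) & trigrams(s3)))  # shared trigrams, incl. inf, nfo, for, ...
print(round(dice(trigrams(s1), trigrams(s2)), 2))
print(round(dice(trigrams(s1), trigrams(s3)), 2))
```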
3. The GRAMEXCO SOFTWARE
Our software, called GRAMEXCO, has been developed for the numerical classification of large documents in order to extract knowledge from them. Classification is performed with an ART neural network such as the one used in [4]. The basic information unit is the N-gram of characters, N being a parameter. A primary design goal of GRAMEXCO was to offer a standard flow of processing, regardless of the specific language of the input documents. Another important design feature is that GRAMEXCO is semi-automatic, allowing the user to set certain parameters on the fly, according to her own subjective goals or her interpretation of the results produced by the software.
Figure 1: Main interface of GRAMEXCO (when dealing with segments and classes)
Starting from the input text, a simple ASCII text file, three main phases follow in which the user may get involved, as necessary:
1) The list of N-grams is constructed (with N determined by the user) and the text is partitioned into segments. These operations are performed simultaneously, producing a matrix in which N-gram frequencies are computed for every segment (see the sketch after this list).
2) An ART neural network computes similarities between the segments produced in the previous step, based on their cooccurring N-grams. Similar segments, according to a certain similarity function, are grouped together. This is the result of GRAMEXCO’s classification process.
3) At this stage, N-grams have served their purpose: they have helped produce the classification of text segments. Now that we have the classes of segments, we can get at the words they contain. The words a class of segments contains are referred to as its lexicon. The user can now apply several operations (e.g. union, intersection, difference, frequency threshold) on the segments’ words in order to determine, for instance, a common theme, assuming she understands the language. Results interpretation is up to the user. Figure 1 shows the main user interface of GRAMEXCO corresponding to this third phase.
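The sketch below (plain Python, no external libraries) illustrates the data flow of phases 1 and 2: segments are turned into N-gram frequency vectors, and similar vectors are grouped into classes. The grouping shown here is a deliberately crude stand-in, a cosine similarity against a fixed threshold, whereas GRAMEXCO itself relies on an ART neural network for this step; the function and parameter names are ours.

```python
from collections import Counter
from math import sqrt

# Phase 1: frequency vector of character N-grams for one segment.
def ngram_counts(segment: str, n: int = 4) -> Counter:
    return Counter(segment[i:i + n] for i in range(len(segment) - n + 1))

# Cosine similarity between two sparse frequency vectors.
def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[g] * v[g] for g in u if g in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Phase 2 (stand-in): attach each segment to an existing class if its vector
# is close enough to that class's first member; otherwise open a new class.
def classify(segments: list[str], n: int = 4, threshold: float = 0.3) -> list[list[int]]:
    vectors = [ngram_counts(s, n) for s in segments]
    classes: list[list[int]] = []
    for i, v in enumerate(vectors):
        for cls in classes:
            if cosine(v, vectors[cls[0]]) >= threshold:
                cls.append(i)
                break
        else:
            classes.append([i])
    return classes
```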
Depending on the parameters set by the user, and her choices during the three phases above, the results produced by GRAMEXCO can help identify similar classes of text segments and their main theme. The results can also help determine the usages and meanings of specific words. These are important tasks in knowledge extraction systems. GRAMEXCO could also be useful for the classification of Web search results in order to help users reformulate their queries [8], a possibility we are about to consider. Given the multilingual nature of the Web, our software offers significant potential that tools based exclusively on the processing of words, instead of N-grams of characters, do not.
4. EXPERIMENTS AND EVALUATION
We have conducted several experiments in which we tested the potential contribution of N-grams with regard to the tokenization of texts (in previous work, we also worked on the combination of statistical and linguistic processing, see [5]). Using our GRAMEXCO software, we show in [6] and [3] that classifications based on N-grams are qualitatively at least as good as those obtained from stemming. Since N-grams are quite simply defined as sequences of characters, independently of any language-specific considerations, they constitute a very interesting basis for a language-independent text classifier. Even more interesting is the possibility of handling multilingual texts, i.e. texts containing sentences written in several different languages.
As a matter of fact, the results we present below have been obtained from a 100-page corpus containing paragraphs (in no particular order) written in French and in English. The scale of such an evaluation is important since we are currently working on the use of GRAMEXCO (or eventual successors of this software) as a tool to assist information search on the Web, to perform email classification, or even to perform news classification from newswires such as Reuters or AFP. With that in mind, we constructed a 100-page corpus from a random selection of English and French newspaper articles on various subjects. This corpus was submitted to our classifier in order to obtain classes of articles on the same topic and, also, to help the user, normally an expert in her own domain, to identify and study the themes of these classes of articles, thanks to the lexicon automatically associated with each of these classes.
The two main parameters we used for this experiment are the following. First, we used quadrigrams (N-grams of size 4), taking into account various practical factors. Second, we discarded N-grams containing a space and those having a frequency of one: this is done in order to minimize the size of the vectors submitted as input to the classifier and, thus, to reduce the work it has to perform. Interestingly, the removal of these N-grams has no significant impact on the quality of the results produced by the classifier.
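A hedged sketch of this filtering step, reusing the Counter-based frequency vectors of the previous sketch (the function name is ours): quadrigrams containing a space and quadrigrams occurring only once are dropped before the vectors are handed to the classifier.

```python
from collections import Counter

# Keep only N-grams with no space and a frequency greater than one.
def filter_ngrams(counts: Counter) -> Counter:
    return Counter({g: f for g, f in counts.items() if " " not in g and f > 1})
```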
The first noticeable result obtained from the classifier is the perfect separation of English and French articles. That is to say, quadrigrams are able to clearly distinguish English articles from those written in French. Qualitative analysis of the classes also allows us to observe that articles belonging to the same class are either similar or share a common theme. We now provide an interesting sample of the actual results produced by GRAMEXCO.
CLASS 101 (articles 238 and 245)
These articles are written in English and their common lexicon consists of {california, film, great, line, reports, reuters, time, week}. The shared theme of these articles appears to be related to Reuters reports on movies and Hollywood. An analysis of the full articles allowed us to confirm this interpretation.
CLASS 124 (articles 31, 32 and 36)
These articles are written in French and their common lexicon consists of {intellectuels, révolutionnaires, russie}. This lexicon allows us to suggest that the shared theme of these articles is that of Russian intellectuals and their position with regard to the Russian (Bolshevik?) revolution.
CLASS 58 (articles 144 and 145)
These articles are written in French and their common lexicon consists of {acte, avocat, avoir, désignation, enquête, factures, fausses, fisc, instruction, juge, réquisitoire}. This lexicon allows us to understand that these articles discuss a legal affair having to do with fausses factures (fake invoices). The analysis of the lexicon also allows us to establish that the French word avocat is used in its ‘lawyer’ sense and not in its ‘avocado’ sense.
CLASS 61 (articles 151 and 152)
These articles are written in French and their common lexicon is {accéléré, aidé, américain, black, edwin, hitler, holocauste, ibm, identifier, journaliste, juifs, nazi, régime}. These articles talk about an investigation, led by the American journalist Edwin Black, on the support offered by IBM to the Nazi regime for the identification of Jews during the Second World War. The lexicon provides all the relevant clues to draw such a conclusion. The analysis of the lexicon also allows us to establish that the French word régime is used in its ‘regime’ sense and not in its ‘diet’ sense.
CLASS 74 (articles 178 and 179)
These articles are written in English and their common lexicon is {camps, case, chatilla, complaints, filed, israeli, palestinian, role, sabra, sharon, survivors, years}. This lexicon allows us to understand that these articles discuss the Israeli-Palestinian conflict, specifically the Palestinians’ complaints about the Sabra and Chatilla massacre. The analysis of the lexicon allows us to establish that the word ‘complaints’ is not used here in its illness-related sense but in its accusation, grievance, or dissatisfaction sense.
CLASS 87 (articles 206 and 207)
These articles are in English and their common lexicon is {dock, docking, launch, obstruction, officials, russian, ship, shuttle, space, station}. This lexicon supports the analysis that these articles are about space exploration. The lexicon also supports the claim that the English word ‘space’ here refers to outer space and not to an interval between two letters or two words.
CLASS 93 (articles 221 and 222)
These articles are in English and their common lexicon is {california, marijuana, medical, science, university}. This lexicon leads us to think that these articles deal with the medical value of marijuana.