Automatic Diacritization of Arabic Text
Using the Viterbi Algorithm
Moustafa Elshafei, Husni Al-Muhtaseb, and Mansour Alghamdi
Abstract:
We formulate the problem of generating diacritized Arabic text from unvowelled text as a Hidden Markov Model (HMM). In this model, the word sequence of unvowelled Arabic text is treated as an observation sequence from an HMM whose hidden states are the possible diacritized expressions of the words. The optimal sequence of diacritized words (states) is then obtained efficiently using the Viterbi algorithm. We present the basic algorithm and its evaluation, and discuss various ramifications for improving its performance.
Key words: Arabization, Tashkeel, diacritization, vowel generation, Arabic linguistics.
1. Introduction
One of the problems facing computer processing of Arabic text is the absence of diacritical marks in modern printed text. Native Arabic readers can identify the proper vocalization of the text, but a computer must be provided with algorithms that mimic this human ability. Such a tool is an essential piece of infrastructure for applications such as text-to-speech and automatic translation.
The problem of automatic generation of Arabic diacritic marks, Al Tashkeel Al-Aly, is known in the literature under various names, e.g., automatic vocalization, vowelization, diacritization, accent restoration, and vowel restoration. In this paper the term diacritization is used, because the missing symbols represent not only the vowels but also gemination (shaddah) and the absence of a vowel (sukoon).
RDI of Egypt has developed a system called ArabDiac 2.0 based on their morphological analyzer, together with syntactic rules and statistical information. The system is claimed to achieve 95% accuracy at the word level [4]; however, the details of their technical approach have not been published. Other systems include Al-Alamia of Kuwait [9] and CIMOS of France [10]. The formal approach to this problem involves a complex integration of Arabic morphological, syntactic, and semantic rules [8, 12, 13]. A morphological rule matches the undiacritized word to known patterns or templates and recognizes prefixes and suffixes. Syntax analysis applies specific syntactic rules (STG), usually to determine the final (case-ending) diacritical marks. Semantics helps to resolve ambiguous cases and to filter out hypotheses.
The approach presented here falls under the general class of statistical pattern-recognition methods and has been applied successfully in the speech recognition field. The system is first trained on a corpus of fully diacritized text, preferably covering a specific domain. It generates a word vocabulary list and determines the frequency of each word; it also generates word bigrams and trigrams, and performs preprocessing and post-processing steps to deal with exceptions and to overcome some known limitations. The overall system achieves about 2% error in letter-level diacritization restoration.
The Arabic writing system possesses 36 letters, which represent the consonants. They are: ا , آ , أ , إ , ئ , ؤ , ء , ى , ة , ب , ت , ث , ج , ح , خ , د , ذ , ر , ز , س , ش , ص , ض , ط , ظ , ع , غ , ف , ق , ك , ل , م , ن , هـ , و and ي. Each Arabic letter represents a consonant, with some exceptions: أ , إ , ئ , ؤ and ء all represent the glottal stop, written in different forms depending on its position in the word and its adjacent phonemes. ا symbolizes either the glottal stop or the long low vowel, depending on its position in the word and sentence. و and ي are long vowels when preceded by the short vowel of the same quality, i.e., dhammah and kasrah respectively, and are consonants otherwise. In addition to the consonant symbols, a diacritic may be placed above or below a letter (see Table 1). Almost all Arabic texts are written with the consonant symbols only, i.e., the letters without the vowel symbols. Thus a word such as “علم”, when diacritized, can be: “عَلَم” (flag), “عِلْم” (science), “عُلِمَ” (it was known), “عَلِمَ” (he knew), “عَلَّمَ” (he taught) or “عُلِّمَ” (he was taught). An Arabic reader supplies the appropriate diacritics based on linguistic knowledge and context. For a text-to-speech or automatic translation system, however, the Arabic letters have to be diacritized; otherwise the system cannot know which word to select.
Diacritic / IPA / Definition / Sample
َ / a / fathah (low short vowel) / بَ
ُ / u / dhammah (high back rounded vowel) / بُ
ِ / i / kasrah (high front vowel) / بِ
ّ / : / shaddah (gemination: the consonant is doubled in duration) / بّ
ْ / Ø / sukoon (the letter is neither vowelized nor geminated) / بْ
Table 1. Arabic diacritics with their International Phonetic Alphabet (IPA) representations, definitions, and samples with the bilabial letter ب. The first three diacritics represent the Arabic short vowels.
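For concreteness in the implementation sketches given later, the five diacritics of Table 1 correspond to the following standard Unicode code points (Python notation; the sketch itself is illustrative and not part of the original system):

# Unicode code points of the Arabic diacritics of Table 1.
ARABIC_DIACRITICS = {
    "\u064E": "fathah (a)",    # low short vowel
    "\u064F": "dhammah (u)",   # high back rounded vowel
    "\u0650": "kasrah (i)",    # high front vowel
    "\u0651": "shaddah",       # gemination: consonant doubled in duration
    "\u0652": "sukoon",        # neither vowelized nor geminated
}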
In Section 2 we present the formulation of the problem and outline the basic algorithm. In Section 3 we describe the training set and its processing, and in Section 4 we present a detailed evaluation of the results and the modifications needed to eliminate certain classes of restoration errors.
2. Problem Formulation
We assume we have a large training set of Arabic text $T_d$ with full diacritical marks, and its corresponding unvowelized text $T_u$. We then generate a word list, $V_d = \{d_1, d_2, \ldots\}$, of the unique fully vowelized words in $T_d$. We also generate a table, $F_d$, of the frequency of occurrence of each word in $V_d$, such that $F_d(d_i)$ is the number of occurrences of $d_i$ in the training text $T_d$. Similarly, we construct the list $V_u = \{u_1, u_2, \ldots\}$ of all unvowelized words in $T_u$. Let $\Phi$ be the mapping from $V_d$ to $V_u$. For each word $u_i \in V_u$ we define a subset $S_i \subseteq V_d$ corresponding to all the vowelized words that are mapped to $u_i$, i.e. $S_i = \{d \in V_d : \Phi(d) = u_i\}$.
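The construction of $V_d$, $F_d$, and the subsets $S_i$ can be sketched as follows. The paper's own implementation is in Matlab; this Python sketch is purely illustrative, with strip_diacritics standing in for the mapping $\Phi$:

from collections import Counter, defaultdict

DIACRITICS = set("\u064E\u064F\u0650\u0651\u0652")  # fathah, dhammah, kasrah, shaddah, sukoon

def strip_diacritics(word):
    # The mapping Phi: remove all diacritical marks, leaving the base word.
    return "".join(ch for ch in word if ch not in DIACRITICS)

def build_vocabularies(diacritized_words):
    # F_d: frequency of each unique diacritized word in the training text.
    freq = Counter(diacritized_words)
    # S: for each undiacritized base u, the set of diacritized words mapped to it.
    variants = defaultdict(set)
    for d in freq:
        variants[strip_diacritics(d)].add(d)
    return freq, variants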
Now, given a word sequence (without diacritical marks)

$$W = w_1\, w_2 \cdots w_M, \qquad (1)$$
where we assume that $w_t \in V_u$ for $t = 1, \ldots, M$, that is to say, all the words in (1) exist in $V_u$, we wish to determine the most probable diacritized word sequence:
$$D = d_1\, d_2 \cdots d_M, \qquad (2)$$
where $d_t \in S_j$ for some $j$ such that $u_j = w_t$.
The word sequence $D$ may be chosen to maximize the posterior probability $P(D \mid W)$, i.e.

$$D^* = \arg\max_{D} P(D \mid W). \qquad (3)$$
Assuming an average of $K_u$ diacritized variants for each undiacritized word in $V_u$, the number of paths in an exhaustive search would be $K_u^M$; for example, with $K_u = 3$ and a 10-word sentence this is already $3^{10} \approx 59{,}000$ paths. It is therefore imperative to explore techniques for performing this search efficiently.
By the chain rule, the probability of the diacritized word sequence can be written as

$$P(D) = \prod_{t=1}^{M} P(d_t \mid d_1, d_2, \ldots, d_{t-1}). \qquad (4)$$
It is normally assumed in language modeling that the sequence of words obeys a Markovian assumption, that is, at any given $t$, the word $d_t$ depends only on the previous $n-1$ words. In this case Equation (4) can be simplified as follows:

$$P(D) = \prod_{t=1}^{M} P(d_t \mid d_{t-n+1}, \ldots, d_{t-1}). \qquad (5)$$
In trigram language modeling, each word is assumed to depend only on its previous two words, in a second-order Markov chain. In this case Equation (5) becomes

$$P(D) = \prod_{t=1}^{M} P(d_t \mid d_{t-2}, d_{t-1}). \qquad (6)$$
Similarly, in bigram language modeling, each word is assumed to depend only on its previous word, in a first-order Markov chain, i.e.

$$P(D) = \prod_{t=1}^{M} P(d_t \mid d_{t-1}). \qquad (7)$$
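Under the bigram model (7), scoring a candidate sequence reduces to accumulating adjacent-pair conditionals; in practice this is done in log space to avoid numerical underflow. A minimal sketch, where unigram_prob and bigram_prob are hypothetical lookup functions for $P(d_1)$ and $P(d_t \mid d_{t-1})$:

import math

def sequence_log_prob(d_sequence, unigram_prob, bigram_prob):
    # log P(D) under Eq. (7): log P(d_1) + sum over t of log P(d_t | d_{t-1}).
    logp = math.log(unigram_prob(d_sequence[0]))
    for prev, cur in zip(d_sequence, d_sequence[1:]):
        logp += math.log(bigram_prob(prev, cur))
    return logp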
The search for the best sequence of vowelized words maximizing (3) can be illustrated with the help of Figure 1. The trellis shown can be viewed as a Hidden Markov Model (HMM), where the observation sequence is the undiacritized word sequence $W$, while the possible vowelized words of each word $w_t$ represent the hidden states. The problem is then to find the best state sequence given the observation $W$. This problem is usually solved using the Viterbi algorithm (VA), which can be seen as an application of dynamic programming to finding a maximum-probability path in a graph with weighted arcs. The computation proceeds column-wise: at every input word index $t$, the scores of the nodes in a column are updated by means of recursion formulas involving the probabilities of the best paths ending at the previous column and the transition probabilities between words.
Figure 1. HMM states corresponding to the word sequence W.
Let $j_t$ denote the index of the word $w_t$ in $V_u$ (i.e. $w_t = u_{j_t}$), let $K_t$ be the cardinality of the subset $S_{j_t}$, and let $d_i^{(t)}$ denote the $i$th element of $S_{j_t}$. Let us define $\delta_t(i)$ to be the probability of the most likely partial state sequence (path) up to time $t$ that ends at the $i$th state, i.e. at the diacritized word $d_i^{(t)}$ corresponding to $w_t$.
The algorithm proceeds in the following steps:
Step 1: Initialization
$$\delta_1(i) = P\!\left(d_i^{(1)} \mid w_1\right), \qquad 1 \le i \le K_1. \qquad (8)$$
Step 2: Induction
For $t = 2, 3, \ldots, M$ and $1 \le i \le K_t$:
$$\delta_t(i) = \max_{1 \le j \le K_{t-1}} \left[ \delta_{t-1}(j)\, P\!\left(d_i^{(t)} \mid d_j^{(t-1)}\right) \right] P\!\left(d_i^{(t)} \mid w_t\right), \qquad (9)$$
$$\psi_t(i) = \arg\max_{1 \le j \le K_{t-1}} \left[ \delta_{t-1}(j)\, P\!\left(d_i^{(t)} \mid d_j^{(t-1)}\right) \right]. \qquad (10)$$
Step 3: Best Path
$$P^* = \max_{1 \le i \le K_M} \delta_M(i), \qquad q_M^* = \arg\max_{1 \le i \le K_M} \delta_M(i). \qquad (11)$$
Step 4: Backtracking
$$q_t^* = \psi_{t+1}\!\left(q_{t+1}^*\right), \qquad t = M-1, M-2, \ldots, 1. \qquad (12)$$
By keeping track of the state $j$ giving the maximum value in the recursion formulas (9) and (10), it is possible, at the end of the input sequence, to retrieve the states visited by the best path. The Viterbi algorithm has time complexity $O(M \bar{K}^2)$, where $M$ is the length of the input sentence and $\bar{K}$ is the average number of states corresponding to the words in the input sequence.
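The four steps can be sketched compactly as follows, under the bigram model. Here variants maps each undiacritized word to its candidate set $S_{j_t}$ (as in the earlier sketch), and init and trans are hypothetical scoring functions: init(d) implements (8), and trans(p, d) returns the per-step score of moving from state p to state d, e.g. $P(d \mid p)\,P(d)$ per the simplified recursion (15) derived below:

def viterbi(sentence, variants, init, trans):
    # sentence: list of undiacritized words w_1..w_M.
    # variants[w]: candidate diacritized forms of w (the states S_{j_t}).
    # init(d): initial score (Eq. 8); trans(p, d): transition score (Eqs. 9, 15).
    states = [list(variants[w]) for w in sentence]
    delta = {d: init(d) for d in states[0]}        # Step 1: initialization
    backptr = []
    for column in states[1:]:                      # Step 2: induction (Eqs. 9-10)
        new_delta, psi = {}, {}
        for d in column:
            best = max(delta, key=lambda p: delta[p] * trans(p, d))
            new_delta[d] = delta[best] * trans(best, d)
            psi[d] = best
        delta = new_delta
        backptr.append(psi)
    best_last = max(delta, key=delta.get)          # Step 3: best final state (Eq. 11)
    path = [best_last]                             # Step 4: backtracking (Eq. 12)
    for psi in reversed(backptr):
        path.append(psi[path[-1]])
    return list(reversed(path))

In practice the products would be replaced by sums of log probabilities, and unseen bigrams require smoothing or a backoff to the unigram probabilities.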
First we notice that the conditional probability $P(d_i^{(t)} \mid w_t)$ appearing in (9) can be written as:

$$P\!\left(d_i^{(t)} \mid w_t\right) = \frac{P\!\left(d_i^{(t)}, w_t\right)}{P(w_t)}. \qquad (13)$$
Since each vowelized word is mapped to exactly one unvowelized base, this implies that $P(d_i^{(t)}, w_t) = P(d_i^{(t)})$. We can then simplify Equation (13) further as follows:

$$P\!\left(d_i^{(t)} \mid w_t\right) = \frac{P\!\left(d_i^{(t)}\right)}{P(w_t)}. \qquad (14)$$
Moreover, it is clear that the denominator of (14) is the same for every state in column $t$ and is therefore not part of the maximization in (9). This leads to a further simplification of the recursion (9) as follows:

$$\delta_t(i) = \max_{1 \le j \le K_{t-1}} \left[ \delta_{t-1}(j)\, P\!\left(d_i^{(t)} \mid d_j^{(t-1)}\right) \right] P\!\left(d_i^{(t)}\right). \qquad (15)$$
In other words, the recursions (9) and (10) can be evaluated using the probabilities of the vowelized words only.
All the probabilities needed for computing the Viterbi recursion can be obtained from the statistics of the training set and stored for on-line Viterbi calculations. These probabilities are basically the unigrams and bigrams derived from the frequencies of occurrence of individual words and of word pairs, as follows:
1- $P(d_i) = \dfrac{C(d_i)}{\sum_k C(d_k)}$, (16)
where $C$ stands for the count of occurrences in the training text.
2- $P(d_i \mid d_j) = \dfrac{C(d_j\, d_i)}{C(d_j)}$, where $C(d_j\, d_i)$ is the count of the word pair $(d_j, d_i)$. (17)
3- $P(u_i) = \dfrac{C(u_i)}{\sum_k C(u_k)}$. (18)
4- $P(d_i \mid u_j) = \dfrac{C(d_i)}{C(u_j)}$, for $d_i \in S_j$. (19)
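These estimates translate directly into code. The following sketch derives the unigram and bigram lookups of (16) and (17) from raw counts; the flooring of unseen bigrams at $1/N$ is our assumption, as the paper does not specify a smoothing scheme:

def make_estimators(unigram_counts, bigram_counts):
    # unigram_counts: dict d -> C(d); bigram_counts: dict (d_prev, d) -> C(d_prev d).
    total = sum(unigram_counts.values())

    def unigram_prob(d):
        # Eq. (16): P(d) = C(d) / total word count.
        return unigram_counts.get(d, 0) / total

    def bigram_prob(d_prev, d):
        # Eq. (17): P(d | d_prev) = C(d_prev d) / C(d_prev).
        c = bigram_counts.get((d_prev, d), 0)
        if c == 0:
            return 1.0 / total   # floor for unseen bigrams (our assumption)
        return c / unigram_counts[d_prev]

    return unigram_prob, bigram_prob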
3. The Training Database
The system should be trained on specific domains of knowledge, e.g. sports, weather, local news, international news, business, economics, religion, etc. For testing purposes we started with a fully diacritized transcript of The Holy Quran (HQ). The HQ list file consists of 78,679 words and 607,849 characters, excluding spaces.
The text is converted to a word list after removing numbers and special symbols. A word is defined here as any sequence of letters and diacritical marks delimited by a space character or a punctuation mark. A single special symbol is inserted in place of punctuation marks to indicate a sentence boundary; this symbol is needed to collect statistics on sentence-initial and sentence-final words. The Arabic letter extension (tatweel) character is also removed. A Matlab procedure was built to accept a Unicode text file and generate the list file. Next, a table is generated consisting of the vocabulary of the list file and the number of occurrences of each entry. Two words A and B are considered identical in the regular sense, $R(A,B) = 1$, if they consist of exactly the same sequence of letters and diacritical marks.
One problem with defining similarity between words in this way is that the sukoon diacritical mark does not appear consistently in the text. A metric is therefore designed so that two words are still considered identical even if one of them is missing one or more sukoons. Define the mapping that strips a word of its sukoon diacritical marks, and let $A_0$ and $B_0$ denote the stripped versions of $A$ and $B$. Then two words A and B are said to be R0-identical, $R_0(A,B) = 1$, if $R(A_0,B_0) = 1$. When generating the vocabulary $V_d$, all words in the list which are R0-identical are represented by a single word in $V_d$; the vocabulary word is selected or generated to be the one with all its sukoon marks removed, for efficient memory utilization. Finally, each word in the table is mapped to its undiacritized base. A new database is generated which contains the undiacritized word bases, together with the list of corresponding diacritized words and their counts; a Matlab program automates the building of this database, which is called "uvwords_db". Next, a database is generated for each unique sequence of two words in the list file together with its number of occurrences; two two-word sequences are considered identical if their corresponding words are R0-identical. This database is called "bigram_db".
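The R0 comparison amounts to stripping sukoon marks before an exact comparison; a minimal illustrative sketch:

SUKOON = "\u0652"

def strip_sukoon(word):
    # Remove all sukoon marks from a word.
    return word.replace(SUKOON, "")

def r0_identical(a, b):
    # R0(A, B) = 1 if the words match exactly after sukoon removal.
    return strip_sukoon(a) == strip_sukoon(b)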
Another problem in creating the training databases is that many articles and common short words are not fully diacritized. The missing diacritical marks represent a serious problem: at a minimum, they unnecessarily increase the size of the vocabulary and create ambiguity, since it is not clear whether a missing diacritic is simply a sukoon or a genuinely omitted vowel. The problem can be partially alleviated by creating a lexicon of exception cases for common words and articles which usually appear partially or totally undiacritized.
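One way to realize such an exception lexicon is a lookup applied to the training text before statistics are collected, replacing known partially diacritized forms with canonical fully diacritized ones. The entries below are illustrative examples of ours, not taken from the paper's lexicon:

# Hypothetical exception lexicon: common short words that typically appear
# undiacritized, mapped to canonical fully diacritized forms.
EXCEPTIONS = {
    "\u0645\u0646": "\u0645\u0650\u0646\u0652",  # "min" (from)
    "\u0641\u064A": "\u0641\u0650\u064A",        # "fi" (in)
}

def normalize(word):
    # Replace a known exception by its canonical diacritized form.
    return EXCEPTIONS.get(word, word)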
4. Experimental Results
The word list (vocabulary) came to 18,623 diacritized words, consisting of 179,910 characters. The corresponding table of base words consists of 15,006 words comprising 80,864 characters.
The maximum number of tashkeel (diacritized) variants for any undiacritized word was found to be 12. The test set consists of 50 sentences randomly selected from the entire Quran text; it contains 995 words and 7,657 characters. Applying the Viterbi algorithm in its basic form resulted in 230 letter-vowelization errors out of 4,234 undiacritized characters, i.e. 5.43% errors in the diacritic marks of letters. The analysis of these errors is important for exploring future directions for improving the basic algorithm. Errors in determining the end cases (case endings) of words account for 94 of the errors. The formal approach to resolving these cases is to implement a syntax analyzer, which can be inserted as a post-processing stage after the Viterbi algorithm. It was noticed that a considerable number of these end-case errors are repeated and occur in a few frequently used words. Accordingly, a simple approach to reducing the number of end-case errors could be devised based on higher-order grams for the most frequently used words. For example, errors in resolving the end cases of the word Allah (الله) account for 15 errors. The use of a trigram or 4-gram is expected to resolve the majority of these cases.