CS224N / Ling 237

Programming Assignment 2: Word Alignment Models and Maximum Entropy Classifiers

14 December 2018

Thad Hughes, Marius Meissner


1.Word Alignment Models


Our task for this part of the assignment was to create models that are capable of aligning words in pairs of sentences that are translations of one another. We explored several ways of doing this, ranging from simple relative frequencies of the words in training sentence pairs to models trained using EM to determine the most likely words in one language given a word from another language. Our final system which gave us the best results is a linear interpolation of these alignment models, with weights obtained by tuning on validation data.

1.2Code organization

All of our alignment models are implemented in classes that implement the WordAligner interface, which we extended to include training and tuning methods. The unsupervised training method takes a collection of sentence pairs, and the supervised tuning method is given a collection of sentence pairs along with gold-standard alignments with which parameters of the model can be tuned.

We created an abstract implementation of the WordAligner class called ScoringAligner that is able to create and manipulate alignments in which the alignment of each possible word pair in the sentence is scored with a floating point number, and only alignments that receive a high enough score are translated into the Boolean-style alignment presence required by the testing code.

IBM Models 1 and 2 are implemented in their own classes. The EM algorithm is contained entirely in the Model 1 class, and the Model 2 class inherits from the Model 1 class and simply includes a distance from the diagonal when choosing whether to align word pairs.

1.2.1The Tree class

Instrumental in our efficient training was the Tree class, which we used instead of the Counter and CounterMap classes to represent sparse data. Each Tree node has a map in which children can be found, as well as a reference to some external data structure, such as alignment fractional counts or conditional probabilities.

The Tree gives us at least three distinct advantages over the CounterMap. First, once a node has been located in the tree (for example, to represent a conditional probability), several properties of that node can be accessed simultaneously without further lookups. In addition to lowering the number of lookups required to find data, storing the data in a single data-structure improves the memory-locality and thus cache performance of the code. Second, information, such as sums of fractional counts of child nodes, can be stored in the internal non-leaf nodes of the tree so that it doesn’t have to be calculated with another loop. Third, and perhaps most importantly, the result of an intermediate lookup can be held and re-used; for example, when using a nested loop over the words in two sentences, one can often avoid repeatedly looking up the word in the outer loop.

1.2.2EditDistance and CharacterNormalizer

In order to detect cognates between the two languages, we developed two utility classes. EditDistance determines the similarity between two strings by computing a modified form of the Levenshtein edit distance, which uses dynamic programming to determine the minimum edit cost to transform one string into the other. We attempted to create a basic cost function for the transformation of English words into French words; for example changing a letter from one vowel to another is less expensive than changing from a vowel to a consonant. Furthermore, changing to a letter that differs only by an accent is a very inexpensive edit. Removing the characters’ accepts is done by our CharacterNormalizer class, which uses the IBM International Components for Unicode library, included with the source as a JAR file.

1.3Required Alignments models

1.3.1Baseline and improved baseline

The baseline alignment model provided simply placed alignment markers on a line with a slope of 1. This makes very little sense if the sentences are of different length, since the ending words of the sentence will not be aligned. We also wrote a StretchAlignmentModel, which selects the alignments in each column that are closest to the diagonal. Neither of these alignment models uses training data in any way.

1.3.2Warm-up and improved warm-up

We implemented the suggested warm-up alignment model, which selected for each English word the corresponding French word from each sentence that maximized the following score:

When looking at alignments produced by this score, we noticed patterns that led us to try a slightly different scoring model, which performed significantly better. Instead of dividing by the product, we instead divided by the sum:

We further enhanced the warm-up model by having it prefer diagonal alignments and prefer to align words that appeared to be cognates. These alignment models use training data to determine P(f,e), P(f), and P(e).

1.3.3IBM Model 1

IBM Model 1 uses a variant of the EM algorithm to iteratively create better and better approximations of the conditional probability distribution p(f|e). For all English words e and French words f that occurred in sentences that are translations of one another, Model 1 starts by assuming p(f|e) is totally uniform. For words that did not occur in a matching sentence pair, p(f|e) is assumed to be 0. Using the p(f|e) values, EM then iteratively computes fractional counts for each word pair for all possible alignments, and then uses those fractional counts to adjust the conditional probabilities p(f|e). Further, the possibility that an English word might align to none of the French words is considered by the EM algorithm by introducing a NULL word into each French sentence.

Once training data has been used to estimate p(f|e), sentences can be aligned by selecting, for each English word, the French word that maximizes p(f|e). If the word selected is the NULL French word, then the English word is not aligned with any words in the French sentence.

1.3.4IBM Model 2

Our version of IBM Model 2 also uses the EM process to train conditional word probabilities, but includes an additional penalty for alignments that do not occur on the diagonal of the sentence. In our Model 2, the penalties for alignments that do not occur along the diagonal are a constant function of the distance from the diagonal that increases as the distance increases. We found experimentally that a distance penalty of the form


(with p=0.5) worked well.

1.4Extra Alignment Models

1.4.1Using Cognates

Because the Romance languages share common roots, many of the words have similar forms. We experimented with using word similarity to detect words that should align with one another. The CognateAligner produces a scored alignment that aligns words from the French and English sentences that appear to be cognates with one another.

1.4.2Symmetric alignments with bi-directional Models 1 and 2

The training used in Models 1 and 2 is asymmetric in the sense that alignments created using p(e|f) are different than those created with p(f|e). However, real alignments are in a sense inherently symmetric in the sense that if an English word aligns with a French word, the French word also aligns with the English word. Because of this, we decided to train the models in both directions and experiment with combining the results of both models. We first tried computing the union of the alignments, which performed poorly, and later tried the intersection, which performed extremely well.

1.4.3Weighted Voting by Alignment Models

Generalizing on the idea of combining the results of two alignment models, we created a special alignment model called InterpolateAligner that combines the results from several alignment models using a weighted linear interpolation of the scores given to each individual potential alignment. Each model that is being interpolated first computes a score for each possible word-pair alignment, storing the score in a rectangular matrix. The InterpolateAligner then linearly adds all the matrices, and chooses to align word-pairs whose total score is higher that a cutoff value.

Currently, we combine scores from four different aligners. The first simply prefers alignments along the diagonal. The second prefers aligning cognates. The third and fourth are the results obtained from using Model 2 in the p(f|e) and p(e|f) directions, respectively.

Several of the alignment models have extra parameters that can be adjusted to change the performance of the model. For example, Model 2 has a distance penalty that can be used to increase or decrease the penalty for alignments that occur off the diagonal of the sentence pair. Most importantly, our InterpolateAligner has individual weights and a threshold parameter that are used to combine the results of the individual alignment models that it uses. To tune these parameters, we iterated over many possible combinations of these parameters to find the values that minimized the AER as measured on the gold-standard validation data.


Figures 1 through 3 below show the AER, Precision and Recall for all of our alignment models when trained on 1000, 5000, and 10,000 test sentences. Overall, the best AER rate we achieved was 21% when training on 10,000 sentences and using the InterpolateAligner, which combines the results from a forward and reverse Model1Aligner, a CognateAligner, and a DiagonalAligner.

As expected, the BaselineWordAligner performs poorly, giving an AER of 68.6%. However, the StretchWordAligner “stretches out” the alignments to fall on the true rectangular diagonal and achieves an error rate of 57%, a significant improvement. Our first warm-up aligner performed worst with an AER of 69%. After looking at the results it produced, we decided to try a sum instead of a product in the denominator and achieved significantly better results. After some experimentation, and adding preferences to align cognates and to align along the diagonal, we achieved an AER of 31%, which was quite surprising. Our Model1Aligner (Model1NoCognatesFE) beat the first warm-up model but not the improved warm-up model, while our Model2Aligner (Model2NoCognatesFE) was able beat the improved warm-up model. Using cognates during the EM process slightly improved the alignment scores in all cases.

One of our most interesting discoveries was the asymmetry of alignments produced using p(f|e) and p(e|f). At first, we were computing alignments using Models 1 and 2 using p(f|e), but when we experimented with using p(e|f) (Model1NoCognatesEF and Model2NoCognatesEF), the results were even better. Model2NoCognatesEF achieved an AER of 28.5% when trained with 5,000 sentences. Interestingly, the p(e|f) models trained on 10,000 sentence pairs performed worse than when trained on 5,000 sentence pairs, due to drastic drops in precision. This is highly counter-intuitive, and we are unable to explain it.

Even though the p(e|f) models produced worse AER results when trained on 10,000 sentences, we were still able to achieve our overall best result when interpolating the alignment scores from four different models: a model that prefers to align along the diagonal, a model that aligns cognates, and two Model1Aligners running in opposite directions.

Figure 1

Figure 2

Figure 3

1.6Error Analysis

1.6.1Learned EM translations

After the EM process completes, we have estimates for p(f|e) or p(e|f). For each English or French word, we examined the three most likely words from the other language. Although neither of us speaks French, in many cases we were able to see that EM was finding correct translations. However, in many cases, EM was clearly failing. For example, in the p(e|f) model, EM was preferring to align “.” with “<NULL>” rather than with the obvious choice “.”. For example, EM was choosing to align “thirty” with “president.” It isn’t clear to us why this occurs, although it seems possible that the distribution of words in the sentence pairs could cause this behavior.

1.6.2Cognate errors

Some words appear to be cognates but actually aren’t. For example, the French “dans” is not a cognate with English “days,” but our cognate detector flagged them as such. Further, if a word appears twice in the sentences, the cognate score will prefer to align each occurrence of the word in one language with each occurrence from the second, when in reality each of the occurrences in one language should only align with one of the cognates from the other language.


We found that we were able to produce surprisingly accurate alignments without using sophisticated statistical methods like EM. However, using the EM algorithm, and in particular running it in both directions and combining the results with other simple alignment models produced our best results. We were also impressed with the ability of the EM algorithm to automatically discover which words are likely translations of one another.

2.Maximum Entropy Classifiers


In this part of the assignment, we completed the maximum entropy classifier implementation provided. We wrote functions to compute log probabilities of data as well as objective functions and their derivatives needed when adjusting the weights of the model. We then designed more sophisticated features to enable the classifier to separate proper nouns into five groups.

2.2Building a Simple MaxEnt Classifier

2.2.1Getting Log Probabilities

The first step in the maximum entropy classifier implementation was to compute log probabilities of labels given a certain datum. To do this, we first multiply the feature counts by the current weights for each label. Then, we find the probability by taking the exponential of this product, and normalizing:

Finally, we take the log of the above expression to prevent underflows. We return the array of the thus found log probabilities.

2.2.2Calculating Objective Functions and Derivates

As a next step, we use the previously found log probabilities to compute the objective function as well as its derivative as follows:

Notice that the derivative is simply the (negated) difference between the actual counts of features and the counts of features as predicted by our current model. Lastly, we want to make sure that we don’t assign a single label too high of a probability mass, so we regularize our model by penalizing large weights as follows:

2.3Feature Engineering for MaxEnt Classifier

Once the maximum entropy infrastructure was in place, we had to come up with more sophisticated features for the noun classification problem. The model originally had only unigram features in place, so a natural way to extend this seemed to be higher order n-gram features. We also implemented some other relatively simple features, namely the length of the word and a count of special characters.

2.3.1N-gram Features

For the N-gram features, we simply add a feature for every n consecutive characters that appear in a given word. So, for example for “William” we would add “Uni-W”, “Uni-i”, and so forth for unigram features, and then “Bi-Wi”, “Bi-il”, “Bi-ll”…,”Bi-am”, and so forth for tri- and higher order n-gram features. In practice, we were able to extend this up to 7-gram features before memory became a concern for MaxEnt training.

2.3.2Length of Word Features

A simple feature to aid in classifying in addition to the n-gram features is the length of a given word. We therefore add “Length-x” to the feature set for each proper noun, where ‘x’ is the length of the character array.

2.3.3Special Character Count Features

Lastly, we thought that a count of special characters such as “ “ (space), “.”, “-“ and upper-case letters would be a good indicator of a noun’s class, so we add “NumSpace-n” etc. to each feature set.


We ran our maximum entropy classifier with different sets of the above described features, and for each of the sets with the N-gram features varying from 1 (Unigram only) to 7 (Unigram up to 7-gram). The results are plotted in Figure 4. As can be seen in the graph, the accuracy for the N-gram model goes up the higher order N-gram model we allow. However, at some point, the improvements in accuracy gained by adding another level N-gram are only marginal. Notice for example the small gain from 5-gram to 6-gram, and further to 7-gram. In fact, when we ran this model on an 8-gram implementation, the accuracy actually dropped slightly, suggesting over-fitting of the data.

Secondly, the graph illustrates the gain in accuracy obtained when adding length and/or special character count features. This gain however only manifests itself for the lower order N-gram model. Once we allow 5-gram or higher features, they essentially encode the same information as the extra length and character count features, and thus the accuracy levels out at about the same value of 86.76%.

Figure 4


During this part of the assignment, we were able to train a maximum entropy classifier system that can fairly accurately distinguish between five categories of proper nouns. We found that high-order N-gram features up to about 7-gram give a great gain in accuracy compared to unigram features alone. Furthermore, other primitive features like word length can help, but their impact becomes less and less with higher order n-gram features. In general, we can conclude that maximum entropy classification is a well-suited algorithm for category classification tasks, if the right features are employed.

Appendix A: References

[1] Kevin Knight, A Statistical MT Tutorial Workbook. MS., August 1999.

[2] Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast Decoding and Optimal Decoding for Machine Translation. ACL.

Appendix B: Contributions of Individual Collaborators

Thad and Marius did most of the programming side-by-side, although Thad focused more on optimizing the word alignment models and Marius focused more on the Maxent models. Thad had written the EditDistance code a long time ago to find duplicate copies of songs in his music collection.