Team Project: Predicting Phrase Translations

Overview

At a very high level, machine translation, which we will discuss later in the semester, involves two main problems: mapping translation units from one language to another, and dealing with word order differences. For example, if we want to translate English the blue house into French la maison bleue, and we are using words as our translation units, we need to choose to map blue to bleue (and not to, say, triste, which means “sad”, another possible meaning for blue), and we also need to account for the fact that an English adjective-noun ordering is frequently expressed as noun-adjective in French.

the blue house
la maison bleue
(word alignment: the ↔ la, blue ↔ bleue, house ↔ maison; note that the blue/bleue and house/maison links cross)

Many current translation systems go beyond single words as translation units, instead using “phrases” consisting of contiguous sequences of words. (These are not phrases in the linguist’s sense, but that’s the terminology.) The following well-known example illustrates alternative ways of “covering” a Spanish input sentence with phrase-to-phrase mappings. For example, the Spanish phrase no dio can map as a single unit to English did not give, although one can obtain the same covering by mapping no to did not and dio to give.

Systems are typically guided to good translations (or so they hope!) by models involving phrase-to-phrase mappings (e.g. the probability of English a given Spanish una is pretty high), target-language word sequences (e.g. Mary did not give is going to look better to an n-gram model than Mary no slap), and changes in word order (e.g. a phrase late in the Spanish sentence will generally tend to show up late rather than early in the English sentence).

Interestingly, MT models are virtually always expressed in terms of what happens in a single sentence. Even if the input is a whole document, state-of-the-art statistical MT systems translate one sentence at a time, with each sentence-level translation taking place independently of information from other sentences. Intuitively, this seems wrong – from looking at word sense disambiguation, we know that document-level context can often help narrow down plausible word meanings, and there’s also a tendency for a word to be used in the same sense throughout a discourse. Shouldn’t those observations also help us when selecting English translation units to use in translating an input sentence from some other language?

In this project we are going to explore the idea that information within an entire document might be helpful in deciding which English translation units might be likely, when translating from another language into English. We are going to ignore issues related to word order, and we’re also not going to consider English n-gram models (which, while very valuable, provide within-sentence rather than document-level constraints). The focus here will be on identifying “interesting” English phrases that might be part of the translation of a foreign-language input document, and trying out various techniques for assigning scores to those phrases, using information from the entire document.

Note that the project is designed so that if we produce a good solution to this sub-problem, the results could very easily be plugged into a statistical MT system and we could see if it helps in translation. What you’re being given here is not a textbook problem; rather, it’s part of the very much open problem of how to do better machine translation. That has advantages and disadvantages. On the positive side, it’s more fun. On the negative side, this is open territory and it’s possible that unforeseen problems will crop up with the assignment – either in how it’s formulated, or with the materials I give you, or with system issues. If that happens, let me know and we’ll adjust accordingly.

Input and Output

The input will be a tokenized French document file, one sentence per line. (Following MT convention, “French” here can be a stand-in for any source language.) By convention we will number sentences and tokens within sentences starting from 0. That is, “sentence 0 token 0” will refer to the first token in the document.

The output will be a list of lines containing the following tab-separated fields:

  • French sentence number
  • Start token number
  • End token number
  • Score for proposed English phrase
  • Proposed English phrase (any number of space-separated tokens)

For example, suppose we have the following document:

la présidente démocrate de la chambre des représentants américains, nancy pelosi, a affirmé mercredi à damas que le président syrien bachar al-assad était prêt à reprendre les pourparlers de paix avec israël, à l'issue d'une visite critiquée par le président george w. bush.

"nous avons écouté l'avis du président assad.

il a dit qu'il était pour la reprise du processus de paix au proche-orient", a dit mme pelosi qui s'est déclarée prête à promouvoir la paix entre les deux pays.

Some lines in the output might include

0    0     1     0.6    president
0    4     7     0.8    house of representatives
0    19    23    0.77   syrian president bachar al-Assad
1    2     4     0.03   have listened the
1    4     8     0.08   president assad ’s advice
2    10    12    0.1    peace process
2    10    12    0.09   process of peace

In principle, English phrases can be single words, named entities, or just contiguous word sequences as in the Spanish example earlier. This section was just about input and output formats; we’ll talk more below about where the proposed English phrases might come from.
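
For concreteness, here is a minimal Python sketch of the I/O plumbing: reading a tokenized document and writing proposal lines in the format above. The propose_phrases function and the file name are hypothetical placeholders for whatever technique you implement.

import sys

def read_document(path):
    # One tokenized sentence per line; sentences and tokens are numbered from 0.
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

def write_proposals(proposals, out=sys.stdout):
    # proposals: iterable of (sentence_no, start_token, end_token, score, english_phrase)
    for sent, start, end, score, phrase in proposals:
        out.write("\t".join([str(sent), str(start), str(end), str(score), phrase]) + "\n")

# doc = read_document("input.fr")          # hypothetical file name
# write_proposals(propose_phrases(doc))    # propose_phrases is whatever you build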

Evaluation Measures

For each sentence in the French document, a set of alternative English translations will be provided. For example, for input sentence 1 (" nous avons écouté l' avis du président assad .), four reference translations into English (of varying quality) might be

“ we have listened to the advice of president assad .

“ we listened to president assad ’s advice .

“ we are taking assad ’s guidance .

“ president assad ’s advice will guide us .

For every individual phrase in your system’s output table, the key question in evaluation is whether that phrase is a “match” to any reference translation – that is, whether or not it appears, exactly, in any of the reference translations. For the phrase proposed in the fourth line of output above (have listened the), the answer is no. For the phrase proposed in the fifth line (president assad ’s advice), the answer is yes, since it appears in the second and fourth English reference translations. (As this example illustrates, matching is yes or no; you don’t get more credit for a phrase if it appears in multiple reference translations. We’ll talk more about that when we discuss MT evaluation, since the evaluation here is designed to be similar to BLEU, a score frequently used to evaluate end-to-end MT systems.)

If the lines of the output table are sorted from high to low scores, we can evaluate the output’s quality by looking at the tradeoff between yield and precision. Yield can be varied by keeping only the top k lines of the sorted table. For any yield k, the precision will be defined as the fraction of those k lines whose phrase is a match in one of the reference translations for its source sentence. For example, if one were to sort the table and threshold at yield k=10, and 4 out of those 10 phrases provided a match within one of the corresponding reference translations, then precision at yield 10 would be 0.4.

Graphing precision (on the y axis) versus yield (on the x axis) will provide a picture of the tradeoff, and those graphs/tables are the result of the evaluation. In general, if scores are meaningful then one would expect to see high precision at lower yields (since low yields restrict the output to just the highest scoring proposed phrases), and lower precision as the yield increases (drawing in more proposed phrases with lower scores).

(Notice that there is no upper bound on yield. It might be interesting to normalize yield by dividing k by the total number of lines in the table, but if we did that, then I think we’d lose valuable information. System A and System B could both have precision 0.7 at normalized yield 0.5, but 50% of the total might be a huge number of phrases for one system and a small number for the other.)
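
To make the evaluation concrete, here is one possible Python sketch of the matching test and of precision at yield k. It assumes the output table has already been sorted by score and that the references are available per sentence; the variable names and data layout are illustrative, not prescribed.

def is_match(phrase, references):
    # A phrase matches if its token sequence appears contiguously in any reference translation.
    p = phrase.split()
    for ref in references:
        r = ref.split()
        if any(r[i:i + len(p)] == p for i in range(len(r) - len(p) + 1)):
            return True
    return False

def precision_at_yield(sorted_table, refs_by_sentence, k):
    # sorted_table: (sentence, start, end, score, phrase) rows, highest score first
    # refs_by_sentence: dict from sentence number to its list of reference translations
    top = sorted_table[:k]
    hits = sum(1 for (sent, _, _, _, phrase) in top
               if is_match(phrase, refs_by_sentence[sent]))
    return hits / k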

Baselines

One baseline approach for this task is to consider only phrases of length 1 (i.e. tokens), and to propose the “best” single word-level translation for every input token as a “phrase” covering that token. You will be provided with (or given what you need to create) a translation lexicon containing triples <score, f, e>, where score is the “goodness” of token e as a translation for French token f. (Typical ways to measure “goodness” include Pr(f|e), Pr(e|f), and the log-likelihood ratio of e and f. You might get more than one of those.)
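
A rough sketch of this baseline, assuming the lexicon has been loaded into a Python dict mapping each French token to a list of (score, english_word) pairs (the actual lexicon file format may differ), is below; note that the end-token field is taken to be inclusive, matching the worked example earlier.

def baseline_single_word(doc, lexicon):
    # doc: list of tokenized sentences; lexicon: {french_token: [(score, english_word), ...]}
    proposals = []
    for sent_no, tokens in enumerate(doc):
        for i, tok in enumerate(tokens):
            if tok in lexicon:
                score, eng = max(lexicon[tok])          # highest-scoring single-word translation
                proposals.append((sent_no, i, i, score, eng))
    return proposals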

A second baseline approach would be to generalize the first baseline by using a “phrase dictionary”, and proposing the “best” English phrase for every phrase in the input sentence, if one exists in the dictionary. (There are a lot of these, of course, and they will probably overlap.)
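
Similarly, a sketch of the phrase-dictionary baseline, assuming the dictionary is keyed on space-joined French n-grams and maps to scored English phrases (again, the real format you receive may differ):

def baseline_phrase_dict(doc, phrase_dict, max_len=7):
    # phrase_dict: {"french words ...": [(score, english_phrase), ...]}
    proposals = []
    for sent_no, tokens in enumerate(doc):
        for i in range(len(tokens)):
            for j in range(i, min(i + max_len, len(tokens))):
                key = " ".join(tokens[i:j + 1])
                if key in phrase_dict:
                    score, eng = max(phrase_dict[key])
                    proposals.append((sent_no, i, j, score, eng))
    return proposals

The resulting proposals will overlap heavily; that is fine, since the output format allows it.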

Approaches to the Task

As I have defined them, the task and evaluation method permit an enormous number of possible ways to propose phrases. As stated in the overview, however, the goal here is to explore ways of taking advantage of information within the entire document to help inform translation of individual sentences. With that in mind, here are several approaches to consider. (Note that these are some ideas I think are worth trying, but I’m open to other possibilities. If you come up with other ideas, I strongly recommend that we discuss the proposed method before you implement it. That might be a very good idea in any case!)

TextRank WSD of words using WordNet. In Rada Mihalcea, Paul Tarau, and Elizabeth Figa, “PageRank on Semantic Networks, with Application to Word Sense Disambiguation”, COLING 2004, the authors introduce a variation on the PageRank algorithm that is used to select word senses from a sense inventory given monolingual input. The basic idea is to create a weighted directed graph where nodes are possible word senses, and edges represent relatedness in the WordNet taxonomy; PageRank is then used to assign scores to nodes in the graph. They summarize the intuition behind the approach as follows:

The meaning selected by PageRank from a set of possible meanings for a given word can be seen as the one most recommended by related meanings in the text, with preference given to the “recommendations” made by most influential ones, i.e. the ones that are in turn highly recommended by other related meanings.

Their approach could be adapted to the current problem fairly straightforwardly. Using a scored bilingual lexicon (see the first baseline), one could generate nodes in a graph for all possible word translations. For example, French chambre might yield nodes labeled bedroom, room, and house, and French représentants might yield a node labeled representatives and another labeled delegates. Links between nodes could come from WordNet via methods analogous to what Mihalcea et al. did, or other variations; for example, in WordNet, a direct hyponym of (one sense of) house contains the word Representatives.

A variant on this would be to have the nodes labeled with word senses rather than words. And of course there is a huge amount of variation in the way that links and link weights could be established. Regardless, once the graph is created, the assignment of scores to the nodes is straightforward and the scored labels can be read off as hypotheses.
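
As a starting point, here is a rough Python sketch of the word-labeled variant, using NLTK’s WordNet interface and networkx. The relatedness measure shown (best path similarity over synset pairs) and the undirected graph are just one of the many possible choices mentioned above, not exactly the method of Mihalcea et al.

import itertools
import networkx as nx
from nltk.corpus import wordnet as wn    # requires the WordNet data to be downloaded

def relatedness(w1, w2):
    # One simple WordNet-based measure: best path similarity over all synset pairs.
    best = 0.0
    for s1, s2 in itertools.product(wn.synsets(w1), wn.synsets(w2)):
        sim = s1.path_similarity(s2)
        if sim is not None and sim > best:
            best = sim
    return best

def textrank_scores(candidate_words, min_weight=0.2):
    # candidate_words: all English words proposed by the lexicon anywhere in the document
    G = nx.Graph()
    G.add_nodes_from(candidate_words)
    for w1, w2 in itertools.combinations(candidate_words, 2):
        w = relatedness(w1, w2)
        if w >= min_weight:
            G.add_edge(w1, w2, weight=w)
    return nx.pagerank(G, weight="weight")    # {english_word: score}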

TextRank WSD of phrases using WordNet. Same as the above, but one could map from the source text to nodes labeled with English phrases that are present in WordNet. For example, if the dictionary maps French chambre to English house, then one could look in the phrase dictionary for all phrases containing house that are also in WordNet. This might produce house_of_representatives, house_of_commons, duplex_house, fraternity_house, etc. (depending on what’s in the phrase dictionary) as graph nodes.

Since this would likely overgenerate wildly (fraternity house?!), an alternative might be to restrict attention to WordNet phrases that contain translations of words appearing within French n-grams. For example, from chambre des représentants one could generate all WordNet phrases containing any pair from the cross product {bedroom, room, house} X {representatives, delegates}; this would include, e.g., house_of_representatives but not fraternity_house.
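
A sketch of that restricted candidate generation is below; translations_of is a hypothetical helper that returns the English word translations of a French token from the lexicon. It indexes WordNet’s multiword lemmas once and then intersects.

from collections import defaultdict
from itertools import product
from nltk.corpus import wordnet as wn

def build_phrase_index():
    # Map each English word to the set of WordNet multiword lemmas that contain it.
    index = defaultdict(set)
    for synset in wn.all_synsets():
        for lemma in synset.lemma_names():
            if "_" in lemma:
                for word in lemma.split("_"):
                    index[word].add(lemma)
    return index

def phrase_candidates(french_ngram, translations_of, index):
    # french_ngram: e.g. ["chambre", "des", "représentants"]
    cands = set()
    for f1, f2 in product(french_ngram, repeat=2):
        if f1 == f2:
            continue
        for e1, e2 in product(translations_of(f1), translations_of(f2)):
            cands |= index[e1] & index[e2]    # phrases containing both English words
    return cands

With the lexicon entries from the example above, chambre des représentants would yield house_of_representatives (it contains both house and representatives) but not fraternity_house.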

TextRank WSD of named entities using NewsExplorer database. The same basic idea can be applied using a different knowledge resource instead of WordNet. We have created a database of about 70,000 named entities and links between them by mining the EMM NewsExplorer site. For example, if you go to the entry for John Bolton (U.S. ambassador to the United Nations), you can see that there are high-weight links to the entity entries for people like George Bush and Mahmoud Ahmadinejad, organizations like United Nations, White House, and International Atomic Energy Agency, and even key titles like “ambassador” and “American ambassador”, as well as spelling variations and hyperlinks to news stories that will themselves presumably contain relevant words, phrases, and entities.

As should be evident from the discussion above, there are lots of opportunities here for a graph-based approach like TextRank. As before, one could use the phrase dictionary (either whole phrases or mapping from words to phrases containing them) to generate nodes in a graph corresponding to named entities, and then use the knowledge source, in this case the NewsExplorer database rather than WordNet, to generate weighted links.

Graph methods with vector weighting. The previous suggestions involve generating candidate nodes and deriving link weights from a knowledge structure. However, in most of the above cases, nodes in the graph can be associated with text – for example, nodes from WordNet can be associated with definitions and/or glosses (e.g. house: an official assembly having legislative powers, “a bicameral legislature has two houses”), and nodes from NewsExplorer have lots of associated text, including quotes from and about individuals and hyperlinks to related news stories. This leads to an alternative way to weight links: one could associate with each node a term vector, just like in information retrieval, constructed from the associated text or texts (stemming, stoplisting, etc.), and then use cosine between vectors as the link weight. And of course cosine-based link weights could also be combined with knowledge-based weights in some fashion, e.g. getting them on the same scale and then doing linear interpolation.
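
A small sketch of the cosine idea, with the vector construction deliberately bare (add stemming and stoplisting as you see fit):

import math
from collections import Counter

def term_vector(text):
    # Raw term-frequency vector over lowercased whitespace tokens.
    return Counter(text.lower().split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Link weight between two nodes, given the text associated with each:
# weight = cosine(term_vector(text_a), term_vector(text_b))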

Variations on Lesk. Cosine between vectors is, if you think about it, a way of measuring text overlap. One could easily imagine generating candidates and then using variants on the Lesk algorithm in order to assign scores to them.
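
For example, one simple Lesk-style variant (a sketch, not a prescribed recipe) would score a candidate by counting the overlap between its WordNet gloss and example sentences and the set of English words proposed elsewhere in the document:

from nltk.corpus import wordnet as wn

def lesk_style_score(candidate, document_words, stopwords=frozenset()):
    # candidate: an English word or WordNet phrase such as "house_of_representatives"
    # document_words: English words proposed (via the lexicon) anywhere in the document
    gloss = set()
    for synset in wn.synsets(candidate):
        gloss.update(synset.definition().lower().split())
        for example in synset.examples():
            gloss.update(example.lower().split())
    return len((gloss - stopwords) & {w.lower() for w in document_words})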

Stepping back.

As should be evident, a general pattern has emerged here, at least for the graph-based methods, which might be useful in guiding the architecture of your approach.

- Identify candidate English phrases (possibly single-word) from source n-grams (possibly unigrams). (Do bookkeeping to keep track of where they originated!)

- Treat each candidate as a node in a graph.

- Generate weighted links, e.g.,

  • By using links in a knowledge structure (e.g. WordNet, NewsExplorer)
  • By associating text with nodes and using vector similarity

- Apply some graph-related algorithm (e.g. PageRank variants) to score nodes.

- Use bookkeeping info to generate scored lines in the output table.

- Sort the table.

- Generate precision/yield tables and graphs.
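
For the last step, a minimal matplotlib sketch for the graphs, assuming you have already computed precision at each yield value:

import matplotlib.pyplot as plt

def plot_precision_yield(precisions, out_path="precision_yield.png"):
    # precisions[i] = precision when keeping the top i+1 lines of the sorted table
    yields = list(range(1, len(precisions) + 1))
    plt.figure()
    plt.plot(yields, precisions)
    plt.xlabel("yield (k)")
    plt.ylabel("precision at yield k")
    plt.savefig(out_path)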

Some optional related reading would include:

Agirre et al., Two graph-based algorithms for state-of-the-art WSD

Mihalcea, R. (EMNLP 2005), Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling

What the team should turn in

  1. A tarball of your source code, including enough information for someone to run it.
  2. A writeup that describes what you did, covering at least one baseline plus at least two non-baseline techniques. At least one of your techniques must propose multi-word phrases rather than individual words as candidates. For each technique, include
       1. Preprocessing that was necessary
       2. Algorithmic description (clearly written text/pseudocode, example(s), etc.)
       3. Precision/yield graph(s). (It might turn out to be informative to bucket these by n-gram size, e.g. one graph each for unigrams, bigrams, trigrams, and one combined graph for 4-grams and up.)
       4. A discussion of how your techniques take advantage of document-level information in principle, illustrating with some examples of how they succeed and/or fail in doing so.

Your writeup should also include answers to the following questions:

  1. Why would it be appropriate to describe the evaluation measures here as being “precision oriented”? Give arguments for and against including a measure of recall for this task.
  2. Looking at all your approaches and their results, provide a top-level discussion of what seems to work well, what does not, and what the most attractive avenues would be for future work.
  3. If I were going to assign a project like this again in the future (perhaps not this specific problem or algorithms), what are some suggestions that would improve the design of the assignment?

Grading

The group will receive a grade-in-common out of 75 points.

For the remaining 25 points, each team member should anonymously rate each other team member as follows – making sure to look at the definitions below to see what the numeric scales are supposed to mean.

Name of person being evaluated

Collaboration: scale from 1 to 10 where 10 is highest

Contribution to success of the project: scale from 1 to 10 where 10 is highest

Effort: scale from 1 to 5 where 5 is highest

Collaboration: 10 means that this person was great to collaborate with and you’d be eager to collaborate with them again, and 1 means you definitely would avoid collaborating with this person again. Give 5 as an “average” rating for someone who was fine as a collaborator, but for whom you wouldn’t feel strongly about either seeking them out or avoiding them as a collaborator in the future.