A Co-Chunk Based Method for Spoken-Language Translation

CHENG Wei, ZHAO Jun, LIU Feifan and XU Bo

National Laboratory of Pattern Recognition

Institute of Automation, Chinese Academy of Sciences, Beijing, China


ABSTRACT

Chunking is a useful step in natural language processing. This paper puts forward a definition of co-chunks for Chinese-English spoken-language translation, based on both the characteristics of spoken language and the differences between Chinese and English. An algorithm is proposed to identify the co-chunks automatically; it combines rules with a statistical method and ensures that a co-chunk has both a syntactic structure and a well-defined meaning. Using the co-chunk alignment corpus, we present the framework of our translation system. In the framework, the word-based translation model is employed to smooth the co-chunk-based translation model. A series of experiments shows that the proposed definition and the co-chunking method lead to a great improvement in the quality of Chinese-English spoken-language translation.

Keywords

chunking; spoken-language translation; statistical machine translation.

1.  Introduction

It is well known that speech-to-speech translation faces more problems in comparison with pure text translation, such as:

•  More irregular spoken utterances: there are many more pauses, repetitions, omissions, etc. in spoken language.

•  More flexible speech styles: slow or rapid speech with different stresses and accents appears.

•  No punctuation to segment the utterances.

Confronted with these problems, more robust technologies need to be developed to achieve acceptable performance in a spoken-language translation system. In recent years, data-driven methods have been taken as effective approaches to machine translation, such as example-based machine translation (EBMT, proposed by ATR) and statistical machine translation (SMT). The statistical approach is an adequate framework for introducing automatic learning techniques into spoken-language translation. It has been studied for many years[1][2][3][4][5]. However, its performance is not yet satisfactory[6].

In this paper, we introduce text chunking into the SMT model to improve the translation quality. Chunking is a useful step in natural language processing, and there has been much research on chunk parsing for a single language[7][8][9]. In machine translation, however, we need a definition that relates both the source language and the target language. Therefore, we first present the co-chunk definition for Chinese-English spoken-language translation. Then a co-chunking method based on the definition is investigated. Finally, an SMT system based on co-chunks is built to improve the translation quality.

The paper is organized as follows. Section 2 describes the definition and the features of the co-chunk. Section 3 presents an automatic algorithm for co-chunk identification. Section 4 presents a statistical translation framework based on the co-chunk. In Section 5, experimental results are presented and analyzed. Some concluding remarks are given in Section 6.

2. Definition of Co-chunks

In this paper, a co-chunk is composed of a source sub-chunk and a target sub-chunk. Each of them has both a syntactic structure and a meaning of low ambiguity. The definition can be described by the following formula:

$BC = \{\, (bs, bt) \mid bs = ws_1 ws_2 \cdots ws_l,\ bt = wt_1 wt_2 \cdots wt_m \,\}$    (1)

where BC denotes a set of co-chunks; bs is the source sub-chunk and l is its length; ws_i is a word in the source sentence; bt is the target sub-chunk and m is its length; wt_i is a word in the target sentence. NS is the number of source sub-chunks in the source sentence, and NT is the number of target sub-chunks in the target sentence. The detailed explanations are as follows.

1)  Structure. The sub-chunk is defined as a syntactic structure which can be described as a connected sub-graph of the sentence's parse tree. The sub-chunks in a sentence do not overlap each other.

2)  Meaning. The typical sub-chunk consists of a single content word and its contextual environment. Therefore the meaning of the sub-chunk is less ambiguous. This property can be used for disambiguation in machine translation.

3)  Transition. The meaning of the target sub-chunk should be the same as that of the corresponding source sub-chunk, except when a source sub-chunk corresponds to a null target sub-chunk and vice versa. (A minimal data-structure sketch of this definition is given below.)
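
To make the definition concrete, the following is a minimal Python sketch of one possible co-chunk data structure; the class and field names are illustrative assumptions, not part of the original system.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SubChunk:
    # A connected sub-graph of the sentence's parse tree.
    words: List[str]   # the word sequence ws_1..ws_l (or wt_1..wt_m)
    tag: str           # sub-chunk type, e.g. noun or verb sub-chunk

@dataclass
class CoChunk:
    # A co-chunk pairs a source sub-chunk bs with a target sub-chunk bt.
    # Either side may be None, since a sub-chunk can map to a null chunk.
    source: Optional[SubChunk]
    target: Optional[SubChunk]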

An example of a Chinese-English co-chunk is given in Figure 1. From it we can see some features of the co-chunk:

Fig. 1 An example of Chinese-English co-chunk

1)  It builds the semantic relation between two languages.

2)  It keeps one of the characteristics found in most definitions of monolingual chunks, that is, chunks have a legal syntactic structure. Therefore, we can use shallow analysis to extract the co-chunks.

3)  It integrates the syntactic rules of two different languages. Here we define 8 kinds of basic sub-chunks for Chinese: noun sub-chunk, verb sub-chunk, interrogative sub-chunk, adjective sub-chunk, preposition sub-chunk, adverb sub-chunk, modal/punctuation sub-chunk and idiom sub-chunk. In English, an SBAR sub-chunk is added and the modal/punctuation sub-chunk is renamed the interjection sub-chunk. The definitions of all kinds of sub-chunks accord with the characteristics of both Chinese and English.

3. The Automatic Identification of Co-chunks

Figure 2 gives the process of the automatic identification of the co-chunks. It includes three parts: 1) source chunking, 2) searching for the target chunks according to the source chunks, and 3) proof-checking of the co-chunks.

Fig. 2 Structure of the identification system for co-chunks

3.1. Searching for co-chunks

A finite state machine (FSM) is employed in the source chunking and proof-checking stages. Dynamic programming together with a heuristic function is used in searching for the co-chunks. The search algorithm is as follows.

1)  OPEN := (s), g(s) := 0;

2)  LOOP: IF OPEN = () THEN EXIT (FAIL);

3)  n := FIRST(OPEN);

4)  IF END OF SENTENCE THEN EXIT (SUCCESS);

5)  REMOVE(n, OPEN), ADD(n, CLOSED);

6)  EXPAND(n) ---> {ml};

7)  IF CHUNK(ml) follows syntactic rules, ADD(ml, OPEN), and tag POINTER(ml, n);

8)  SAVE min f(PATHi), SORT(NODEj);

9)  GOTO step 2).
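
The following Python sketch shows one way to realize this loop with a priority queue in place of the explicit SORT step; the arguments is_legal_chunk, cost and heuristic stand in for the FSM rules and the scores defined in section 3.2, and are assumptions rather than the paper's actual interfaces.

import heapq

def chunk_search(words, is_legal_chunk, cost, heuristic, max_len=6):
    # Best-first search over chunkings of `words` (a hedged sketch of the
    # OPEN/CLOSED loop above).
    #   is_legal_chunk(chunk) -> bool   checks the syntactic rules (FSM)
    #   cost(chunk)           -> float  negative log probability of the chunk
    #   heuristic(pos)        -> float  estimated remaining cost from `pos`
    start = (heuristic(0), 0.0, 0, ())        # (f, g, position, chunks so far)
    open_list = [start]
    closed = set()
    while open_list:                          # step 2: FAIL when OPEN is empty
        f, g, pos, chunks = heapq.heappop(open_list)   # step 3: n := FIRST(OPEN)
        if pos == len(words):                 # step 4: end of sentence reached
            return list(chunks)
        if pos in closed:
            continue
        closed.add(pos)                       # step 5: move n to CLOSED
        for l in range(1, min(max_len, len(words) - pos) + 1):  # step 6: EXPAND(n)
            chunk = tuple(words[pos:pos + l])
            if is_legal_chunk(chunk):         # step 7: syntactic check
                g2 = g + cost(chunk)
                heapq.heappush(open_list, (g2 + heuristic(pos + l),
                                           g2, pos + l, chunks + (chunk,)))
    return None                               # FAIL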

3.2. Calculation algorithm

We define

$BC_k = (bs_k, bt_k), \quad k = 1, \ldots, K$    (2)

where bs_k is the source sub-chunk of the kth co-chunk, bt_k is the target sub-chunk of the kth co-chunk, and K is the number of co-chunks. The objective of the search can be described as

$BC^{*} = \operatorname*{argmax}_{BC_1 \cdots BC_K} \prod_{k=1}^{K} p(bt_k \mid bs_k)$    (3)

According to the Bayesian formula,

$p(bt_k \mid bs_k) = \dfrac{p(bt_k)\, p(bs_k \mid bt_k)}{p(bs_k)}$    (4)

where p(bs_k) and p(bt_k) can be estimated by the bigram language model as

$p(bs_k) = \prod_{i=1}^{l} p(ws_i \mid ws_{i-1}), \qquad p(bt_k) = \prod_{i=1}^{m} p(wt_i \mid wt_{i-1})$    (5)

p(bs_k|bt_k) is the translation probability of the source sub-chunk on condition that the target sub-chunk occurs:

$p(bs_k \mid bt_k) = \dfrac{\varepsilon(l \mid m)}{(m+1)^{l}} \prod_{i=1}^{l} \sum_{j=0}^{m} t(ws_i \mid wt_j)$    (6)

where t(ws_i|wt_j) can be estimated by the EM algorithm[10], and ε(l|m) is the probability of the length, which can be estimated by a Poisson distribution.

Hence, the estimated cost from the start node to an intermediate node k is

$g(k) = -\sum_{n=1}^{k} \log p(bt_n \mid bs_n)$    (7)

On the other hand, the heuristic estimate of the cost from node k to the end of the sentence is

$h(k) = -\sum_{n=k+1}^{K} \log p(bt_n \mid bs_n)$    (8)

Thus, we get

$f(k) = g(k) + h(k)$    (9)
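
As a hedged illustration of equations (6) and (7), the following sketch computes the model-1-style chunk translation log-probability and the accumulated path cost g(k); the tables t and eps are assumed to come from EM training and a Poisson length model, and are not the paper's actual code.

import math

def chunk_trans_logprob(ws, wt, t, eps):
    # log p(bs|bt) as in equation (6): length prior eps(l, m) times
    # IBM-model-1-style word mixtures; t[(s, w)] is the word translation
    # table estimated by EM (assumed interfaces).
    l, m = len(ws), len(wt)
    logp = math.log(eps(l, m)) - l * math.log(m + 1)
    for s in ws:
        logp += math.log(sum(t.get((s, w), 1e-10) for w in ["<null>"] + wt))
    return logp

def path_cost(co_chunks, logprob):
    # g(k) as in equation (7): accumulated negative log probability of
    # the co-chunks fixed so far on the search path.
    return -sum(logprob(bs, bt) for bs, bt in co_chunks)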

4. The Co-chunk-based Translation

Figure 3 gives the structure of the translation system based on the co-chunks. It includes two steps:

1)  Training: First, some preprocessing steps are applied to the Chinese-English corpus, such as sentence segmentation and word segmentation. Then the identification system is employed to identify the co-chunks in the corpus automatically, so that the statistical models can be trained on the co-chunk-based corpus.

2)  Translation: This step consists of chunk matching and translation decoding. The chunk matching is similar to Chinese word segmentation; it can be done by the maximum matching algorithm against a Chinese chunk list derived from the chunking corpus (a sketch of this step is given below). The translation decoding is the co-chunk-based SMT whose unit is not a word but a co-chunk.
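
A minimal sketch of forward maximum matching over a chunk list, assuming chunk_list holds tuples of words collected from the Chinese chunking corpus and max_len is an illustrative bound:

def max_match_chunks(words, chunk_list, max_len=6):
    # Greedy forward maximum matching: at each position, take the longest
    # known chunk; a single word always matches as a fallback.
    chunks, i = [], 0
    while i < len(words):
        for l in range(min(max_len, len(words) - i), 0, -1):  # longest first
            cand = tuple(words[i:i + l])
            if l == 1 or cand in chunk_list:
                chunks.append(cand)
                i += l
                break
    return chunks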

Fig. 3 Structure of the SMT system based on co-chunks

4.1. Co-chunk-based translation model

From a statistical point of view, the translation task can be described as follows. Given a source ("Chinese") string C, we choose the string E* among all possible target ("English") strings E with the highest probability, which is given by Bayes' decision rule [1]:

$E^{*} = \operatorname*{argmax}_{E} \Pr(E \mid C) = \operatorname*{argmax}_{E} \Pr(E)\, \Pr(C \mid E)$    (10)

This is the word-based SMT approach. Pr(E) is the probability given by the language model of the target language. Pr(C|E) is the probability given by the string translation model from the target language to the source language. The argmax operation denotes the decoding problem, i.e. the generation of the output sentence in the target language.

Then we define the sentences as

$C = bc_1 bc_2 \cdots bc_J, \qquad E = be_1 be_2 \cdots be_I$

where bc_j is a Chinese chunk and be_i is an English chunk. J is the number of co-chunks in the Chinese sentence, and I is the number of co-chunks in the English sentence. Because a source sub-chunk can correspond to a null target sub-chunk, J is not always the same as I. Then equation 10 can be rewritten as

$E^{*} = \operatorname*{argmax}_{E} \Pr(be_1^I)\, \Pr(bc_1^J \mid be_1^I)$    (11)

As in the word-based SMT, Pr(be_1^I) is the probability given by the co-chunk language model, and Pr(bc_1^J|be_1^I) is the probability given by the co-chunk translation model.
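
As a minimal illustration of equation (11), the following sketch picks the best candidate from an enumerated set; real decoding uses the stack decoder of section 4.2, and the model interfaces here are assumptions rather than the paper's actual code.

def best_translation(chinese_chunks, candidates, lm_logprob, tm_logprob):
    # E* = argmax_E Pr(E) * Pr(C|E) over an enumerated candidate set,
    # i.e. equation (11) evaluated in log space.
    return max(candidates,
               key=lambda e: lm_logprob(e) + tm_logprob(chinese_chunks, e))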

4.2. Smoothing

Because the number of distinct units in the co-chunk-based system is larger than that in the word-based system, data sparseness is a severe problem for co-chunk-based translation. That is to say, smoothing is needed in both the co-chunk language model and the co-chunk translation model.

In our system, the trigram model is used as the co-chunk language model.

$\Pr(be_1^I) = \prod_{i=1}^{I} p(be_i \mid be_{i-2}, be_{i-1})$    (12)

Its smoothing algorithm is the back-off method, sketched below.
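
A minimal sketch of such a back-off trigram model over chunks, assuming probability tables p3/p2/p1 and back-off weights alpha3/alpha2 estimated from the training corpus (all names are illustrative):

import math

def trigram_logprob(chunks, p3, p2, p1, alpha3, alpha2):
    # Back-off trigram chunk language model for equation (12): use the
    # trigram if it was seen, otherwise back off to the bigram, then to
    # the unigram, scaling by the back-off weights.
    logp = 0.0
    padded = ["<s>", "<s>"] + list(chunks)
    for i in range(2, len(padded)):
        w2, w1, w = padded[i - 2], padded[i - 1], padded[i]
        if (w2, w1, w) in p3:                 # seen trigram
            p = p3[(w2, w1, w)]
        elif (w1, w) in p2:                   # back off to the bigram
            p = alpha3.get((w2, w1), 1.0) * p2[(w1, w)]
        else:                                 # back off to the unigram
            p = alpha3.get((w2, w1), 1.0) * alpha2.get(w1, 1.0) * p1.get(w, 1e-7)
        logp += math.log(p)
    return logp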

Moreover, the co-chunk translation model is just like IBM model 1[10]:

$\Pr(bc_1^J \mid be_1^I) = \dfrac{\varepsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(bc_j \mid be_i)$    (13)

Here ε denotes some small, fixed number, and t(bc_j|be_i) is the translation probability of bc_j given be_i, which can be estimated by the EM algorithm. We smooth this model with the word-based translation model:

$t'(bc_j \mid be_i) = \lambda\, t(bc_j \mid be_i) + (1-\lambda)\, t_w(bc_j \mid be_i)$    (14)

where t_w(bc_j|be_i) is the estimate derived from the word-based translation model and λ is an interpolation weight.
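
A hedged sketch of this interpolation; the weight lam and the word-level combination below are illustrative assumptions, not values from the paper:

def smoothed_trans_prob(bc, be, t_chunk, t_word, lam=0.7):
    # Equation (14): mix the chunk-level probability with a word-based
    # estimate. t_chunk[(bc, be)] is the chunk table from EM training;
    # t_word[(c, e)] is the word table from the word-based model.
    chunk_p = t_chunk.get((bc, be), 0.0)
    c_words, e_words = bc.split(), be.split()
    word_p = 1.0
    for c in c_words:  # IBM-model-1-style word mixture for the chunk pair
        word_p *= sum(t_word.get((c, e), 0.0) for e in e_words) / max(len(e_words), 1)
    return lam * chunk_p + (1.0 - lam) * word_p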

In our system, a dynamic programming algorithm is used as the decoding method, in the same way as the fast stack decoder[11].

5. EXPERIMENTS AND DISCUSSION

5.1. Experiment of the co-chunk identification

In this section, some results of the automatic identification system for the co-chunks are presented. A corpus of 66061 sentence pairs is used to train the parameters. The closed test set includes 2487 sentences, and the open test set includes 845 sentences. The precision and the recall are defined as

$\text{Precision} = \dfrac{N_{right}}{N_{result}}, \qquad \text{Recall} = \dfrac{N_{right}}{N_{answer}}$    (15)

where N_result is the number of co-chunks in the identification result, N_answer is the number of co-chunks in the answers, and N_right is the number of correctly identified co-chunks.
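
A minimal sketch of this computation, assuming co-chunks are compared as exact matches (the matching criterion is an illustrative assumption):

def precision_recall(result, answer):
    # Equation (15): count the co-chunks shared by the identification
    # result and the answer set.
    n_right = len(set(result) & set(answer))
    precision = n_right / len(result) if result else 0.0
    recall = n_right / len(answer) if answer else 0.0
    return precision, recall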

Table 1 Results of the co-chunk identification

Test set / Closed Test / Open Test
Precision (%) / 83.86 / 81.20
Recall (%) / 84.65 / 81.19

Table 1 shows that the automatic identification method can deal with the parallel corpus effectively. The following are some analyses and suggestions for improving the performance.

1)  Precision and recall both reach about 84% in the closed test. In the open test the performance degrades to about 81%, which is still attractive for machine translation.

2)  Most errors are caused by mapping errors between Chinese chunks and English chunks.

3)  The probability parameters are not accurate enough, due to the sparse training data we used. This is another source of mapping errors.

4)  The error rate can be reduced if more training data are employed.

Table 2 Examples of the identification results

Examples
麻烦 您 (4)|| 把 预约 (3)|| 推迟 (2)|| 到 三 天 后 (1)|| 。
please (4)|| postpone (2)|| my reservation (3)|| for three days (1)|| .
预定 (10)|| 是 (9)|| 住 (8)|| 两 个 晚上 (7)|| , (6)|| 但 (5)|| 想 (4)|| 改为 (3)|| 住 (2)|| 三 个 晚上 (1)|| 。
I (4)|| had (8)|| a reservation (10)|| for (2)|| two nights (7)|| , (6)|| but (5)|| please (-1)|| change (3)|| it (9)|| to three nights (1)|| .
我 (7)|| 今天 (6)|| 订 了 房间 (5)|| 但是 (4)|| 突然 (3)|| 有 了 (2)|| 急事 (1)|| 。
I (7)|| have a reservation (5)|| for tonight (6)|| but (4)|| due to (2)|| urgent business (1)|| I am unable (3)|| to make it (-1)|| .

Three examples of identification results are laid out in Table 2. The numbers in the table index the co-chunks in each sentence pair: aligned source and target sub-chunks share the same index, and (-1) marks a sub-chunk with a null correspondence.

5.2. Experiment of the co-chunk-based translation

These experiments are carried out on a Chinese-English parallel corpus. The corpus consists of spontaneous utterances from hotel reservation dialogs. Although this is a limited-domain task, it is difficult for several reasons: first, the syntactic structures of the sentences are less restricted and highly variable; second, it covers a lot of spontaneous speech characteristics, such as hesitations, repetitions and corrections. A summary of the corpus is given in Table 3.

Table 3 Training corpus

Chinese / English
Sentences / 2655
Vocabulary Size / 1237 / 932
Chunk List Size / 2785 / 1775

The system is tested on a test set of 1000 sentences and evaluated by both subjective judgment and an automatic evaluation algorithm.

1)  Subjective judgment. The performance measure for the subjective judgment indicates the closeness of the output to the original, with four grades: (A) All contents of the source sentence are conveyed perfectly. (B) The contents of the source sentence are generally conveyed, but some unimportant details are missing or awkwardly translated. (C) The contents are not adequately conveyed; some important expressions are missing and the meaning of the output is not clear. (D) Unacceptable translation, or no translation is given.

2)  Automatic evaluation. An automatic evaluation approach is employed to measure the output quality of the spoken-language translation. Equation 16 describes its final score; the details are given in reference [12].

(16)

Table 4 shows the results of the evaluation. From it we can see:

1)  The co-chunk-based model outperforms the word-based alignment model significantly.

2)  In spoken language, the processing unit for humans may be chunks rather than words.

3)  By formalizing the co-chunk definition, it is possible to find a better balance point between statistical and rule-based methods.

Table 4 Results of the co-chunk-based translation

Translation model / Automatic evaluation / A (%) / B (%) / C (%) / D (%)
word-based / 0.589 / 29.2 / 22.9 / 33.3 / 14.6
Co-chunk-based / 0.794 / 66.7 / 22.9 / 10.4 / 0.01

Three examples from the experiments are laid out as follows. Here, <c> is the Chinese sentence; <tw> is the translation result of the word-based system; and <tb> is the translation result of the co-chunk-based system.

Exp1: <c> 靠 河边 风景 漂亮 的 房间 有没有 ?

<tw> any of the river from the room good view

<tb> are there any rooms with a good view of the river ?

Exp2: <c> 没有 收到 日本 佐藤 来 的 房间 预约 吗 ?

<tw> [Fail. No translation]

<tb> in the name of Sato from Japan ?

Exp3: <c> 只要 带有 淋浴 的 房间 都 行 。

<tw> is that all the rooms with a shower .

<tb> all the rooms with a shower will be fine .