The Russian Speaker-Independent Continuous Speech Decoder

Alexander A. Kibkalo, Ignat G. Rogozhkin

VNIIEF-STL, Russia, 607190, Sarov, Nizhegorodsky reg., Zernova 53

Abstract. In this article we describe a large-vocabulary Russian continuous speech decoder based on the SDT software package. The decoder shows state-of-the-art accuracy at real-time speed. Novel techniques for dictionary size reduction and statistical language modeling are used. Results of continuous Russian speech decoding with the traditional and the proposed approaches are presented. The proposed approach is applicable to all Slavic languages.

INTRODUCTION

A phonetic dictionary is an important part of any speech decoder. In the simplest case it is a list of all recognizable words written in a phonetic alphabet (Fig. 1). A decoder usually uses a more compact form of the phonetic dictionary, such as a prefix tree (Fig. 2) [1]. The nodes of this tree contain monophones (the phones of the phonetic alphabet) or triphones (the same phones with left and right context).

FIGURE 1. Phonetic dictionary fragment.

FIGURE 2. Triphone tree representation of the phonetic dictionary fragment.

In practice, acoustic models built with triphones give better recognition quality than acoustic models built with monophones, so most modern continuous speech decoders use triphones for acoustic modeling, and the triphone tree is the preferable form of the phonetic dictionary. A monophone tree can also serve as the phonetic dictionary with triphone acoustic models, but additional time is then required to extract the context from the parent and child nodes of the tree.

For example, for the classic Wall Street Journal 1992-1993 task [2] the phonetic dictionary contains about 23,700 word pronunciations written in the phonetic alphabet. In this form the dictionary contains about 175,000 phonetic symbols (including the fake end-of-word symbols). The triphone tree for this dictionary contains about 120,000 nodes and the monophone tree about 85,000 nodes; the triphone tree takes about 30 MB of RAM.
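To make these numbers concrete, the following minimal sketch (in Python; the names and the toy dictionary fragment are illustrative, not part of SDT) builds a monophone prefix tree from a list of pronunciations and counts its nodes. Shared prefixes are stored once, which is why the tree has far fewer nodes than the dictionary has phonetic symbols; a triphone tree is obtained by the same construction after each phone is expanded with its left and right context.

```python
# A minimal sketch (illustrative names, toy pronunciations) of building a
# monophone prefix tree from a phonetic dictionary and counting its nodes.

class Node:
    def __init__(self, phone):
        self.phone = phone          # monophone (or triphone) label
        self.children = {}          # phone label -> Node
        self.word = None            # set in a leaf (WordEnd node)

def build_prefix_tree(dictionary):
    """dictionary: iterable of (word, [phone, phone, ...]) pairs."""
    root = Node(phone=None)
    for word, phones in dictionary:
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, Node(phone))
        node.word = word            # mark the WordEnd node
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

# Toy fragment in the spirit of Fig. 1 and Fig. 2:
dictionary = [
    ("act",   ["ae", "k", "t"]),
    ("acts",  ["ae", "k", "t", "s"]),
    ("acted", ["ae", "k", "t", "ih", "d"]),
    ("actor", ["ae", "k", "t", "er"]),
]
tree = build_prefix_tree(dictionary)
print(count_nodes(tree) - 1)   # 7 nodes for 16 phonetic symbols: the shared
                               # prefix "ae k t" is stored only once
```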

The same dictionary for the Russian language would have more than 700,000 nodes, because an average Russian lemma has about 40 wordforms. At the same per-node size as the WSJ tree, such a tree would take approximately 180 MB of RAM. This memory is accessed very frequently during decoding, so a Russian decoder with the traditional tree as the phonetic dictionary form would run very slowly on a personal computer.

THE RUSSIAN PHONETIC DICTIONARY REPRESENTATION

In Russian, most words have many grammatical forms, and the grammatical form matters because many distinct wordforms share the same spelling and pronunciation. The novel technique is therefore based on selecting the grammatical forms of words: a new prefix tree is built only for lemmas, and all the endings of the grammatical forms are represented as separate prefix trees (ending subtrees) (Fig. 3). The leaf nodes of the lemma prefix tree refer to the appropriate ending subtrees. Russian word formation follows a limited set of widely used ending paradigms, so this technique reduces the dictionary size and makes it comparable to the corresponding English dictionary.

FIGURE 3. The WSJ phonetic dictionary fragment for the lemma “act” and its ending paradigm.
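The sketch below (Python, illustrative names and toy pronunciations only) shows the two-level structure suggested by Fig. 3: the lemma tree stores only stem pronunciations, and the WordEnd node of a lemma refers to an ending subtree that is shared by every lemma following the same paradigm.

```python
# Sketch of the lemma-tree / ending-subtree organization (hypothetical names).
# Each ending paradigm's prefix tree is stored once and referenced from the
# WordEnd nodes of every lemma that follows that paradigm.

class Node:
    def __init__(self, phone):
        self.phone = phone          # phone label (None for a subtree root)
        self.children = {}          # phone label -> Node
        self.word = None            # lemma spelling (lemma tree) or ending
                                    # spelling (ending subtrees)
        self.paradigm = None        # lemma WordEnd nodes: ending-subtree reference

def build_subtree(endings):
    """endings: iterable of (ending spelling, [phones]) for one paradigm."""
    root = Node(phone=None)
    for spelling, phones in endings:
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, Node(phone))
        node.word = spelling
    return root

# One shared paradigm in the spirit of Fig. 3 (toy pronunciations):
act_paradigm = build_subtree([
    ("",    []),                   # bare lemma, empty ending
    ("s",   ["s"]),
    ("ed",  ["ih", "d"]),
    ("ing", ["ih", "ng"]),
    ("or",  ["er"]),
])

# Lemma tree: only the stem "act" is stored; its leaf refers to the paradigm.
lemma_root = Node(phone=None)
node = lemma_root
for phone in ["ae", "k", "t"]:
    node = node.children.setdefault(phone, Node(phone))
node.word = "act"
node.paradigm = act_paradigm        # any lemma with this paradigm reuses it
```

Because a paradigm subtree is stored once and merely referenced, adding another lemma with the same paradigm costs only the nodes of its stem.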

Phonetic dictionary construction is a complicated task in itself and lies outside the scope of this article [3]. As an illustration of the gain, this technique reduces the size of the prefix tree for the Russian dictionary based on A. A. Zaliznyak's grammar dictionary by more than a factor of five.

This phonetic dictionary representation was tested on the WSJ task. The ending subtrees for the WSJ dictionary words were selected somewhat arbitrarily, mainly by spelling similarity. As a result, 49 different ending subtrees were selected for spelling, which entailed the construction of 100 different ending subtrees for pronunciation.

After the dictionary restructuring the number of entries decreased by 35% and the size of the triphone prefix tree by 28%. The average number of endings in a subtree was approximately 5.5. This reduction technique should be much more efficient for the Russian dictionary, because Russian lemmas have from 10 to 50 endings.
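As a rough illustration of where such savings come from, the following sketch (a spelling-level toy, not the authors' selection procedure, which operates on pronunciations and grammatical paradigms) splits words into stems and a fixed list of endings and compares prefix-tree node counts before and after restructuring.

```python
# A rough spelling-level sketch (not the procedure used in the paper) of
# splitting a flat word list into stems plus shared endings and measuring
# how much the prefix-tree node count shrinks.

def split_by_endings(words, endings):
    """Group words as stem -> set of endings drawn from a fixed list."""
    stems = {}
    for w in sorted(words):
        for e in sorted(endings, key=len, reverse=True):
            if w.endswith(e) and len(w) > len(e):
                stems.setdefault(w[: -len(e)], set()).add(e)
                break
        else:
            stems.setdefault(w, set()).add("")   # the word itself is a stem
    return stems

def tree_nodes(strings):
    """Number of nodes in a letter-level prefix tree over the given strings."""
    return len({s[:i] for s in strings for i in range(1, len(s) + 1)})

words = ["act", "acts", "acted", "acting", "actor",
         "ask", "asks", "asked", "asking"]
endings = ["s", "ed", "ing", "or"]

flat = tree_nodes(words)
stems = split_by_endings(words, endings)
restructured = tree_nodes(stems.keys()) + tree_nodes(set().union(*stems.values()) - {""})
print(flat, "->", restructured)   # 19 -> 13: fewer nodes once endings are shared
```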

THE MODEL OF SPEECH DECODER

The decoding process is a search for the best sequence of triphones; with the phonetic dictionary represented as a triphone tree, it becomes a search for the best path in this tree. The most popular method for this search is token passing [4] through the tree. Each path corresponds to a token located in the node that ends the path; the token carries all the information about the path, including its probability. When a path is continued, the corresponding token moves to the next available nodes and the path probability is multiplied by the transition and observation probabilities. If more than one token is located in a node, the tokens with lower probability can be pruned. When a token reaches a leaf node of the tree (a WordEnd node), the corresponding word is added to the path and the path probability is multiplied by the language model probability. After that the token is propagated back to the tree root and the decoding process continues.
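A minimal sketch of one frame of this procedure is given below (Python; the Token class, the acoustic_logp and lm_logp callbacks and the beam value are simplifications for illustration, not the SDT interfaces).

```python
import math

# One frame of token passing over the prefix tree (simplified sketch).

class Token:
    def __init__(self, node, logp=0.0, words=()):
        self.node = node            # current tree node
        self.logp = logp            # accumulated log probability of the path
        self.words = words          # words recovered along the path so far

def propagate_frame(tokens, root, acoustic_logp, lm_logp, beam=200.0):
    """Advance every token by one frame.

    acoustic_logp(node): log transition + observation score for this frame;
    lm_logp(history, word): language-model log probability of the next word.
    Self-loops and lattice bookkeeping are omitted for brevity.
    """
    best_in_node = {}
    for tok in tokens:
        for child in tok.node.children.values():
            t = Token(child, tok.logp + acoustic_logp(child), tok.words)
            if child.word is not None:                  # WordEnd node reached
                t.logp += lm_logp(t.words, child.word)  # apply the language model
                t.words = t.words + (child.word,)
                t.node = root                           # propagate back to the root
            key = id(t.node)
            # keep only the best token in each node (recombination / pruning)
            if key not in best_in_node or t.logp > best_in_node[key].logp:
                best_in_node[key] = t
    survivors = list(best_in_node.values())
    best = max((t.logp for t in survivors), default=-math.inf)
    return [t for t in survivors if t.logp >= best - beam]   # beam pruning
```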

The decoding process becomes more complex when the new technique is used. In the original decoder the within-word method of token propagation was used: the context of a token was completely determined within the prefix tree representing the whole phonetic dictionary. With the ending-selection technique there is a main tree of lemmas whose leaf nodes refer to ending subtrees, so it is not always possible to determine the whole phonetic context within the main tree or within an ending subtree. When a token is near a WE-node of the main tree or at the first level of an ending subtree, the phonetic context has to be determined dynamically: for the last phone of a lemma, each token is split according to the different first phones of the referenced ending subtrees (i.e. the possible right contexts), and for the first phone of an ending subtree the left context is chosen according to the previously passed node. Thus the cross-word method of token propagation is required for the transition between the main tree and the ending subtrees, which requires storing some additional information in a token.

The use of the language model was changed as well. Two statistical language models are used with the new technique: the first for lemmas and the second for all wordforms. The first language model is applied when a token reaches a WE-node of the main tree, and the second after it reaches a WE-node of an ending subtree. This dictionary organization allows the lemma language model to be used for look-ahead: in the WE-node of the main tree the look-ahead probability is replaced by the full language model probability.
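The sketch below outlines how the word-end handling changes under the new organization (hypothetical helper functions building on the structures sketched earlier; the real SDT decoder differs in detail): at a lemma WE-node the look-ahead score is replaced by the full lemma-LM probability and the token is split per possible right context, while the wordform LM is applied only at a WE-node of an ending subtree.

```python
import copy

# Sketch of token handling at the lemma / ending-subtree boundary
# (hypothetical helper functions; the real SDT decoder differs in detail).

def leave_lemma(tok, lemma_node, lemma_lm, lookahead_score):
    """Called when a token reaches a WordEnd (WE) node of the lemma tree."""
    # Replace the approximate look-ahead score accumulated along the lemma
    # by the full lemma language-model probability.
    tok.logp += lemma_lm(tok.words, lemma_node.word) - lookahead_score
    tok.lemma = lemma_node.word
    # Split the token: one copy per first phone of the referenced ending
    # subtree, so that the right context of the lemma's last phone (and the
    # left context of the ending's first phone) becomes known -- the
    # cross-word style of token propagation described above.
    copies = []
    for first_phone, child in lemma_node.paradigm.children.items():
        t = copy.copy(tok)
        t.node = child
        t.left_context = lemma_node.phone   # context carried across the boundary
        copies.append(t)
    return copies

def leave_ending(tok, ending_node, wordform_lm, lemma_root):
    """Called when a token reaches a WE node of an ending subtree."""
    word = tok.lemma + ending_node.word       # e.g. "act" + "s" -> "acts"
    tok.logp += wordform_lm(tok.words, word)  # second (wordform) language model
    tok.words = tok.words + (word,)
    tok.node = lemma_root                     # back to the root of the lemma tree
    return tok
```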

TESTING RESULTS

The continuous speech decoder with the new phonetic dictionary representation was created on the basis of the SDT software package [5]. Comparative results of the new and the original decoders on the classic English WSJ 1992-1993 task are presented in Table 1. A computer with a Pentium 4 processor (1400 MHz, 512 MB RAM) was used.


TABLE 1. Comparative results of SDT-based speech decoders.

Decoder type          Word error rate    Speed, ×RT
Original              10.2%              1.1
With new technique    10.2%              1.22

These results show that the novel technique is promising for decoding any language with a large number of wordforms.

CONCLUSION

The new method of phonetic dictionary organization, as a prefix tree of lemmas with ending subtrees, greatly reduces the size of the internal dictionary representation in a continuous speech decoder. It resolves one of the main problems in creating a high-quality Russian speech decoder and makes it possible to use a huge dictionary (about one million words) in real time.

This technique was successfully modeled for the English dictionary on the Wall Street Journal task. The testing shows an insignificant slowdown of recognition (about 10%) with the same recognition quality. The proposed phonetic dictionary representation also helps to use a low-perplexity statistical language model of Russian efficiently [6].

ACKNOWLEDGEMENTS

This work was funded by the VNIIEF-STL company.

REFERENCES

[1] Zhao, Q., Lin, Z., Yuan, B. and Yan, Y., “Improvements in Search Algorithm for Large Vocabulary Continuous Speech Recognition”, Proceedings of ICSLP’2000, Beijing, 2000.

[2] Paul, D.B. and Baker, J.M., “The Design of the Wall Street Journal-based CSR Corpus”, Proceedings of ICSLP’92, 1992.

[3] Kibkalo, A.A. and Turovets, A.A., “Basic Pronouncing Dictionary for Russian Speech Decoding”, Proceedings of the Workshop SPECOM’2003, Moscow, 2003.

[4] Young, S.J., Russell, N.H. and Thornton, J.H.S., “Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems”, Technical Report CUED/F-INFENG/TR38, Cambridge University Engineering Dept., 1989.

[5] Barannikov, V.A. and Kibkalo, A.A., “SDT 4.0 – Software Package for Continuous Speech Recognition”, Proceedings of the Workshop SPECOM’2003, Moscow, 2003.

[6] Kholodenko, A.B., “O postroenii statisticheskikh yazykovykh modelei dlya sistem raspoznavaniya slitnoi russkoi rechi” (On constructing statistical language models for continuous Russian speech recognition systems, in Russian), Intellektual’nye Sistemy, vol. 6, 1-4, MSU, Moscow, 2001.
