July 2007
Linguistics in Thai Speech Processing Technology
Chai Wutiwiwatchai
Human Language Technology, NECTEC
Abstract
Speech processing technology has become an expected innovation in modern telecommunication. It spans a variety of applications useful in human-machine interaction as well as human-human communication, and it draws on several fields, mainly linguistics, science, and engineering. With the origin of this technology in mind, this article reviews the role of linguistics in Thai speech processing across three periods: the starting period, in which speech was analyzed from a linguistic point of view; the second period, in which basic research was extended to advanced computational-linguistics research; and the last period, in which improving advanced systems has become much more difficult and more extensive linguistic theory and knowledge are required. Although speech processing comprises a number of sub-technologies, this article focuses on three major issues: speech analysis, recognition, and synthesis.
1. Introduction
In past decades, speech processing technology has been widely accepted as a potential breakthrough in human-machine and human-human interaction. The technology aims to process human speech in several areas, including the analysis of fundamental characteristics of speech, specific properties of speech in different languages, speech coding and compression, speech synthesis and recognition, and speech quality enhancement, as well as advanced and related topics such as spoken language understanding, spoken language generation, and conversational systems. Spoken language is the most basic means of human communication, as seen in the many parts of the world where only spoken language, and no script, is used. Research in this area not only makes communication more convenient, but also supports the understanding of socio-geography and the preservation of minority languages.
Speech processing technology is a multidisciplinary field that requires knowledge from several areas, including linguistics and phonetics, science, and engineering. Literature reviews on the technology can be found in [1, 2]. Early research began in the linguistics field, where fundamental characteristics of speech and spoken language were studied. Such basic research has driven the engineering development of practical applications in the past few decades. Though engineering is an important component of research and development, linguistic knowledge is often indispensable. This article reviews the use of linguistic knowledge in the research and development of Thai speech processing technology, divided into three time periods that the rest of this article discusses in detail.
- The first period was the era in which basic characteristics of human speech were studied from a linguistic point of view. Details are given in Section 2.
- In the second period, as computers became more powerful, research was extended to practical advanced systems. Details are given in Section 3.
- In the present, third period, the performance of advanced systems has become saturated due to their technical limitations. Researchers have returned to exploring more comprehensive linguistic knowledge that has the potential to improve these systems. More details are given in Section 4.
2. Linguistics: The Origin of Thai Speech Processing Technology
Phonetics is the branch of linguistics in which speech production is systematically studied. In 1947, the spectrogram was invented to display the frequency content of a speech signal over time. It was known in the linguistics community that the spectrogram can represent characteristics of both speech and speaker. In 1980, a published paper [3] demonstrated the usefulness of the spectrogram empirically by asking linguists to learn to match spectrograms visually with their phoneme transcriptions. After training on over 3,000 hours of speech, transcription by looking at spectrograms alone achieved 80% agreement among different linguists. The experiment confirmed the possibility of automatic speech recognition by machines.
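The idea behind the spectrogram can be sketched as a short-time Fourier analysis: the signal is cut into overlapping frames, each frame is windowed, and the magnitude of its discrete Fourier transform forms one column of the time-frequency picture. The following is a minimal Python sketch, not the procedure used in [3]; the frame length, hop size, Hann window, and test tone are all assumptions chosen for illustration, and a naive DFT is used for clarity rather than speed.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of magnitude spectra, one per frame (naive DFT)."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input (assumed): a pure tone that falls exactly on DFT bin 8,
# since bin = tone_hz * frame_len / sample_rate = 1000 * 64 / 8000.
sample_rate = 8000
tone_hz = 1000.0
tone = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(256)]
spec = spectrogram(tone)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spec]
```

For real speech, the columns would show formant bands and harmonics of F0 instead of a single peak, which is the kind of visual pattern the linguists in the experiment learned to read.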
In the early stage of Thai speech technology, linguistic and phonetic theories played important roles in analyses of Thai speech properties, for example the characteristics of tone, filled pauses, and stress. Research was often conducted on a small set of speech samples collected for the particular feature being studied. Tone, a prominent characteristic of Thai, has been analyzed since the early 20th century, beginning with Bradley [4]. He presented fundamental frequency (F0) curves over time for the five Thai tones, as shown in Figure 1. The analysis of Thai tone has subsequently continued in the field of linguistics [5, 6]. Most of this work attempted to analyze tonal coarticulation over syllable sequences.
Figure 1: F0 curves in Thai tonal syllables.
The behavior of pauses in Thai speech has been a popular topic of linguistics studies since 1977 [7]. Luksaneeyanawin [8] analyzed pauses in Thai phrases by considering several influential factors such as phonetic contexts. In the study, it was found that approximately 24% of pauses were not detected as an acoustic period of silence and the number of written spaces in Thai script did not correspond to the number of pauses.
3. Computational Linguistics: Getting Closer to Real-World Speech
Although the basic research described in the previous section was mostly performed on limited sets of speech samples, it served as the starting point of modern Thai speech processing technology once computers became more powerful. Since the late 1980s, Thai speech processing research has grown. Most research was still initiated by the linguistics community and comprised analyses of speech characteristics using larger amounts of data, text-to-speech synthesis, speech recognition, and advanced applications combining speech processing systems with other levels of linguistics such as syntax and semantics. This section explains the role of linguistics in the research and development of this period in three parts: acoustic aspects, linguistic aspects, and other related aspects.
3.1 Acoustic Aspects
From the viewpoint of the speech signal, acoustic-phonetic theory is crucial for analyzing not only general but also language-specific speech. Analyzed features can be divided into two major groups: segmental and suprasegmental features. Linguistic knowledge was employed in both groups, as described below.
3.1.1 Segmentals
Phonetics studies the relationship between the speech production mechanism and phoneme units. Figure 2 summarizes all Thai phonemes using the International Phonetic Alphabet (IPA) [9]. Luksaneeyanawin [1] has published a comprehensive description of the Thai sound system. Thai sounds are often described at the syllable level in the form /Ci-V-CfT/ or /Ci-VT/, where Ci, V, Cf, and T denote an initial consonant, a vowel, a final consonant, and a tone, respectively. The Ci can be either a single consonant or a consonant cluster, whereas the V can be either a single vowel or a diphthong. Following this syllable structure, the Thai phone units are listed in Table 1. The five Thai tones can be divided into two groups: the static group consists of three tones, the high / /, the middle / /, and the low / /; the dynamic group consists of two tones, the rising / / and the falling / /. Figure 1 compares the F0 contours of the five tones. Recently, some loan-words that do not conform to the rules of native Thai phonology, with sounds such as the initial consonants /, , , , / and the final consonants /, , , /, have begun to appear.
A body of research has exploited the Thai phone inventory based on the syllable structure in Table 1 [10, 11]. Kanokphara [10] found that such a syllable-structure-based unit inventory was the most efficient for Thai speech recognition compared with other variations of the phoneme inventory, such as the use of all single phonemes defined by the IPA. In similar work, Maneenoi [12] compared a set of onset-rhyme units (a rhyme consists of a V and a Cf) with the basic phoneme units and found that the onset-rhyme scheme was superior. Another advantage of the onset-rhyme inventory is that syllable tones are mostly expressed in the rhyme portion, so only the rhyme units need to be considered when incorporating tone features.
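The difference between a phoneme inventory and an onset-rhyme inventory can be made concrete with a small data structure for the /Ci-V-CfT/ syllable. This is only an illustrative sketch: the romanized labels, the tone numbering 0 to 4, and the `rhyme_tone` naming convention are assumptions, not the unit names used in [10] or [12].

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Syllable:
    ci: str            # initial consonant (single or cluster)
    v: str             # vowel (single vowel or diphthong)
    cf: Optional[str]  # final consonant, absent in /Ci-VT/ syllables
    t: int             # tone label; the 0-4 numbering is an assumption


def phoneme_units(s: Syllable) -> Tuple[str, ...]:
    """Phoneme-level units: Ci, V and (if present) Cf."""
    return tuple(u for u in (s.ci, s.v, s.cf) if u is not None)


def onset_rhyme_units(s: Syllable, with_tone: bool = True) -> Tuple[str, str]:
    """Onset = Ci; rhyme = V plus optional Cf, optionally tagged with tone."""
    rhyme = s.v + (s.cf or "")
    if with_tone:
        # Tone is attached to the rhyme only, so onsets are shared across tones.
        rhyme = f"{rhyme}_{s.t}"
    return (s.ci, rhyme)


# Hypothetical romanized syllables, used only to exercise the functions.
syl = Syllable(ci="kr", v="aa", cf="p", t=1)
open_syl = Syllable(ci="m", v="aa", cf=None, t=0)
```

Here `onset_rhyme_units(syl)` yields the pair `("kr", "aap_1")`, while `phoneme_units(syl)` yields three separate units. Attaching the tone label to the rhyme alone reflects the observation above that tone is mostly realized in the rhyme portion.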
Figure 2: Thai phonemes illustrated in IPA tables.
Table 1: Thai phone units defined according to the syllable structure.
Initial consonant, Ci / Single consonant / , , , , , , , , , , , , , , , , , , ,
Initial consonant, Ci / Cluster consonant / , , , , , , , , , , ,
Vowel, V / Short vowel / , , , , , , , , , , ,
Vowel, V / Long vowel / , , , , , , , , , , ,
Final consonant, Cf / , , , , , , , ,
Tone, T / , , , ,
During the early 2000s, a series of research papers on Thai vowel modeling was published by researchers at Chulalongkorn University [13, 14]. Basic features such as formant frequencies were extracted and grouped according to vowels. This research provided the first analysis of Thai phonemes using a large real-speech corpus, not just a small set of sample utterances.
Optimizing acoustic models under conditions where training data is sparse has also been studied. Because only a small Thai speech corpus was available, Kanokphara et al. [15] introduced heuristic rules into an automatic dictionary generation process. With these rules, some dominant pronunciation variations of Thai, such as the ambiguity of the two phonemes // and // and the shortening of some long vowels, could be captured from the small corpus.
For Thai speech synthesis, the phone inventory defined in Table 1 was also employed. Generally, the state-of-the-art approach to waveform synthesis in TTS is unit concatenation. Speech units are selected from either a set of certain units, such as diphones and demisyllables, or directly from a large speech corpus. An articulatory synthesis method was proposed for synthesizing Thai vowels and tones [16], in which an articulatory model was constructed based on characteristics of articulated vocal-tract and sound properties such as F0. A Thai TTS system based on formant synthesis was proposed by Saiyot et al. [17]. The algorithm generated a spectrum of each phoneme using a set of rules indicating formant frequencies and magnitudes.
3.1.2 Suprasegmentals
Suprasegmentals have attracted interest in the field of linguistics since the first period. Most work has emphasized various prosodic phenomena in Thai speech, especially Thai tone. In this period of computational linguistics, studies of Thai prosody became a major topic, as they were a prerequisite for developing several advanced applications, for example the use of tonal features in state-of-the-art automatic speech recognition (ASR) and duration modeling for Thai text-to-speech synthesis (TTS). This subsection summarizes some prominent research conducted since 1990.
Tone recognition in Thai has been extensively researched at Chulalongkorn University. Tungthangthum [18] extracted pitch information from Thai monosyllables using autocorrelation and used hidden Markov models (HMMs) for tone classification. Based on the analysis by Gandour et al. [19], Thubthong et al. [20] introduced a novel feature called half-tone modeling, which extended an analysis frame to cover half of the neighboring frames. The model achieved promising results. Potisuk et al. [21] took an analysis-synthesis approach to tone recognition in continuous speech. The method exploited an extension of Fujisaki’s model to synthesize F0 contours of possible tone sequences of a given speech signal.
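Autocorrelation-based pitch extraction, the front end mentioned above, can be sketched as follows. This is a generic textbook formulation and not necessarily the exact procedure of [18]; the search range of 75 to 400 Hz and the synthetic test frame are assumptions. The resulting F0 values, computed frame by frame, would form the contour passed to a tone classifier such as an HMM, which is not shown here.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate F0 of one voiced frame by picking the autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # No positive peak (e.g. silence) means no pitch estimate.
    return sample_rate / best_lag if best_lag else None


# Synthetic voiced frame (assumed): a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
```

A practical extractor would add voicing detection and protection against octave errors; the sketch only shows the core peak-picking step.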
Over the years, there have been many attempts to utilize tones in ASR for Thai as well as for other tonal languages, such as Chinese. A paper by Demeechai and Makelainen [22] concluded that tones could be used in state-of-the-art continuous speech recognition (CSR) in three ways: (1) adding a set of tone features to the speech feature vectors; (2) conducting two-pass processing, in which non-tonal phoneme recognition and tone recognition are performed separately and the tone result is used to rescore the N-best phoneme results; and (3) building a one-pass decoder that computes phoneme and tone probabilities simultaneously.
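The second of these strategies, two-pass processing with N-best rescoring, can be illustrated with a short sketch. The data layout, the log-linear combination, and the `tone_weight` parameter are assumptions made for illustration and are not taken from [22]; the hypotheses and probabilities are made up.

```python
import math


def rescore_nbest(nbest, tone_log_probs, tone_weight=1.0):
    """Re-rank first-pass hypotheses with second-pass tone scores.

    nbest: list of (syllables, first_pass_log_score), where syllables is a
           list of (base_syllable, tone) pairs implied by the lexicon.
    tone_log_probs: one dict per syllable position mapping tone -> log prob,
           as produced by a separate tone recognizer.
    """
    rescored = []
    for syllables, score in nbest:
        tone_score = sum(tone_log_probs[i][tone]
                         for i, (_, tone) in enumerate(syllables))
        rescored.append((score + tone_weight * tone_score, syllables))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return [syllables for _, syllables in rescored]


# Made-up example: the first pass slightly prefers tone 0 on the first
# syllable, but the tone recognizer strongly prefers tone 4.
nbest = [
    ([("maa", 0), ("suay", 4)], -10.0),
    ([("maa", 4), ("suay", 4)], -10.5),
]
tone_log_probs = [
    {0: math.log(0.1), 4: math.log(0.8)},
    {0: math.log(0.05), 4: math.log(0.9)},
]
reranked = rescore_nbest(nbest, tone_log_probs)
```

After rescoring, the hypothesis with tone 4 on the first syllable moves to the top, because its combined score (about -10.83) exceeds that of the original best hypothesis (about -12.41).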
Intonation is another important suprasegmental feature, and its analysis requires different approaches depending on whether or not the language in question is tonal. Potisuk et al. [23] described the difference between phrasal F0 contours in tonal and non-tonal languages. For non-tonal languages such as Japanese, the phrasal intonation is modified by the F0 realization of local pitch accents, whereas for tonal languages such as Thai, lexical tones modify the overall phrasal F0 contour. Differences in F0 movement between tonal and non-tonal languages were also extensively described by Fujisaki et al. [24], who analyzed F0 contours in Thai utterances using the Analysis-by-Synthesis technique. The F0 contour of a sentence can be decomposed into its constituents, i.e., the phrase components and the accent or local tone components. The phrase F0 contour is estimated using the Fujisaki model, while the accent or local tone components are driven by local commands.
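The decomposition described above can be sketched with the commonly cited form of the Fujisaki model, in which the logarithm of F0 is a baseline plus the responses to phrase commands (impulses) and to accent or tone commands (pedestal functions). The parameter values below (alpha = 2, beta = 20, gamma = 0.9) are typical textbook settings and the commands are made up; none of them come from [24]. Allowing negative tone-command amplitudes is an assumption here, used to show how a local tone could lower F0 as well as raise it.

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent/tone control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, tone_cmds):
    """phrase_cmds: (onset, amplitude); tone_cmds: (onset, offset, amplitude)."""
    contour = []
    for t in times:
        log_f0 = math.log(base_f0)
        log_f0 += sum(a * phrase_response(t - t0) for t0, a in phrase_cmds)
        log_f0 += sum(a * (tone_response(t - t1) - tone_response(t - t2))
                      for t1, t2, a in tone_cmds)
        contour.append(math.exp(log_f0))
    return contour


# Made-up commands: one phrase command, one raising and one lowering tone command.
times = [-0.1, 0.3, 0.7]
phrase_cmds = [(0.0, 0.5)]
tone_cmds = [(0.2, 0.4, 0.3), (0.6, 0.8, -0.3)]
full = f0_contour(times, 120.0, phrase_cmds, tone_cmds)
phrase_only = f0_contour(times, 120.0, phrase_cmds, [])
```

Comparing `full` with `phrase_only` shows the two layers: the phrase component gives a slowly decaying global contour, and the local commands push F0 above or below it within their time spans.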
Stress is another important language-specific suprasegmental feature. For Thai monosyllabic words, all content words are strongly stressed while all grammatical words are weakly stressed unless reinforced by emphasis. Thubthong et al. [25] and Potisuk et al. [26] investigated the analysis and modeling of stress in Thai words and phrases. Prosodic features related to F0, duration, and energy were calculated and input to a classifier such as an artificial neural network (ANN) or a Bayesian classifier.
Duration and phrasal pauses are two features important for synthesizing natural-sounding speech. Mittrapiyanurak et al. [27] constructed a set of rules to predict syllable duration for their TTS engine. Following this research, when a well-designed speech corpus for speech synthesis was developed a few years later [28], a comprehensive analysis of phoneme duration in Thai continuous speech was carried out. The analysis results were used to build a duration prediction model based on multiple linear regression for the TTS engine. The model computed syllable duration by considering several factors such as phoneme identities, syllable tone, word POS, and positions in the word and the phrase.
In the area of phrase-break detection, a rule-based method was implemented in 2000 [27]. Later work [29] used machine learning methods to predict pauses from features extracted from an input sentence, such as word POS and the number of words and syllables.
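A duration model based on multiple linear regression, as mentioned above, can be sketched in a few lines. The three features, the training durations in milliseconds, and the resulting weights are invented for illustration and do not reproduce the model built from the corpus in [28]; categorical factors such as tone or POS would in practice be encoded as indicator variables in the same way as the two binary features here.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    xs = [[1.0] + list(r) for r in rows]  # prepend the intercept term
    k = len(xs[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(x[i] * x[j] for x in xs) for j in range(k)] +
         [sum(x[i] * y for x, y in zip(xs, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(weights, features):
    """Predicted duration = intercept + weighted sum of the features."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


# Hypothetical features: (has_long_vowel, is_phrase_final, n_phonemes).
rows = [(0, 0, 2), (1, 0, 2), (0, 1, 3), (1, 1, 3),
        (0, 0, 3), (1, 0, 3), (0, 1, 2)]
# Made-up syllable durations in ms, generated as 100 + 80*long + 60*final + 20*n.
durations = [140, 220, 220, 300, 160, 240, 200]
weights = fit_linear_regression(rows, durations)
```

Because the toy durations are exactly linear in the features, the fit recovers the generating weights, and `predict(weights, (1, 1, 2))` gives about 280 ms for an unseen feature combination. Real duration data would leave residual error that the regression only minimizes.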
3.2 Linguistic Aspects
This subsection reviews the use of linguistics in text processing, which is another necessary task in speech processing research. Text processing tasks include sentence breaking, morphological analysis, phonological analysis, and language modeling.
Since Thai has no explicit sentence markers, sentence boundaries must be detected. Conventionally in Thai writing, a space is placed at the end of a sentence. However, a space does not always indicate a sentence boundary. Mittrapiyanurak et al. [30] and Charoenpornsawat [31] presented algorithms that extract sentences from a paragraph by detecting sentence-break spaces. The algorithms exploited word part-of-speech (POS) n-gram models trained on a POS-tagged corpus.
Morphological analysis is the process of segmenting words into sub-word units called morphemes. A morpheme is the shortest textual chunk that still has meaning. Morphological analysis not only helps in finding the pronunciation of a word in the TTS process, but also serves as a necessary step for reducing the number of lexical words in ASR. For Thai, one basic algorithm devised for word segmentation uses a dictionary with some heuristic rules [32]. Statistical models such as part-of-speech (POS) n-gram and word n-gram models have also been explored [33, 34]. These techniques worked fairly well when an input sentence contained only words appearing in the dictionary. However, a significant problem for Thai word segmentation arises when sentences contain unknown words, such as named entities and loan words written in Thai. Work by Aroonmanakun suggests that segmenting text into a sequence of syllable-like units and combining units that have high collocation strength can also help in these cases [35].
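Dictionary-based segmentation with a heuristic rule can be illustrated with greedy longest matching, one common heuristic for text written without word boundaries. Whether [32] used exactly this rule is not stated here, so it should be read as an assumption; the romanized lexicon and input strings are hypothetical stand-ins for Thai text.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right segmentation; unknown characters become 1-char tokens."""
    max_len = max(map(len, lexicon))
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest dictionary entry that fits at this position first.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in lexicon:
                break
        else:
            candidate = text[pos]  # out-of-vocabulary fallback
        tokens.append(candidate)
        pos += len(candidate)
    return tokens


# Hypothetical romanized lexicon; "taklom" can be read as tak+lom or ta+klom.
lexicon = {"ta", "tak", "klom", "lom"}
tokens = longest_match_segment("taklom", lexicon)
tokens_unknown = longest_match_segment("takxlom", lexicon)
```

The first call returns `["tak", "lom"]` and never considers the alternative reading `ta` + `klom`, which is the kind of ambiguity that statistical models are meant to resolve. The second call returns `["tak", "x", "lom"]`, showing how an unknown string falls apart into single characters, the failure mode on unknown words noted above.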
Phonological analysis is one approach to finding the correct pronunciation of a given text. It is sometimes called letter-to-sound or grapheme-to-phoneme (G2P) conversion. Mapping between letters and sounds in Thai is not a straightforward process, and a mapping table or a pronunciation dictionary alone does not provide sufficient information to resolve the ambiguities that arise. The early method proposed by Meknavin and Kijsirikul [36] was based on complex rules coded by linguists. To resolve ambiguous cases, statistical models such as an n-gram model [37] and a probabilistic generalized LR parser (PGLR) [11] have more recently been used.
Another related area deals with the mixture of Thai and English characters, which is a fairly common phenomenon in contemporary Thai writing. Aroonmanakun et al. [38] set up an experiment that attempted to map English text to Thai phonemes using a machine learning method. The greatest difficulty encountered was predicting syllabic tones in English words. Tones (and accents) expressed in loan-words depend on several factors, such as the number of syllables in each word, and cannot be accurately predicted by simple rules.
3.3 Other Related Aspects
Natural language processing for Thai has been extensively researched at various sites. Most work has focused on written text, with applications such as machine translation (Sornlertlamvanich, 1994) and information retrieval (Kawtrakul et al., 2000). Although the technologies created for natural language parsing can be extended to spoken language, there are only a few reports on spoken Thai processing (Potisuk and Harper, 1996). Despite this scarcity of research on Thai spoken language analysis and processing, several attempts to build advanced spoken language applications have been made, for example spoken dialogue systems (SDS) and speech translation.
Wutiwiwatchai and Furui [39] developed the first complete Thai spoken dialogue system, in the domain of hotel reservation, and investigated a novel spoken language understanding (SLU) component suitable for languages with weak grammars such as Thai. The SLU model consisted of three parts: concept extraction based on weighted finite-state automata; goal identification using a pattern classifier; and concept-value extraction based on simple rules. It can be trained on a partially annotated corpus and hence is expected to be applicable to other languages and dialogue domains as well.
Another important pioneering work, by Schultz et al. [40], was an English-Thai speech-to-speech translation system in the medical diagnosis domain. The paper described a two-way speech-to-speech translation system between Thai and English for dialogues in the limited medical domain, where the English speaker was a doctor and the Thai speaker was a patient. The system consisted of three major parts: a domain-specific ASR engine, an interlingua language translation system, and a speech synthesizer.