Building Speech Synthesis for Bahasa Melayu Text-to-Speech System

Shanmugasilan Anantan (1), Reena Ram Tawar (1), Gunavathy Loganathan (1), Natarajan Sriraam (2)

(1) Undergraduate students, Faculty of Information Technology, Multimedia University, 63100 Cyberjaya, Malaysia

(2) Center for Multimedia Computing, Faculty of Information Technology, Multimedia University, 63100 Cyberjaya, Malaysia

Tel: 603-8312-5405, Fax: 603-8312-5264

Abstract

This paper describes the development of a new text-to-speech (TTS) synthesis system for Bahasa Melayu (BM), the official language of Malaysia. The key technologies of the system are text analysis, linguistic analysis and waveform generation. The first phase involves generating a compact set of BM utterances that attains good phonetic coverage of the language. The second phase involves constructing a speaker-specific database, which starts with recording the speaker's speech, modeling it using a highly efficient speech representation, and segmenting it into phonemes. The results show that, from a perceptual point of view, it is important to assign phrase accents in BM TTS systems, and that the performance of the system comes close to that of manual annotation.

Key-words: Speech Technology, Text-to-Speech System, Speech Synthesis, Bahasa Melayu, Diphone, Speech Database.

1. INTRODUCTION

Recent developments in speech synthesis and recognition, along with rapid advances in computer performance, have placed the focus on speech databases [1],[3]. The quality of the database and, in the case of a labeled speech database, the accuracy of the alignment are often key factors in the quality of the synthesis or recognition [2]. Text-to-speech synthesis is the automatic generation of a speech signal from normal textual input. In recent years, the performance of personal computers has increased drastically, and they now offer multimedia facilities such as audio input/output that make speech output possible in many applications. Although various efforts have been devoted to improving speech synthesis technologies, TTS has not been widely accepted for BM because the generated speech does not yet sound sufficiently natural. Fig. 1 shows the building blocks of the TTS system for Bahasa Melayu [4]. This paper describes the development of a BM text-to-speech (TTS) synthesis system based on the selection and concatenation of speech segments.

Figure 1: Elements of TTS

2. SPEECH SYNTHESIS SYSTEM

The speech synthesis system for BM was developed using the Festival Speech Synthesis System, version 1.4.3 [5]. Festival is open-source software that offers a general framework for building speech synthesis systems and includes examples of various modules. The system runs on Linux Mandrake 9.0. The development machine is a personal computer with a 450 MHz Pentium III processor and 128 MB of RAM.

The text analysis capabilities of the system detect the ends of sentences, perform some rudimentary syntactic analysis, expand digit sequences into words, and disambiguate and expand abbreviations into normally spelled words, which can then be analyzed by the dictionary-based pronunciation module [6],[7],[8]. To develop a new voice in BM, the components that need to be considered are the diphone database, a lexicon and a number of schema files that together make up a complete voice. The steps involved in building the speech synthesis voice are as follows:

  • construct basic template files
  • generate phoneset definition
  • generate diphone schema file and generate prompts
  • record speaker
  • label nonsense words
  • extract pitchmarks and LPC coefficients
  • test phone synthesis
  • add lexicon/LTS support and tokenization
  • add prosody (phrasing, durations and intonation)
  • test and evaluate voice
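As an illustration of the digit-expansion part of the text analysis described before the step list, the following is a minimal Python sketch. It is not Festival's tokenizer and not the rules used in this work; the function names `number_to_words` and `expand_digits`, the 0 to 9999 range and the digit-by-digit reading of longer runs are all assumptions made for the example.

```python
import re

# Hypothetical helper, not part of Festival: spell out integers in Bahasa Melayu.
UNITS = ["kosong", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "lapan", "sembilan"]


def number_to_words(n: int) -> str:
    """Spell out 0..9999; larger values are outside this sketch."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n < 10:
        return UNITS[n]
    words = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, units = divmod(rest, 10)
    if thousands:
        words.append("seribu" if thousands == 1 else UNITS[thousands] + " ribu")
    if hundreds:
        words.append("seratus" if hundreds == 1 else UNITS[hundreds] + " ratus")
    if tens == 1:
        # 10 and 11 are irregular; 12..19 use "<unit> belas".
        if units == 0:
            words.append("sepuluh")
        elif units == 1:
            words.append("sebelas")
        else:
            words.append(UNITS[units] + " belas")
    else:
        if tens:
            words.append(UNITS[tens] + " puluh")
        if units:
            words.append(UNITS[units])
    return " ".join(words)


def expand_digits(text: str) -> str:
    """Replace each digit run with words (assumption: runs over 4 digits are read digit by digit)."""
    def spell(match):
        digits = match.group()
        if len(digits) > 4:
            return " ".join(UNITS[int(d)] for d in digits)
        return number_to_words(int(digits))
    return re.sub(r"[0-9]+", spell, text)
```

For example, `expand_digits("tahun 2003")` yields `"tahun dua ribu tiga"`. A full text-analysis module would also need to handle ordinals, dates, currency and abbreviations, which this sketch leaves out.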

3. DIPHONE DATABASE

The diphone database consists of a list of all possible phone-to-phone transitions in Bahasa Melayu. The idea is to define classes of diphones: vowel-consonant, consonant-vowel, vowel-vowel and consonant-consonant. It also includes syllabic consonants, which may be harder to pronounce in all contexts. Bahasa Melayu has 2551 syllables, including single-letter, two-letter, three-letter and a few four-letter syllables. Fig. 2 shows examples of the possible syllables in Bahasa Melayu.
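The enumeration of phone-to-phone transitions and their grouping into the four diphone classes can be sketched as follows. This is only an illustration: the phone lists `VOWELS` and `CONSONANTS` are a small assumed subset, not the BM phone set used in this work, and the function names are hypothetical. Silence and syllabic consonants are left out.

```python
from itertools import product

# Illustrative subset only; the paper does not list the full BM phone set.
VOWELS = ["a", "e", "i", "o", "u"]
CONSONANTS = ["p", "t", "k", "m", "n", "s", "l"]


def diphone_class(left, right, vowels):
    """Return 'VC', 'CV', 'VV' or 'CC' for a phone-to-phone transition."""
    kind = lambda phone: "V" if phone in vowels else "C"
    return kind(left) + kind(right)


def build_inventory(vowels, consonants):
    """Enumerate every ordered phone pair and group it by diphone class."""
    inventory = {"VC": [], "CV": [], "VV": [], "CC": []}
    for left, right in product(vowels + consonants, repeat=2):
        inventory[diphone_class(left, right, vowels)].append(f"{left}-{right}")
    return inventory
```

With 5 vowels and 7 consonants the inventory holds 12 × 12 = 144 candidate diphones. In practice, transitions that never occur in the language would be pruned from such a list before prompts are generated.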

The recording environment should be reconstructable, and the conditions, such as gain settings and microphone distances, should be as well defined as possible. Every entry in the diphone database is recorded, and each diphone combination is required to match its corresponding Bahasa Melayu sound. To achieve a more natural voice quality, one must take more contexts into account, going beyond diphones [9]. The quality of the synthesized speech is assessed in terms of intelligibility and naturalness of pronunciation [10].

4. RECORDING PROCESS

The objective of recording diphones is to obtain as uniform a set of pronunciations as possible. The prompts were recorded after generating the prompt voices and their label files. The recording environment is reconstructable, so that the conditions can be set up again if needed. Everything, including gain settings and microphone distances, is as well defined as possible.

The distance between the speaker and the microphone is crucial; a head-mounted microphone helps keep it constant. The recording process starts with the prompts being displayed. When (my – 0001 (“p – i” “i – p”) (# t aa pi paa #)) is displayed, the built-in speaker plays the diphone prompt once. Then the speaker records his own voice. Fig. 3 shows an example of the waveform generated from the speaker.

Figure 3: Example of Waveform Generated from the Speaker

5. TESTING OF PHONE SYNTHESIS

A program to move the predicted pitchmarks to the nearest peak in the waveform is provided by:

bin/make_pm_fix pm/*.pm
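The idea behind this correction step can be sketched in Python as follows. This is not the code of `make_pm_fix`; it is a simplified illustration that snaps each predicted mark to the largest sample within a small search window, which approximates "nearest peak" only when the window is short. The function name `snap_pitchmarks` and the `half_window` parameter are assumptions.

```python
def snap_pitchmarks(samples, marks, sample_rate, half_window=16):
    """Move each pitchmark (in seconds) to the largest sample nearby.

    samples     : list of waveform amplitudes
    marks       : predicted pitchmark times in seconds
    sample_rate : samples per second
    half_window : search radius in samples (assumed value, tune per voice)
    """
    fixed = []
    for t in marks:
        centre = int(round(t * sample_rate))
        lo = max(0, centre - half_window)
        hi = min(len(samples), centre + half_window + 1)
        if lo >= hi:
            # Mark lies outside the waveform; leave it untouched.
            fixed.append(t)
            continue
        best = max(range(lo, hi), key=lambda i: samples[i])
        fixed.append(best / sample_rate)
    return fixed
```

The search radius should stay well below one pitch period, otherwise a mark may jump to the peak of a neighbouring period.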

Because there is often a power mismatch through a set of diphones, a simple method for finding what general power difference exists between files is provided. This finds the mean power for each vowel in each file and calculates a factor with respect to the overall mean vowel power. A table of power modifiers for each file can be calculated by

bin/find_powerfactors lab/*.lab

The factors calculated by this are saved in `etc/powfacts'. Then build the pitch-synchronous LPC coefficients, which use the power factors if they've been calculated.

bin/make_lpc wav/*.wav
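The power-factor calculation described above can be illustrated with the short Python sketch below. It is not the `find_powerfactors` script itself. It assumes that the vowel segments of each file have already been cut out using the labels, and that the factor is defined as the overall mean vowel power divided by the file's mean vowel power; the actual script may define or apply the factor differently.

```python
def mean_power(segment):
    """Mean squared amplitude of one vowel segment."""
    return sum(s * s for s in segment) / len(segment)


def power_factors(vowel_segments_by_file):
    """Map each file to (overall mean vowel power) / (file mean vowel power).

    vowel_segments_by_file: {file_id: [samples of vowel 1, samples of vowel 2, ...]}
    Files without any usable vowel segment are skipped.
    """
    file_means = {}
    for file_id, segments in vowel_segments_by_file.items():
        powers = [mean_power(seg) for seg in segments if seg]
        if powers and sum(powers) > 0:
            file_means[file_id] = sum(powers) / len(powers)
    if not file_means:
        return {}
    overall = sum(file_means.values()) / len(file_means)
    return {file_id: overall / m for file_id, m in file_means.items()}
```

Under this assumed definition, a factor above 1 marks a file that is quieter than average. Since the factor is a power ratio, its square root would be the corresponding amplitude scale.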

Now the database is ready for its initial tests. In speech recognition, a simple count of correctly recognized phones or words gives a reasonable indication of how well a system works; for synthesis, testing relies largely on listening to the output.

When there has been no hand correction of the labels, this stage may fail because some diphones do not have proper start, mid and end values. This happens when the automatic labeler has positioned two labels at the same point. For each diphone that has a problem, find out which file it comes from and use emulabel, a labeling application, to correct the labels. After correcting labels you must re-run the `make_diph_index' command. We also re-run the `find_powerfactors' and `make_lpc' stages, as these too depend on the labels, but they take longer to run and perhaps need only be done once many labels have been corrected.
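Before opening each file in the label editor, the problem of two labels placed at the same point can be screened for automatically. The sketch below is a hypothetical helper, not part of Festival or the voice-building scripts; it assumes the labels of one file have already been read into a list of (end time in seconds, phone name) pairs.

```python
def find_collapsed_labels(labels, tolerance=1e-6):
    """Return labels whose end time coincides with the previous label's end time.

    labels    : list of (end_time_seconds, phone_name) in file order
    tolerance : times closer than this are treated as identical (assumed value)
    """
    problems = []
    previous_end = None
    for end_time, name in labels:
        if previous_end is not None and end_time - previous_end <= tolerance:
            problems.append((end_time, name))
        previous_end = end_time
    return problems
```

Any file for which this list is not empty is a candidate for hand correction.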

Test the voice's basic functionality with

festival> (SayPhones '(# s e l a m a t #))

festival> (intro)

As the automatic labeling is unlikely to work perfectly, it is important to listen to a number of examples to find out which diphones have gone wrong. Once the errors are corrected, a final usable voice is created.

In addition to waveform generation, the next step is to provide text analysis for the BM language. As the relationship between BM orthography and phones is almost trivial, we write a set of letter-to-sound rules.
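To show why a near-trivial orthography makes hand-written rules feasible, here is a minimal Python sketch of such a rule set. It is not the rule set written for this system and does not use Festival's rule syntax. The digraph table and the phone names are assumptions, and real rules would also need to handle cases this sketch ignores, such as the two pronunciations of the letter "e".

```python
# Assumed digraph-to-phone table; phone names are illustrative only.
DIGRAPHS = {"ng": "ng", "ny": "ny", "sy": "sy", "kh": "kh", "gh": "gh"}


def letters_to_phones(word):
    """Greedy left-to-right mapping: try a two-letter rule first, else one letter = one phone."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i].isalpha():
            phones.append(word[i])
            i += 1
        else:
            # Skip hyphens, apostrophes and other non-letters.
            i += 1
    return phones
```

For instance, `letters_to_phones("selamat")` returns `['s', 'e', 'l', 'a', 'm', 'a', 't']`, the same phone string used in the `SayPhones` test above.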

6. LEXICON AND LETTER-TO-SOUND SUPPORT

The lexicon is mainly concerned with the pronunciation of words. A pronunciation in Bahasa Melayu requires not just a list of phones but also a syllabic structure. In Bahasa Melayu this structure is well defined and can be derived unambiguously from a phone string. The lexicon structure available in Festival takes both a word and a part of speech (an arbitrary token) to find the given pronunciation. The word itself and a fairly broad part-of-speech tag will in most cases identify the proper pronunciation.

An example entry is:

("selamat"
nn
(((s e)0)((l a)1)((m a t)0)))

Each syllable is marked with a stress value (0 for low pitch tone and 1 for high pitch tone).

The basic assumption in Festival is that the language will have a large lexicon, with tens of thousands of entries, that is used as a standard part of the implementation of a voice. Letter-to-sound rules are used as a backup when a word is not explicitly listed. In addition to a large lexicon, Festival also supports a smaller list called the addenda. This is primarily provided to allow specific applications and users to add entries that are not in the existing lexicon.
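The lookup order described above can be sketched in Python as follows. This is an illustration of the idea, not Festival's implementation: the dictionary layout, the function names and the assumption that the addenda is consulted before the main lexicon are all ours. The fallback `naive_lts` is a stand-in that treats every letter as one phone in a single unstressed syllable.

```python
# Pronunciations are lists of (phones, stress) syllables, mirroring the entry above.
LEXICON = {
    ("selamat", "nn"): [(["s", "e"], 0), (["l", "a"], 1), (["m", "a", "t"], 0)],
}


def naive_lts(word):
    """Stand-in letter-to-sound fallback: one phone per letter, one unstressed syllable."""
    return [([ch for ch in word.lower() if ch.isalpha()], 0)]


def lookup(word, pos, addenda, lexicon, lts=naive_lts):
    """Try the addenda, then the main lexicon, then fall back to letter-to-sound rules."""
    for table in (addenda, lexicon):
        for key in ((word, pos), (word, None)):  # exact tag first, then any tag
            if key in table:
                return table[key]
    return lts(word)
```

A user-specific pronunciation can then be supplied through the addenda without touching the main lexicon, and any word missing from both still receives a pronunciation from the rules.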

A very good set of letter-to-sound rules is needed for Bahasa Melayu to properly predict the pronunciations of variants of root words. Letter-to-sound rules are used as a backup when a word is not explicitly listed. However, this is a very flexible view: an explicit lexicon is not necessary in Festival, and it may be possible to do much of the work in letter-to-sound rules. Due to time constraints we were unable to build a large set of letter-to-sound rules; we have therefore defined some main rules for the purpose of testing the system.

Festival has a letter-to-sound rule system that allows rules to be written by hand, and it also provides a method for building rule sets automatically, which will often be more useful. The choice between hand-written and automatically trained rules depends on the language you are dealing with and the relationship between its orthography and its phone set.

The automatic process involves the following steps:

  • Pre-processing the lexicon into a suitable training set
  • Defining the set of allowable pairings of letters to phones. (We intend to do this fully automatically in future versions).
  • Constructing the probabilities of each letter/phone pair.
  • Aligning letters to an equal set of phones/_epsilons_.
  • Extracting the data by letter suitable for training.
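The probability-construction step can be illustrated with a short Python sketch. It assumes the alignment has already been done, so that every word is paired with a phone list of the same length in which `_epsilon_` marks a letter that produces no phone. The function name and the example alignments are hypothetical and are not taken from Festival's training scripts.

```python
from collections import Counter, defaultdict


def pair_probabilities(aligned):
    """Estimate P(phone | letter) from (word, phones) pairs of equal length."""
    counts = defaultdict(Counter)
    for letters, phones in aligned:
        if len(letters) != len(phones):
            raise ValueError(f"{letters!r} is not aligned one phone per letter")
        for letter, phone in zip(letters, phones):
            counts[letter][phone] += 1
    return {
        letter: {phone: n / sum(c.values()) for phone, n in c.items()}
        for letter, c in counts.items()
    }


# Hypothetical aligned training data: "ng" is assigned to "n" and "g" maps to _epsilon_.
EXAMPLE = [
    ("bunga", ["b", "u", "ng", "_epsilon_", "a"]),
    ("guna", ["g", "u", "n", "a"]),
]
```

On `EXAMPLE`, the letter "n" maps to the phones "ng" and "n" with probability 0.5 each, which is the kind of ambiguity the trained rules must resolve from the surrounding letter context.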

7. CONCLUSION AND FUTURE WORK

This paper has described a TTS synthesis system for Bahasa Melayu. Our evaluation results showed that minimising distortions at diphone junctions generally increased the naturalness of the output speech. The main advantage of this system is that it allows rapid and almost automatic development of new voices or new speaker styles, according to the speech database used. Future work will be devoted to the development of new voices and speaking styles, and to better prosodic selection.

References:

[1] R. Hoory, N. Shaked and D. Chazan, Building a speech database for the purpose of speaker specific speech synthesis, International Conference on Signal Processing, 1996, pp. 741-744.

[2] A. W. Black and N. Campbell, Optimising selection of units from speech databases for concatenative synthesis, Eurospeech, 1995, pp. 581-584.

[3] N. Campbell and A. W. Black, Prosody and the Selection of Source Units for Concatenative Synthesis, Progress in Speech Synthesis, Springer-Verlag, New York, 1997.

[4] S. Anantan, R. Ramtawar, G. Loganathan and N. Sriraam, Bahasa Melayu based Text-to-Speech System: A feasibility study approach, International Conference on Advancement in Science and Technology, 2003, pp. 45-48.

[5] P. Taylor, A. W. Black, R. Caley, The Festival Speech Synthesis System, University of Edinburgh,

[6] T. Sef, Slovenian Text-to-Speech System, Circuits and Systems - Proceedings of the 2000 IEEE International Symposium, Geneva, Vol. 5, 2000, pp. 41-44.

[7] T. Sef, Text Analysis for the Slovenian Text-to-Speech System, Electronics, Circuits and Systems - The 8th IEEE International Conference, 2001, pp. 1355-1358.

[8] A. Syrdal, A. Bennet, S. Greenspan, Applied Speech Technology, 2nd Ed., CRC Press, London, 1985.

[9] S. Dobrisek, J. Gros, F. Mihelic, N. Pavesic, HOMER: A Voice-Driven Text-to-Speech System for the Blind, Industrial Electronics - Proceedings of the IEEE International Symposium, Vol. 1, 1999, pp. 205-208.

[10] X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, M. Plumpe, Recent Improvements on Microsoft's Trainable Text-to-Speech System - Whistler, Acoustics, Speech, and Signal Processing - IEEE International Conference, Vol. 2, 1997, pp. 959-962.


Figure 2: Example of Bahasa Melayu Syllables
