CONTENTS
INTRODUCTION
Properties of Speech Signals
ITS MECHANISM AND CLASSIFICATION
WHAT IS VOICE MORPHING
Prototype waveform interpolation
PWI-based speech morphing
The basic algorithm
Computation of the characteristic waveform surface
New vocal tract model calculation and synthesis
CONCLUSIONS
REFERENCES
Introduction
Voice Morphing
Speech signals convey a wide range of information. Of this information the most important is the meaning of the message being uttered. However, secondary information such as speaker identity plays a major role in oral communication. Voice alteration techniques attempt to transform the speech signal uttered by a given speaker so as to disguise the original voice. In addition to that, it is possible to modify the original voice to sound like another speaker, the target speaker. This is generally known as voice morphing. There has been a considerable amount of research effort directed at this problem due to the numerous applications that this technology has. Some examples of applications that will benefit from successful voice conversion techniques are:
- Customization of text-to-speech systems, e.g. to speak with a desired voice or to read out email in the sender’s voice. This could also be applied to voice individuality disguise for secure communications or voice individuality restoration for interpreting telephony.
- In the entertainment industry a voice morphing system may well replace or enhance the skills involved in producing sound tracks for animated characters, dubbing or voice impersonating.
- In Internet “chat rooms”, communication can be enhanced by a technology for disguising the voice of the speaker.
- Communication systems can be developed that would allow speakers of different languages to have conversations. These systems will first recognize the sentence uttered by each speaker and then translate and synthesize them in a different language.
- Assisting hearing-impaired persons. People with a hearing loss in the high-frequency sounds can benefit from a device that can appropriately change the spectral envelope of the speech signal.
Approach to the problem
It has been recognized over the years that voice individuality is a consequence of combining several factors. Among these factors, segmental speech characteristics such as the rate of speaking, the pitch contour or the duration of the pauses have been shown to contribute greatly to speaker individuality. Furthermore, it has been shown that the linguistic style of the speech, which is determined by the rules of language, has a great influence on the voice features. In the current state of research, the processing of these features of speech by an automatic system is very difficult because high-level considerations are involved. Fortunately, strong experimental evidence has indicated that distinct speakers can be efficiently discriminated at the segmental level by comparing their respective spectral envelopes. It is generally admitted that the overall shape of the envelope together with the formant characteristics, the nature of which will be explained later on, are the major speaker-identifying features of the spectral envelope.
The method we will use for the work of this dissertation is based on the codebook mapping approach, which has been the subject of extensive research over the years with respect to voice conversion. This method creates “dictionaries” of words, sounds or any chosen speech segments, stores their acoustic characteristics as codebook vectors or matrices, and maps them by replacing them on a speaker-to-speaker basis. Furthermore, the analysis will focus solely on the transformation of the speaker’s formant characteristics.
Voice conversion will be performed in two phases. In the first phase, the training, the speech signals of the source and target speakers will be analyzed and the voice characteristics will be extracted by means of a mathematical optimization technique that is very popular in the speech processing world, the Linear Predictive Coding (LPC) technique. In order to increase the robustness of the conversion, the high-dimensional speech spaces, where the voice characteristics lie, will be described by a continuous probability density corresponding to a parametric constrained Gaussian Mixture Model that is topologically orientated, by means of the Generative Topographic Mapping (GTM). This model will map the high-dimensional data to a 2-D space and at the same time make possible a transformation that will take place at the distribution level rather than transforming the data directly. In the transformation stage, the acoustic features of the source signal, as they are represented by the GTM, will be transformed to those of the target speaker by means of codebook mapping, that is, by a one-to-one correspondence, at the distribution level, between the codebook entries of the source speaker and the target speaker. Finally, the transformed features will be used in order to synthesize speech that will, hopefully, resemble that of the target speaker. Speech synthesis will again be performed by means of Linear Predictive Coding.
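The LPC analysis performed in the training phase can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: it estimates the all-pole coefficients of a single speech frame using the autocorrelation method and the Levinson-Durbin recursion; the function name and frame handling are illustrative.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate LPC coefficients [1, a1, ..., ap] for one speech frame
    via the autocorrelation method and the Levinson-Durbin recursion."""
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]                       # prediction error power
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / error
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        error *= (1.0 - k * k)
    return a, error
```

For real speech, the signal would first be split into short windowed frames (e.g. 20–30 ms) with a prediction order of roughly 10–16 at 8 kHz sampling.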
Figure: Voice morphing implementation
It is agreed by most authors that the mapping codebook approach, although it provides a voice conversion effect that is sometimes impressive, is plagued by its poor quality and its lack of robustness. We do, however, believe that the topologically orientated parametric model that we will use in this dissertation in order to describe our speech space may increase the robustness of the transformation and produce speech of improved quality in terms of resembling the target speaker.
Properties of Speech Signals
In order to have a good model for representing the speech signal, we need to have a good understanding of the process of speech production. In the following section, we present a concise description of the anatomy and physiology of speech production.
The mechanism of speech production
Anatomy and Physiology of the Human Speech Production
The speech production apparatus is comprised of three major anatomical subsystems: the respiratory or subglottal, the laryngeal and the articulatory subsystem. Figure 2.1 depicts the speech production system. The respiratory subsystem is composed of the lungs, trachea and windpipe, diaphragm and the chest cavity. The larynx and pharyngeal cavity, or throat, constitute the laryngeal subsystem. The articulatory subsystem includes the oral cavity and the nasal cavity. The oral cavity is comprised of the velum, the tongue, the lips, the jaw and the teeth. In technical discussions of speech processing, the vocal tract refers to the combination of the larynx, the pharyngeal cavity and the oral cavity. The nasal tract begins at the velum and terminates at the nostrils.
The respiratory subsystem behaves like an air pump, supplying the aerodynamic energy for the other two subsystems. In speech processing, the basic aerodynamic parameters are air volume, flow, pressure and resistance. The main contribution of the respiratory subsystem to speech production is as follows: the speaker inhales air through muscular adjustments that increase the volume of the subglottal system, and the lungs then release air by a combination of passive recoil and muscular adjustments. Speech is the acoustic wave that is radiated from the subglottal system when air is expelled from the lungs. The laryngeal subsystem acts as a passage for air flow from the respiratory subsystem to the articulatory subsystem. In the laryngeal subsystem, the larynx consists of various cartilages and muscles. For speech production, of particular importance are a pair of flexible bands of muscle and mucous membrane called the vocal folds. The vocal folds vibrate to provide a periodic excitation for the production of certain speech types that will be discussed in the next subsection. The vocal folds come together or separate to respectively close or open the laryngeal airway. The opening between the vocal folds is known as the glottis. The articulatory subsystem stretches from the top of the larynx up to the lips and nose, through which the acoustic energy can escape.
The articulators are movable structures that shape the vocal tract, determining its resonant properties. This subsystem also provides an obstruction for some cases or generates noise for certain speech types.
Figure: Cross-section of the human vocal system
Classification of speech sounds
Speech signals are composed of a sequence of sounds. These sounds and the transitions between them serve as a symbolic representation of information. Phonemic content is the set of the characteristics found in an acoustic signal that would make a listener distinguish one utterance from another. It is expressed in terms of phonemes that are mainly classified as vowels, consonants, diphthongs and semivowels. English uses about 40 phonemes. The study of the classification of the sounds of speech is called phonetics. A detailed discussion of phonetics is beyond the scope of this chapter. However, in processing speech signals it is useful to know how speech signals are classified as this affects the speech processing technique.
Speech sounds can be classified into three distinct categories according to their mode of excitation: voiced, unvoiced and plosive sounds. For the production of voiced sounds, the lungs press air through the glottis and the vocal cords vibrate, periodically interrupting the air stream and producing a quasi-periodic pressure wave. These pressure impulses are commonly called pitch impulses, and the frequency of the pressure signal is the fundamental frequency. All vowels are voiced sounds. For the production of unvoiced or fricative sounds, air flow from the lungs becomes turbulent as the air is forced through a constriction at some point in the vocal tract at a high enough velocity. This creates a broad-spectrum noise source to excite the vocal tract. An example of a fricative sound is ’sh’ in the word ’shall’. Plosive sounds result from making a complete closure, building up pressure behind the closure and abruptly releasing it. The rapid release of this pressure causes a transient excitation.
An example of a plosive sound is ’b’ in the word ’beer’. Therefore, for voiced sounds the excitation is in the form of a periodic train of pulses whereas for unvoiced sounds the sound is generated by random noise.
The resonant properties of the vocal tract mentioned earlier are the characteristic resonant frequencies. Since these resonances tend to ’form’ the overall speech spectrum, we refer to them as formant frequencies. Formant frequencies usually appear as peaks in the speech spectrum.
Figure: Approximation of vocal tract by lossless tubes
The cross-sectional areas of the tubes are chosen so as to approximate the area function of the vocal tract. If a large number of tubes of short length is used, we can reasonably expect the formant frequencies of the concatenated tubes to be close to those of a tube with a continuously varying area function. The most important advantage of this representation is the fact that the lossless tube model provides a convenient transition between continuous-time models and discrete-time models.
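As a concrete illustration of the lossless tube approximation, consider the degenerate case of a single uniform tube closed at the glottis and open at the lips. Its resonances fall at odd multiples of c/(4L). The sketch below (an illustrative helper, not part of any standard library) evaluates this for an average vocal tract length of about 17.5 cm, recovering the well-known neutral-vowel formants near 500, 1500 and 2500 Hz.

```python
def uniform_tube_formants(length_cm, n_formants=3, c_cm_s=35000.0):
    """Resonances of a single uniform lossless tube, closed at the
    glottis and open at the lips: f_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c_cm_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

# A 17.5 cm tube gives formants at 500, 1500 and 2500 Hz
print(uniform_tube_formants(17.5))  # [500.0, 1500.0, 2500.0]
```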
Speech production model
The vocal system experiences energy losses during speech production because of several factors, including losses due to heat conduction and viscous friction at the vocal tract walls, radiation of sound at the lips, etc. Many detailed mathematical models have been developed that describe each part of the human speech production procedure. These models for sound generation, propagation and radiation can in principle be solved with suitable values of the excitation and vocal tract parameters to compute an output speech waveform. Although this may be the best approach to the synthesis of naturally sounding synthetic speech, such detail is often impractical or unnecessary. The approach we will use is a rather superficial, yet efficient, one and examines a model for the vocal tract solely. The most common models that have been used as the basis of speech production separate the excitation features from the vocal tract features. The vocal tract features are accounted for by a time-varying linear system. The excitation generator creates a signal that is either a train of pulses (for voiced sounds) or randomly varying (for unvoiced sounds). The parameters of the model are chosen such that the resulting output has the desired speech-like properties.
The simplest physical configuration that has a useful interpretation in terms of the speech production process is modeling the vocal tract as a concatenation of uniform lossless tubes. Figure 2.3 shows a schematic of the vocal tract and its corresponding approximation as a concatenation of uniform lossless tubes.
Human speech production model
Let us summarize what we have learned so far. Sound is generated by two different types of excitation, and each mode results in a distinctive type of output. Furthermore, we have learned that the vocal tract imposes its resonances upon the excitation so as to produce the different sounds of speech. Based on these facts, and after extensive model analysis, it has been concluded that the lossless tube discrete-time model can be represented by a transfer function of the form

H(z) = G / (1 - Σ_{k=1}^{N} a_k z^{-k})

Here the gain G and the coefficients a_k depend upon the area function of the vocal tract, z is the z-transform variable, and N is the number of lossless tubes of equal length whose concatenation we assume to approximate the vocal tract we are modeling. The poles of the system are the values of z for which the denominator vanishes. As the transfer function has z-transform components in the denominator only, we can say that it is characteristic of an all-pole digital filter. We can thus say that a speech signal can be produced when an excitation signal is passed through this all-pole filter.
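The source-filter structure described above can be sketched numerically: a pulse-train excitation driving an all-pole filter. The sample rate, pitch and pole placement below are illustrative assumptions, not values taken from this dissertation.

```python
import cmath
import numpy as np

def synthesize(excitation, a, gain=1.0):
    """Pass an excitation through the all-pole filter
    H(z) = G / (1 + a[1] z^-1 + ... + a[N] z^-N), with a[0] = 1."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out[n] = acc
    return out

# Illustrative values: 8 kHz sampling, 100 Hz pitch, one resonance near 500 Hz
fs, f0 = 8000, 100
excitation = np.zeros(fs // 10)          # 100 ms of signal
excitation[::fs // f0] = 1.0             # impulse train (voiced excitation)
pole = 0.95 * cmath.exp(2j * cmath.pi * 500 / fs)
a = [1.0, -2.0 * pole.real, abs(pole) ** 2]   # complex-conjugate pole pair
voiced = synthesize(excitation, a)
```

Replacing the impulse train with white noise yields an unvoiced-style output from the same filter, matching the two excitation modes discussed above.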
Figure: Filter representation of speech production
The Challenge
In order to perform voice morphing successfully, the speech characteristics of the source speaker’s voice must change gradually to those of the target speaker’s; therefore, the pitch, duration, and spectral parameters must be extracted from both speakers. Then, natural-sounding synthetic intermediates have to be produced.
What is Voice Morphing?
Voice morphing is a technique for modifying a source speaker’s speech to sound as if it were spoken by a designated target speaker.
Research Goals: To develop algorithms which can morph speech from one speaker to another with the following properties:
1. High quality (natural and intelligible)
2. Morphing function can be trained automatically from speech data which may or may not require the same utterances to be spoken by the source and target speaker.
3. The ability to operate with target voice training data ranging from a few seconds to tens of minutes.
Key Technical Issues
1. Mathematical Speech Model
• For speech signal representation and modification
2. Acoustic Feature
• For speaker identification
3. Conversion Function
• Involves methods for training and application
Pitch Synchronous Harmonic Model
The sinusoidal model has been widely used for speech representation and modification in recent years.
- PSHM is a simplification of the standard ABS/OLA sinusoidal model
- The model parameters are estimated by minimizing the modeling error E = Σ_n [ s(n) − s̃(n) ]², where s(n) is the original speech signal and s̃(n) is its harmonic reconstruction.
Time and Pitch Modification using PSHM
Pitch Modification
• It is essential to maintain the spectral structure while altering the fundamental frequency.
• Achieved by modifying the excitation components whilst keeping the original spectral envelope unaltered.
Time Modification
• The PSHM model allows the analysis frames to be regarded as phase-independent units which can be arbitrarily discarded, copied and modified.
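Treating analysis frames as phase-independent units makes time modification conceptually simple. The toy sketch below operates on abstract "frames" rather than on actual PSHM parameters: it slows down or speeds up a frame sequence by duplicating or discarding frames.

```python
def time_scale_frames(frames, factor):
    """Toy time modification: resample the frame sequence at a rate of
    1/factor, duplicating frames (factor > 1) to stretch, or
    discarding frames (factor < 1) to compress. Each frame is assumed
    phase-independent, as in PSHM."""
    n_out = int(round(len(frames) * factor))
    return [frames[min(int(i / factor), len(frames) - 1)]
            for i in range(n_out)]
```

For example, a factor of 2.0 doubles the duration by repeating every frame, while 0.5 halves it by keeping every other frame; a real system would then resynthesize speech from the retained frames.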
Suprasegmental Cues
• Speaking rate, pitch contour, stress, accent, etc.
• Very hard to model
Segmental Cues
• Formant locations and bandwidths, spectral tilt, etc.
• Can be modeled by spectral envelope.
• In our research, Line Spectral Frequencies (LSF) are used to represent the spectral envelope.
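Line Spectral Frequencies can be derived from the LPC coefficients through the sum and difference polynomials P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1), whose roots lie on the unit circle; the LSFs are the angles of those roots. The sketch below is a rough numerical illustration using generic root finding, not the specialized routines a speech toolkit would use.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] to Line Spectral
    Frequencies in radians. P and Q are the sum and difference
    polynomials; their unit-circle root angles, interleaved, are the
    LSFs (trivial roots at z = +/-1 are discarded)."""
    a = np.asarray(a, dtype=float)
    # Highest-power-first coefficients of z^(p+1) P(z) and z^(p+1) Q(z)
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = []
    for poly in (p_poly, q_poly):
        for root in np.roots(poly):
            ang = np.angle(root)
            if 1e-6 < ang < np.pi - 1e-6:   # keep one of each conjugate pair
                angles.append(float(ang))
    return sorted(angles)
```

For a stable predictor the LSFs are monotonically increasing in (0, π), which is one reason they are a convenient and robust representation of the spectral envelope.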
Basic Block Diagram
Prototype waveform interpolation
PWI is a speech coding method. This method is based on the fact that voiced speech is quasi-periodic and can be considered as a chain of pitch cycles. Comparing consecutive pitch cycles reveals a slow evolution in the pitch-cycle waveform and duration.
PWI-based speech morphing
Prototype waveform interpolation is based on the observation that during voiced segments of speech, the pitch cycles resemble each other, and their general shape usually evolves slowly in time (see [16, 17, 18]). The essential characteristics of the speech signal can, thus, be described by the pitch-cycle waveform. By extracting pitch cycles at regular time instants, and interpolating between them, an interpolation surface can be created. The speech can then be reconstructed from this surface if the pitch contour and the phase function (see Section 2.3.1) are known. The algorithm presented here is based on the source-filter model of speech production [19, 20]. According to this model, voiced speech is the output of a time-varying vocal-tract filter, excited by a time-varying glottal pulse signal. In order to separate the vocal-tract filter from the source signal, we used LPC analysis [21], by which the speech is decomposed into two components: the LPC coefficients containing the information of the vocal tract characteristics, and the residual error signal, analogous to the derivative of the glottal pulse signal. In the proposed morphing technique, we used the PWI to create a 3D surface from the residual error signal which would represent the source characteristics for each speaker. Interpolation between the surfaces of the two speakers allows us to create an intermediate excitation signal. In addition to the fact that the information of the vocal tract (see Section 2.3.3) is manipulated separately from the information of the residual error signal, it is also more advantageous to create a PWI surface from the residual signal than to obtain one from the speech itself. In this domain, it is relatively easy to ensure that the periodic extension procedure (see below) does not result in artifacts in the characteristic waveform shape [16].
This is due to the fact that the residual signal contains mainly excitation pulses, with low-power regions in between, and thus allows a smooth reconstruction of the residual signal from the PWI surface with minimal phase discontinuities. In the proposed algorithm, the surfaces of the residual error signals, computed for each voiced phoneme of two different speakers, are interpolated to create an intermediate surface. Together with an intermediate pitch contour and an interpolated vocal-tract filter, a new voiced phoneme is produced.
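The surface interpolation step can be sketched as follows: each extracted pitch cycle of the residual is resampled to a common normalized phase axis, the cycles are stacked into a characteristic-waveform surface, and the two speakers' surfaces are blended. This is a simplified illustration of the idea only; it omits the cycle alignment, periodic extension and pitch-contour handling that the text describes.

```python
import numpy as np

def resample_cycle(cycle, n_points):
    """Resample one pitch cycle to a fixed number of samples over a
    normalized phase axis 0..1 (linear interpolation)."""
    src = np.linspace(0.0, 1.0, len(cycle))
    dst = np.linspace(0.0, 1.0, n_points)
    return np.interp(dst, src, cycle)

def build_surface(cycles, n_phase=64):
    """Stack normalized pitch cycles into a characteristic-waveform
    surface: rows = extraction instants, columns = phase."""
    return np.stack([resample_cycle(c, n_phase) for c in cycles])

def morph_surfaces(surf_a, surf_b, alpha):
    """Blend two residual surfaces; alpha = 0 gives speaker A,
    alpha = 1 gives speaker B (surfaces must share the same grid)."""
    return (1.0 - alpha) * surf_a + alpha * surf_b
```

Sweeping alpha from 0 to 1 over successive phonemes would produce a gradual transition of the excitation characteristics from one speaker to the other, to be combined with the interpolated vocal-tract filter and pitch contour.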