Chapter 9 EXPERIMENTS WITH VOICE QUALITY FEATURES
The classification of emotions using voice quality features is a new field of investigation that is used and referred to in many recent studies on emotion recognition (see [Joh99, Alt00]). However, the approach faces several obstacles, because attributes of this kind are difficult to estimate. Diverse sets of features and methods were therefore tried during the development of this work.
9.1 Preliminary experiments
9.1.1 Experimental observation
In order to become familiar with the voice quality properties detailed in chapter 3 and to find relations between these features and a given emotion, the same sentence uttered by the same speaker (id0008) in five different emotions is analysed. The selected sentence contains the German text “Kick den Ball” and was chosen among several AIBO commands because it contains three different vowels. As a result, changes in voice quality properties can be observed for the phonemes /i/, /e/ and /a/.
As explained in more detail in chapter 3, the vocal tract is an acoustic system consisting of acoustic tubes and constrictions. The system has acoustic resonances that depend on the lengths of the cavities. Since these lengths vary depending on the articulated phoneme, the resonances, known as formants, are different for each vowel. Furthermore, the human ear recognizes a vowel sound as the same vowel even when it is uttered at different pitches. As a result, formant frequencies can be considered a key factor in distinguishing vowels from each other.
First of all, the /i/ sound was analysed. Formant frequencies for this vowel can be observed in figure 9.1. It presents the lowest F1 resonance in comparison to /e/ (figure 9.2) or /a/ (figure 9.3), while its F2 frequency is higher. Based on these two spectral characteristics, the /i/ vowel sound can be uniquely identified among other vowels.
However, although these formant frequencies characterise a given vowel, they are not strictly “fixed”. F1 and F2 remain within very specific limits of the frequency range, so that they can almost be said to be constant. The variation inside this range due to emotion is the motivation of the current experiment. We try to find relations between quality features, such as formant frequencies, and emotions for each of the three measured vowels. It has to be noted that the sounds analysed during this experiment are not pure vowels but voiced regions containing the vowel, selected after analysing the voiced frames of the signal. Therefore, the calculated formant values are also influenced by the boundaries of the vowel.
Table 9.1 compares some measurements over the /i/ segment for each emotion. It shows variations among the features measured over the same vowel /i/ of the same utterance but with different emotional content. The table contains the formant frequencies (F1, F2, F3), the formant bandwidths (Bw1, Bw2, Bw3) and statistics of the Harmonic to Noise Ratio (HNR).
Emotion / F1 / F2 / F3 / Bw1 / Bw2 / Bw3 / Max. HNR / Mean HNR / Std. HNR
Angry / 327 / 2083 / 2637 / 174 / 155 / 349 / 15.2 / 13 / 2.5
Bored / 325 / 1987 / 2402 / 95 / 178 / 365 / 11 / 8.5 / 3.7
Happy / 341 / 2059 / 2305 / 238 / 124 / 954 / 11 / 11 / 0
Neutral / 336 / 1898 / 2542 / 90 / 181 / 255 / 16.3 / 11.8 / 4.1
Sad / 340 / 1878 / 2500 / 120 / 140 / 496 / 5.4 / 5.4 / 0
Table 9.1. Some quality features calculated over a voiced region containing the /i/ vowel from the sentence “Kick den Ball” uttered by the speaker id0008.
Formant frequencies differ slightly between emotional states. Since these attributes must stay within a fixed range to maintain the intelligibility of the vowel, the capability of these deviations to help in emotional discrimination has to be tested. The results suggest that formant bandwidths make some noticeable distinctions; in particular, the first and third formant bandwidths appear relevant for detecting happy patterns. The harmonic to noise ratio is especially low for the sad utterance, even in comparison with the bored one, which can be a useful cue in the evaluation dimension of emotions (see chapter 2).
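To make the comparison between features less dependent on their units, the spread of each feature across the five emotions can be expressed relative to its mean. The following is a minimal sketch using the values of table 9.1; the measure (max - min) / mean and the names `relative_spread`, `FEATURES` and `TABLE_9_1` are illustrative assumptions and not part of the original analysis. Since it is based on a single utterance per emotion, it only illustrates the observation and has no statistical weight.

```python
# Values transcribed from table 9.1 (voiced region containing /i/, speaker id0008).
FEATURES = ["F1", "F2", "F3", "Bw1", "Bw2", "Bw3", "MaxHNR", "MeanHNR", "StdHNR"]
TABLE_9_1 = {
    "angry":   [327, 2083, 2637, 174, 155, 349, 15.2, 13.0, 2.5],
    "bored":   [325, 1987, 2402,  95, 178, 365, 11.0,  8.5, 3.7],
    "happy":   [341, 2059, 2305, 238, 124, 954, 11.0, 11.0, 0.0],
    "neutral": [336, 1898, 2542,  90, 181, 255, 16.3, 11.8, 4.1],
    "sad":     [340, 1878, 2500, 120, 140, 496,  5.4,  5.4, 0.0],
}


def relative_spread(table, features):
    """Return {feature: (max - min) / mean}, computed across the emotions."""
    spread = {}
    for j, name in enumerate(features):
        column = [row[j] for row in table.values()]
        mean = sum(column) / len(column)
        spread[name] = (max(column) - min(column)) / mean
    return spread


spread = relative_spread(TABLE_9_1, FEATURES)
# Print the features from the most to the least variable across emotions.
for name in sorted(spread, key=spread.get, reverse=True):
    print(f"{name:8s} {spread[name]:.2f}")
```

For these values, the three formant frequencies spread by less than 15% of their mean, whereas Bw1 and Bw3 spread by more than 100%, which agrees with the bandwidths being the more promising cue for this vowel.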
Emotion / F1 / F2 / F3 / Bw1 / Bw2 / Bw3 / Max. HNR / Mean HNR / Std. HNR
Angry / 348 / 1515 / 2406 / 470 / 295 / 399 / 23.5 / 12.6 / 8.3
Bored / 255 / 1610 / 2389 / 78 / 483 / 150 / 24 / 15.4 / 6.5
Happy / 342 / 1643 / 2505 / 163 / 817 / 81 / 20 / 12.2 / 3.5
Neutral / 270 / 1470 / 2425 / 40 / 864 / 148 / 25.2 / 18 / 6.1
Sad / 222 / 1669 / 2335 / 143 / 1080 / 168 / 25 / 19.6 / 5.2
Table 9.2. Some quality features calculated over a voiced region containing the /e/ vowel from the sentence “Kick den Ball” uttered by the speaker id0008.
Table 9.2 presents the same set of quality features for the second voiced region of the sentence, where the vowel /e/ is present. The bandwidth information seems to be useful to discriminate among emotions in this case as well. Angry and happy are clearly separated by these values, so they could be discriminated on the evaluation axis.
In this case, formant frequencies and harmonicity do not show a wide range of variation among emotional states and therefore seem less suitable for emotional classification. It is thus experimentally observed that the variation of voice quality parameters due to emotions also depends on the inspected vowel.
Finally, table 9.3 shows the values of the same features for the vowel /a/. Deviations around the theoretical first formant frequency are found, whereas the second and third formants are less influenced by the emotional state of the speaker. Bandwidths, as for /i/ and /e/, provide a wider range of variation among emotions.
Emotion / F1 / F2 / F3 / Bw1 / Bw2 / Bw3 / Max. HNR / Mean HNR / Std. HNR
Angry / 610 / 1292 / 2453 / 68 / 57 / 122 / 36.3 / 11.5 / 3
Bored / 473 / 1251 / 2476 / 319 / 145 / 356 / 19.9 / 16.2 / 2.7
Happy / 584 / 1273 / 2458 / 90 / 51 / 79 / 45.2 / 13.2 / 3.5
Neutral / 507 / 1320 / 2424 / 234 / 163 / 217 / 22.7 / 17.4 / 3.1
Sad / 600 / 1440 / 2579 / 325 / 171 / 571 / 21.8 / 17.6 / 3
Table 9.3. Some quality features calculated over a voiced region containing the /a/ vowel from the sentence “Kick den Ball” uttered by the speaker id0008.
Although not all the employed parameters have been tested, experiment 9.1.1 supports the theory that voice quality features vary when the emotional state of the speaker changes. Experiments aimed at discriminating emotions with the help of this kind of features therefore seem worthwhile and feasible.
Sections 9.1.2 and 9.1.3 give an overview of the first experiments tried. As with the preliminary experiments of chapter 8, the findings of this stage are considered only as a reference and as a valuable test of the software employed to calculate the features. More systematic experiments and outcomes are given in sections 9.2 and 9.3 for the speaker dependent and speaker independent cases respectively.
9.1.2 Speaker dependent
9.1.2.1 Sentence based. First approach.
Since a sentence can contain many different vowels, the first voice quality approach selects similar voiced regions[1], in terms of formant structure, from every sentence. This idea is motivated by the experimental observation (9.1.1) that the variation of voice quality parameters due to emotions differs from vowel to vowel. As a result, one single voiced region, corresponding to a single vowel, represents the whole utterance. The selected region is intended to be the one most similar to an /a/, based on the formant structure within voiced regions. The criteria applied to choose the region are:
- Region with the highest F1 frequency and the lowest difference between the F2 and F1 frequencies. This comes from the fact that /a/ has a relatively high value of F1 in comparison to its F2 frequency.
- When the difference between F2 and F1 is slightly higher than in the former case but the F1 frequency is much higher, this candidate is also considered. This exception adds some tolerance to the algorithm in order to avoid certain mistakes.
Since F1 must be high while F2-F1 has to be low, relation (9.1) is used to compare different segments. The region with the lowest value, taking into account the tolerance explained in the second point, is chosen to represent the sentence.
(9.1)
A further experimental check is made in order to confirm the reliability of this selection. A small representative set of utterances is listened to and the chosen region is verified. The sound /a/ was selected in approximately 94% of the sentences where it appeared. When /a/ was absent, the sound /o/ was the next one selected, due to the similarities in their formant structure. In this way, the intention is to compare features reflecting variations in voice quality rather than in phonetic structure. The criterion is implemented by means of the scripting possibilities of Praat. A short description of the algorithm is given in figure 9.4.
The current experiment examines the discrimination capability of a basic set of attributes. The features tried are basic voice quality parameters based on formant structure and harmonicity. Two different approaches are tried. First, a neural network is trained with five output nodes, each one corresponding to one of the emotions. Then, following the activation-evaluation theory, emotions are grouped according to their position on the evaluation axis: angry is put together with sad and happy with bored, while neutral remains an individual class. A network with three output nodes is therefore trained under the same conditions as before.
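The selection of the most /a/-like region can be sketched as follows. Since relation (9.1) is not reproduced here, the score (F2 - F1) / F1 used below is only a hypothetical stand-in that has the required behaviour (it decreases when F1 grows and when F2 - F1 shrinks); the names `Region`, `a_likeness` and `pick_region` are likewise illustrative, and the original implementation is a Praat script rather than Python. The formant values are taken from the angry rows of tables 9.1 to 9.3.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str   # vowel contained in the voiced region (for inspection only)
    f1: float    # first formant of the region, in Hz
    f2: float    # second formant of the region, in Hz


def a_likeness(region):
    """Hypothetical stand-in for relation (9.1): lower means more /a/-like."""
    return (region.f2 - region.f1) / region.f1


def pick_region(regions):
    """Return the voiced region with the lowest score."""
    return min(regions, key=a_likeness)


# Angry utterance of "Kick den Ball" (values from tables 9.1, 9.2 and 9.3).
regions = [
    Region("i", 327, 2083),
    Region("e", 348, 1515),
    Region("a", 610, 1292),
]
chosen = pick_region(regions)
print(chosen.label, round(a_likeness(chosen), 2))
```

With a ratio of this kind, a candidate whose F2 - F1 is slightly larger but whose F1 is much higher can still obtain the lowest score, which is one possible way of building in the tolerance of the second criterion; the actual thresholds of the original algorithm are not given in the text.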
1. Objective:
-Speaker id0008 without listening test. From the available data (129 utterances), 70% is used for training (89 utterances), 15% for evaluation (22 utterances) and 15% for testing (18 utterances).
-QH.0d, QH.2, QF.1b, QF.2b, QF.3b, QF.7c, QF.8c and QF.9c (8 input nodes).
-Two output configurations:
a)Angry (10000), bored (01000), happy (00100), neutral (00010) and sad (00001).
b)Angry/sad (100), neutral (010) and bored/happy (001).
2. Conditions:
-Normalised by the maximum of the training set.
-No hidden layer; logistic activation function; Standard Backpropagation and Chunkwise Backpropagation learning algorithms with multiple step training.
-WTA analysis function (l=0 and h=0).
3. Results and conclusions:
The first attempt tries to show how these features contribute to distinguishing among the five emotional states considered in the framework of this thesis.
INPUT \ OUTPUT / Angry / Bored / Happy / Neutral / Sad
Angry / 75 / 0 / 0 / 0 / 25
Bored / 25 / 75 / 0 / 0 / 0
Happy / 50 / 25 / 25 / 0 / 0
Neutral / 25 / 0 / 25 / 0 / 50
Sad / 50 / 0 / 0 / 0 / 50
Table 9.4. Confusion matrix of experiment 9.1.2.1 a for the stdbp learning algorithm.
The highest recognition rate obtained for the five-emotion classification is 45%, after the fourth step of the standard backpropagation method. The confusion matrix is shown in table 9.4.
Figure 9.5 shows a strong tendency of the network to give higher values for angry. This is also reflected in table 9.4, where every input class has a non-zero percentage in the angry column.
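The reported rate can be related to table 9.4 with a short calculation. The sketch below assumes that every class contributes equally to the overall rate, so that the rate is the unweighted mean of the diagonal; the actual number of test utterances per class is not stated. The names `CONFUSION`, `mean_recognition_rate` and `column_totals` are illustrative.

```python
CLASSES = ["angry", "bored", "happy", "neutral", "sad"]

# Table 9.4: rows are the true (input) class, columns the network output, in %.
CONFUSION = [
    [75,  0,  0, 0, 25],   # angry
    [25, 75,  0, 0,  0],   # bored
    [50, 25, 25, 0,  0],   # happy
    [25,  0, 25, 0, 50],   # neutral
    [50,  0,  0, 0, 50],   # sad
]


def mean_recognition_rate(matrix):
    """Unweighted mean of the diagonal (assumes equally sized classes)."""
    diagonal = [matrix[i][i] for i in range(len(matrix))]
    return sum(diagonal) / len(diagonal)


def column_totals(matrix):
    """Sum of each output column: how often each class is chosen overall."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]


rate = mean_recognition_rate(CONFUSION)
totals = dict(zip(CLASSES, column_totals(CONFUSION)))
print(rate)     # 45.0
print(totals)   # angry collects the largest share, neutral none
```

Under this assumption the diagonal mean reproduces the 45% rate, and the column totals make the bias visible: the angry column collects 225 of the 500 percentage points, while neutral is never chosen.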
Figure 9.5. Mean of the outcomes of experiment 9.1.2.1 a for the stdbp case.
For the three-output case, the highest recognition rate, also obtained at the fourth step of standard backpropagation, decreases to 35%. As expected from the comparison with the first approaches using prosodic features, voice quality attributes seem to need deeper investigation, since their correspondence with emotions is not as well known as that of prosodic attributes.
9.1.2.2 Region based.
The next experiment considers all the voiced regions within a sentence, instead of choosing just one of them as representative, as was done in the former experiment. For the labelling of the units, every region inside an utterance that expresses a given emotion is also considered to belong to that emotion.
Consequently, each input pattern of the network corresponds to a voiced region and is classified individually into one of the classes. A decision rule is then needed in order to reach a final conclusion for the whole utterance. This rule applies the following criteria:
- A sentence is said to belong to class X when most of its voiced regions are classified as X.
- In case of a tie, the emotion of the region whose “winner” output value is highest prevails as the emotion of the utterance.
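The decision rule above can be sketched in a few lines. The function name `classify_utterance` and the representation of the network outputs as plain lists with one value per class are assumptions made for the illustration.

```python
from collections import Counter


def classify_utterance(region_outputs, classes):
    """Combine per-region network outputs into one label for the utterance.

    region_outputs: list of output vectors, one per voiced region,
                    each with one value per class.
    """
    # Winner-takes-all per region: keep the class and its output value.
    winners = []
    for outputs in region_outputs:
        best = max(range(len(outputs)), key=outputs.__getitem__)
        winners.append((classes[best], outputs[best]))

    # Majority vote over the regions.
    votes = Counter(label for label, _ in winners)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    if len(tied) == 1:
        return next(iter(tied))

    # Tie: the region with the highest winning output decides.
    label, _ = max((w for w in winners if w[0] in tied), key=lambda w: w[1])
    return label


CLASSES = ["angry", "bored", "happy", "neutral", "sad"]
majority = classify_utterance(
    [[0.1, 0.5, 0.0, 0.2, 0.1],
     [0.0, 0.6, 0.1, 0.1, 0.3],
     [0.0, 0.1, 0.0, 0.2, 0.95]], CLASSES)
tie = classify_utterance(
    [[0.9, 0.1, 0.0, 0.0, 0.2],
     [0.1, 0.0, 0.0, 0.2, 0.6]], CLASSES)
print(majority, tie)
```

In the first call two of the three regions vote for bored, so the single very confident sad region does not change the result; in the second call the vote is tied and the region with the higher winning output (angry, 0.9) prevails.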
Five new features concerning the energy of the signal are added to the basic set of experiment 9.1.2.1.
1. Objective:
- Speaker id0008 without listening test. The number of patterns in each class, which depends on the number of voiced regions within the utterances, is 62 angry, 56 bored, 57 happy, 57 neutral and 55 sad. From this set, approximately 70% of the utterances (193 patterns) are used for training, 15% (39 patterns) for evaluation and 15% (55 patterns) for testing. The division keeps all the regions of a sentence in the same set, so that entire utterances can finally be classified.
-QH.0d, QH.2, QF.1b, QF.2b, QF.3b, QF.7c, QF.8c, QF.9c, QE.0b-QE.3b and QE.5 (13 input nodes).
-Angry (10000), bored (01000), happy (00100), neutral (00010) and sad (00001).
2. Conditions:
-Normalised by the maximum of the training set.
-No hidden layer; logistic activation function; Standard Backpropagation and Chunkwise Backpropagation learning algorithms with multiple step training.
-WTA analysis function (l=0 and h=0).
3. Results and conclusions:
The overall recognition rate by voiced region is 63.63%, and the recognition rate based on the whole utterance reaches 55%, but the success is unequally distributed. Bored is the best recognized class, while happy is never properly classified.
INPUT \ OUTPUT / Angry / Bored / Happy / Neutral / Sad
Angry / 50 / 0 / 0 / 50 / 0
Bored / 0 / 100 / 0 / 0 / 0
Happy / 0 / 0 / 0 / 25 / 75
Neutral / 0 / 0 / 0 / 50 / 50
Sad / 0 / 0 / 0 / 25 / 75
Table 9.5. Confusion matrix of experiment 9.1.2.2.
9.1.2.3 Region based. Feature normalization.
One of the most widely recognized properties of the voice is that it is particular to each individual. As a consequence, voice quality features depend strongly on the speaker. For this reason, three different vectors are tried for normalization. For all the trials, emotions are grouped according to their distribution on the evaluation axis (i.e. angry+sad, bored+neutral, happy). The first normalization uses the mean value of each feature within neutral utterances as divisor. The second normalization vector comes from the mean of each feature within the bored and neutral emotions, since the grouping of both emotions yields the “centered” class and should theoretically be the reference. Finally, the third normalization vector is calculated as the mean of each feature over all the patterns, independently of the emotion, as is usual in this thesis. The set of features is extended to eighteen, and the speaker considered is again id0008.
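The three normalization vectors differ only in the subset of patterns over which the feature means are taken. A minimal sketch, in which the function names and the toy data are purely illustrative and the divisors are assumed to be non-zero:

```python
def normalisation_vector(patterns, labels, reference=None):
    """Mean of each feature over the patterns whose label is in `reference`.

    reference=None means that all the patterns are used.
    """
    selected = [p for p, label in zip(patterns, labels)
                if reference is None or label in reference]
    n_features = len(selected[0])
    return [sum(p[j] for p in selected) / len(selected)
            for j in range(n_features)]


def normalise(pattern, vector):
    """Divide each feature by the corresponding entry of the vector."""
    return [x / v for x, v in zip(pattern, vector)]


# Toy data (two features per pattern), NOT taken from the experiments.
patterns = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
labels = ["neutral", "bored", "happy"]

vec_a = normalisation_vector(patterns, labels, {"neutral"})            # case a
vec_b = normalisation_vector(patterns, labels, {"bored", "neutral"})   # case b
vec_c = normalisation_vector(patterns, labels)                         # case c
print(vec_a, vec_b, vec_c)
print(normalise(patterns[1], vec_b))
```

In every case the same vector is applied to all the patterns, so the choice only changes the reference point around which the normalised features are centred.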
1. Objective:
- Speaker id0008 without listening test. The number of patterns in each class, which depends on the number of voiced regions within the utterances, is 62 angry, 56 bored, 57 happy, 57 neutral and 55 sad. From this set of 287 patterns, 232 are used for training and evaluation and 55 for testing. The division keeps all the regions of a sentence in the same set, so that entire utterances can finally be classified.
-For every voiced region, the following features are calculated: QH.0d, QH.3, QF.7b-QF.9b, QF.10-QF.15, P1.36 (jitter), QE.0b-QE.3b, QE.5 and QS.0a (18 input nodes).
-Two output configurations:
- Angry/sad (100), neutral/bored (010) and happy (001).
- Angry/sad (0.2), neutral/bored (0.6) and happy (1).
2. Conditions:
-Three different normalisation vectors:
a)Mean of the neutral class.
b)Mean of the bored and neutral classes.
c)Mean of all the patterns.
-No hidden layer; logistic activation function; Standard Backpropagation and Chunkwise Backpropagation learning algorithms with multiple step training.
-WTA analysis function (l=0 and h=0).
3. Results and conclusions:
Experiment a yields a correct recognition rate of 35.84% at the fourth step of the standard backpropagation learning algorithm. However, table 9.6 shows that the recognition is unequally distributed, and therefore the classification does not succeed.
INPUT \ OUTPUT / Angry+Sad / Neutral+Bored / Happy
Angry+Sad / 70 / 30 / 0
Neutral+Bored / 76.19 / 23.81 / 0
Happy / 66.67 / 33.33 / 0
Table 9.6. Results of experiment 9.1.2.3 a (normalized by the mean of the neutral patterns).
It can also be observed that happy is never correctly recognized. Later experiments showed that this failure is caused by the grouping of emotions: the new joint classes contain twice the amount of data, while the class that is not grouped keeps the same number of patterns, so the training is unbalanced. Since the neural network also models the probability of appearance of each class and the training is not compensated, happy is never chosen as the “winner”.
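The imbalance can be quantified from the pattern counts given in the objective of this experiment. The sketch below uses the counts of the complete set, since the per-class split between training and test is not stated; the names `COUNTS`, `GROUPS` and `priors` are illustrative.

```python
# Number of voiced-region patterns per emotion (speaker id0008, complete set).
COUNTS = {"angry": 62, "bored": 56, "happy": 57, "neutral": 57, "sad": 55}

# Grouping on the evaluation axis used in this experiment.
GROUPS = {
    "angry+sad": ["angry", "sad"],
    "neutral+bored": ["neutral", "bored"],
    "happy": ["happy"],
}

grouped = {name: sum(COUNTS[e] for e in members)
           for name, members in GROUPS.items()}
total = sum(grouped.values())
# Relative frequency of each grouped class, i.e. its prior in the data.
priors = {name: count / total for name, count in grouped.items()}

for name in grouped:
    print(f"{name:14s} {grouped[name]:4d} {priors[name]:.2f}")
```

Each joint class holds roughly 40% of the patterns, while happy holds about 20%, which is consistent with the network favouring the two joint classes when the training is not compensated.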
The second normalization vector tried (experiment b) comes from the mean of each feature within the bored and neutral emotions, since the grouping of both emotions is the “center” reference. Results under identical conditions are slightly better, raising the rate to 41.51%. However, the unequally distributed success and the failure to recognize happy are still present.
Finally, the third normalisation vector is calculated as the mean of each feature over all the patterns, independently of the emotion. In this case, the recognition rate reaches 43.40%. According to these results, the best vector for normalisation is the one composed of statistics of the whole set of patterns, as is usual in this thesis.
One of the experiments was repeated varying only the output configuration. The continuous output is tried for a particular case and compared with the analogous three-output case. The classifier yields a lower recognition rate and maintains the unequal distribution of the success. Therefore, it seems that a single output node is inadequate for classification based on quality features.
9.1.2.4 Frame based.
Frame based classification consists in calculating features for every frame within the voiced parts of the sentence. Only the frames whose intensity value is significant are taken into account. This requirement comes from the observation that Praat calculates intensity values only at the core of the vowel, whereas fundamental frequency values are also given for boundary frames that are less representative of the voiced region. Every frame is then labelled with the emotion of the sentence in which it is contained. After training and testing, an index between 0 and 1 (the neural network range) is obtained for each output and frame. Finally, two different methods are employed to make the final decision for the whole sentence once all its frames are classified. These two methods apply the following criteria, based on [Ami01]:
- Average of the emotional indices over the utterance (AV). For each output (class), the mean value of the outcomes over every frame of a sentence is calculated, and the class with the highest mean is the winner.
- Percentage of the file during which each emotional index was greater than all the others (MAX), i.e. the network output into which most input patterns have been classified is the winner.
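The two decision methods can be sketched as follows. The function names `decide_av` and `decide_max` and the list-based representation of the frame outputs are assumptions made for the illustration; ties are resolved here in favour of the first class, which the text does not specify.

```python
def decide_av(frame_outputs, classes):
    """AV: the class with the highest mean output over all frames wins."""
    n_frames = len(frame_outputs)
    means = [sum(frame[k] for frame in frame_outputs) / n_frames
             for k in range(len(classes))]
    return classes[max(range(len(classes)), key=means.__getitem__)]


def decide_max(frame_outputs, classes):
    """MAX: the class that wins the largest number of frames wins."""
    wins = [0] * len(classes)
    for frame in frame_outputs:
        wins[max(range(len(frame)), key=frame.__getitem__)] += 1
    return classes[max(range(len(classes)), key=wins.__getitem__)]


# Toy outputs for a three-frame sentence and two classes (illustrative only).
classes = ["angry", "sad"]
frames = [[0.90, 0.10],
          [0.45, 0.55],
          [0.45, 0.55]]
print(decide_av(frames, classes))    # angry: mean outputs are 0.6 vs 0.4
print(decide_max(frames, classes))   # sad: it wins two of the three frames
```

The toy example is chosen so that the two methods disagree: AV is sensitive to how confident the network is in each frame, whereas MAX only counts how many frames each class wins.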
Due to the different nature of the performed experiments (sentence, region and frame based), the feature set has to be modified to fit the requirements of each situation. Most of the features used in this experiment are frame based and come directly from the single-frame calculations of Praat. However, some features, mainly those based on medium-term variations, only make sense when calculated for the entire utterance; their value is repeated as the same feature in every frame belonging to the sentence.