The effects of synthesized anchors on
the reliability of perceptual voice evaluation
Mok, Siu Man Rosa
A dissertation submitted in partial fulfillment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, April 30, 2003
ABSTRACT
Perceptual voice evaluation is crucial in voice research and clinical practice. However, serious concerns arise regarding its reliability due to the subjective nature of perceptual voice measures. A descriptive framework proposed by Kreiman, Gerratt, Kempster, Erman and Berke (1993) states that listeners judge the quality of a voice signal by comparing the voice stimuli with their internal mental standards. These standards are unstable and vary substantially among listeners. One way to improve reliability in voice quality perception is to provide listeners with external anchors.
The primary objective of this study was to investigate how reliably listeners could find the best match for a pathological voice sample from a set of synthesized anchors varying in roughness and breathiness. The second objective was to determine whether the provision of anchors increased listeners’ confidence in rating voices. Twenty-five undergraduate speech pathology students were asked to rate the severity of roughness and breathiness of natural pathological voice samples using two scales: an eight-point equal-appearing interval (EAI) scale and an eight-point EAI scale with a synthesized anchor for each scale point. They also rated their confidence in judging each testing stimulus on a seven-point EAI scale. The results showed that, in general, the provision of anchors improved the intra-rater agreement in rating female breathy stimuli and the inter-rater reliability in rating normal, mild and moderate stimuli, while confidence levels were generally lower with the anchored scale. These results suggest that an explicitly anchored protocol may be used as an assessment tool to improve the reliability of clinical perceptual voice evaluation.
INTRODUCTION
Perceptual voice judgment is crucial in the evaluation and treatment of individuals with voice disorders. The results help clinicians to plan voice therapy and to document treatment outcome. The importance of perceptual voice evaluation is also demonstrated by its frequent use as a criterion validation of instrumental voice measures, such as acoustic and aerodynamic measurements (Moran & Gilbert, 1984; Yumoto, Sasaki, & Okamura, 1984). However, serious concerns arise regarding its reliability due to the subjective nature of perceptual voice measures. The lack of widely accepted definitions of voice qualities and the inherently multidimensional nature of pathological voice samples also increase the complexity of perceptual voice ratings. A review by Kreiman, Gerratt, Kempster, Erman and Berke (1993) showed that both intra-rater and inter-rater reliability varied greatly in perceptual voice studies, and inter-rater reliability was often more of a concern than the intra-rater reliability (Gerratt, Kreiman, & Berke, 1994; Kreiman & Gerratt, 1998, 2000). The establishment of listeners’ reliability in rating pathological voices therefore became a central issue in perceptual voice evaluation as it laid the foundation of voice assessment.
Kreiman and her associates (1993) proposed a number of factors that would contribute to listeners’ variability in perceptual voice rating. According to their framework, perceptual voice rating is analogous to a matching task in which listeners compare external stimuli that they are to rate with their own internal (mental representation) standards of voice quality. These mental representations are formed from listeners’ prior experience with voices and they vary substantially from one individual to another. Moreover, these internal standards are unstable and may be influenced by internal and external factors other than the acoustic features of voice stimuli being rated, such as memory, attention and acoustic context (Gerratt, Kreiman, Antonanzas-Barroso, & Berke, 1993; Kreiman, Gerratt, Precoda, & Berke, 1992). All these factors contribute to listener disagreement in scalar ratings of voice quality and explain the wide range of variability among listeners in perceptual voice evaluation.
Due to the unstable mental representations of voice quality stored in memory, the provision of external voice anchors has been suggested to reduce the variability in perceptual voice quality rating. Such a protocol provides the listeners with a constant set of references for different voice qualities, thus replacing the unstable internal quality standards (Gerratt et al., 1994; Kreiman et al., 1993). A number of studies (e.g. Chan & Yiu, 2002; Gerratt et al., 1993) have shown that the provision of external anchors improved the participants’ reliability in perceptual voice evaluation. For example, Gerratt et al. (1993) used a non-anchored five-point equal-appearing interval (EAI) scale and a five-point EAI scale with each scale point represented by a synthetic voice sample to rate the severity of roughness of synthetic stimuli. Their findings showed that the anchored protocol achieved significant improvements in intra- and inter-rater reliability relative to the non-anchored scale. However, the relative simplicity of the synthetic testing stimuli used in their study made the rating task much easier than the perceptual judgment of natural dysphonic voice samples. The use of sustained vowels in perceptual voice rating also limited the conclusions that might be drawn from their study, as connected speech is generally regarded as more representative of daily voice use (e.g. Hammarberg, Fritzell, Gauffin, Sundberg, & Wedin, 1980; Revis, Giovanni, Wuyts, & Triglia, 1999; Yiu, Worrall, Longland, & Mitchell, 2000). More recently, Chan and Yiu (2002) employed synthesized and natural voice anchors to determine the effects of the provision of anchors on reliability in perceptual voice rating. Their findings showed that the provision of synthesized anchors yielded better reliability than the use of natural voice anchors or no anchors.
However, only three anchors at different severity levels (normal, mild and moderate) were given in their study and the listeners still needed to refer to their internal standards of dysphonic qualities. In the present study, a range of anchors with varying severity levels of a particular voice quality was provided, and listeners were asked to match any given natural voice sample with an anchor. The selected anchor then represented the listener’s rating of the voice sample. In addition, as connected speech is considered more suitable than sustained vowels for representing an individual’s daily voice use in voice analysis (Hammarberg et al., 1980; Yiu et al., 2000), the present study used connected speech as the external anchors. The present study further investigated how confident the listeners were in using the anchored and non-anchored scales. As both voice research and clinical practice are often built upon perceptual voice measures, it is important to use a rating scale that can foster the confidence of clinicians or researchers in perceptual voice evaluation.
Two types of external anchors, natural and synthesized voice samples, have been reported in the literature to improve the reliability of perceptual voice evaluation (e.g. Chan & Yiu, 2002; Gerratt et al., 1993). Although natural voice anchors are closer in nature to the dysphonic samples to be evaluated, it is often difficult to find a set of natural dysphonic voice samples with systematic variation of individual parameters, as natural samples usually show a combination of several perceptual qualities. Furthermore, there must first be a large database of natural pathological voice samples from which the appropriate anchors can be selected. These factors limit the value and ease of using natural voice samples as anchors. Synthesized signals, on the other hand, do not have these constraints. Synthesis parameters can be manipulated individually and systematically to create signals with varying severity and types of quality (Yiu, Murdoch, Hird, & Lau, 2002). Other strengths of synthesized signals include simplicity and ease of reproduction. Compared with natural voice samples, synthesized signals can be made less acoustically complex, as their acoustic characteristics are determined by the manipulation of synthesis parameters. This may make it easier to investigate how acoustic features affect voice quality perception. The relative simplicity of synthesized anchors also facilitates replication of studies, as other researchers can reproduce these anchors if the synthesis parameters are provided (Martin & Wolfe, 1996). Chan and Yiu (2002) employed natural and synthesized signals as anchors and their results showed that synthesized anchors were more effective in improving the reliability of trained listeners in perceptual voice judgment. The present study therefore used synthesized voice signals as external anchors.
The Klatt synthesizer, which is commercially available and has been used in a number of studies (e.g. Bangayan, Long, Alwan, Kreiman, & Gerratt, 1997; Chan & Yiu, 2002; Yiu et al., 2002), was employed in the present study to create synthesized signals. It was chosen because it has been shown to be capable of creating breathy and rough signals (Yiu et al., 2002). Moreover, other researchers can explore this area without requiring a more sophisticated program, as it can be run on a personal computer with the Windows platform (Klatt & Klatt, 1990).
The reliability of perceptual voice evaluation also depends on the type of voice quality to be rated. Reports showed that researchers commonly investigated roughness and breathiness in perceptual voice evaluation due to their importance in describing a wide range of voice pathologies (Chan & Yiu, 2002; Kreiman et al., 1993; Martin & Wolfe, 1996; Revis et al., 1999). These two perceptual qualities are also the most familiar labels in perceptual voice judgment and have been shown to demonstrate higher reliability than others in perceptual evaluation (Gerratt et al., 1994; Hammarberg et al., 1980; Oates & Russell, 1998; Wolfe, Fitch, & Martin, 1997). Roughness is perceived as an irregular quality and a lack of clarity that is due to the aperiodic vibration of the vocal folds (Chan & Yiu, 2002). Breathiness is perceived as audible air escape and frication noise during speech production because of incomplete closure of the vocal folds (Hirano, 1981). The reliability of evaluating these two voice qualities using the two rating paradigms was investigated in the present study.
The present study had two objectives. First, it aimed to investigate how reliably listeners could find the best match for a natural voice signal from a set of synthesized anchors varying in roughness and breathiness in a perceptual voice evaluation task. Unlike previous perceptual voice studies using an anchored rating paradigm (Chan & Yiu, 2002; Gerratt et al., 1993), listeners in the present study would not be required to give perceptual ratings to the presented stimuli using the anchored protocol. Instead, synthesized anchors covering a range of severity levels of roughness and breathiness would be provided and the listeners would be asked to choose the synthesized anchor that best matched the breathy and rough quality of the target voice sample. The selected synthesized anchor for each quality then represented the listener’s perceptual rating of roughness or breathiness of the target stimulus. The listeners also rated the same set of testing stimuli using a non-anchored scale as a control task. It was hypothesized that when listeners compared the presented stimuli with the explicit external reference stimuli using the anchored protocol, the intra-rater and inter-rater reliability would be higher than with the non-anchored scale, in which they compared the testing stimuli with their unstable internal representations.
The second objective of this study was to investigate whether the listeners were more confident in perceptual voice rating when synthesized anchors were provided. It was hypothesized that listeners would be more confident in voice ratings when using the anchored scale as they could match the target voice samples with the explicit synthetic anchors provided. Therefore, they did not need to refer to their unstable mental representations of voice qualities.
PILOT STUDIES
The objective of the pilot study was to develop a set of synthesized signals that were to be used as anchors in the main study for signaling roughness and breathiness. Synthesized stimuli were selected so that 1) they covered a range of severity levels of roughness and breathiness; 2) they were perceptually distinguishable from the next or previous stimuli in the roughness or breathiness series by at least 80% of naïve listeners in the pilot study.
Method
Preparation of synthesized stimuli
Anchor stimuli based on the Cantonese sentence /pa pa ta p/ (father hits the ball) were used in the present study. This sentence was chosen because all its consonants are unaspirated plosives, so that any frication noise originating from fricatives or aspirated consonants is eliminated, reducing the risk of masking the breathy voice quality arising at the laryngeal level (Chan & Yiu, 2002).
Synthetic continua of roughness and breathiness for each gender were created using the HLSyn Speech Synthesis System (version 2.2; Sensimetrics) on a Microsoft Windows platform. HLSyn is a Klatt synthesizer (Klatt & Klatt, 1990) with the addition of ‘high-level’ synthesis parameters. In this study, only the ‘low-level’, or original Klatt, synthesis parameters were used.
A prototype (normal voice quality) Cantonese sentence /pa pa ta p/ was synthesized for each gender. The synthesis parameters (fundamental frequency, formant frequencies and duration of vowels) for each prototype sentence were based on those used by Yiu et al. (2002). The synthesis parameters that were related to voice qualities (amplitude of aspiration [AH], diplophonia [DI], amplitude of voicing [AV] and spectral tilt [TL]) were varied systematically to create different severity levels of roughness and breathiness.
Breathiness
Studies (Klatt & Klatt, 1990; Yiu et al., 2002) have shown that the addition of aspiration noise (AH) and spectral tilt (TL) to the stimuli results in the perception of breathiness. Therefore, breathy quality was synthesized by varying the parameters AH and TL. The AH value was varied in 5dB steps (with the TL value set at 0dB) until it reached the maximum AH value, i.e. 80dB. From then onwards, TL was varied in 20dB steps (see Table 1).
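The stepping rule above can be written out as a short enumeration. The following Python sketch is only illustrative: the function names are hypothetical, the starting AH values (50 for the female set, 65 for the male set) and the upper TL value of 40 are read from Table 1, and the code produces parameter labels only, not audio.

```python
def breathiness_series(start_ah, max_ah=80, ah_step=5, tl_values=(20, 40)):
    """Enumerate (AH, TL) settings for one breathiness continuum.

    AH rises in 5 dB steps with TL fixed at 0 until AH reaches its maximum;
    after that AH stays at the maximum and TL rises in 20 dB steps.
    """
    series = [{"AH": ah, "TL": 0} for ah in range(start_ah, max_ah + 1, ah_step)]
    series += [{"AH": max_ah, "TL": tl} for tl in tl_values]
    return series


def breathiness_label(params):
    """Build a label in the style used in Table 1, e.g. 'AH80TL20'."""
    label = f"AH{params['AH']}"
    if params["TL"]:
        label += f"TL{params['TL']}"
    return label


# Starting values assumed from Table 1: AH50 (female set), AH65 (male set).
female_labels = [breathiness_label(p) for p in breathiness_series(50)]
male_labels = [breathiness_label(p) for p in breathiness_series(65)]
print(female_labels)
print(male_labels)
```

Adding the prototype sentence in front of each list gives the ten female and seven male breathiness stimuli listed in Table 1.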
Roughness
Yiu et al. (2002) found that when the AV value was at 80% and the DI value was below 4%, the signals were perceived as rough. When the DI value was increased beyond 4%, the vocal fry quality became more prominent (Yiu et al., 2002). As vocal fry was not a major focus of the present study, experimentation was carried out to explore which synthesis parameters would produce a reasonable roughness effect with minimal vocal fry. It was found that an AH value of 80dB added to each rough signal was able to reduce the perception of vocal fry. Therefore, rough quality was synthesized by varying the parameters AV, AH and DI. The DI value was varied in 4% steps, with the AV and AH values held constant at 80% and 80dB respectively (see Table 1).
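The roughness continuum can be enumerated in the same way. This is a minimal sketch with hypothetical function names; the DI range of 2 to 26 is taken from Table 1, and the units follow the text (AV in %, AH in dB, DI in %).

```python
def roughness_series(av=80, ah=80, di_start=2, di_stop=26, di_step=4):
    """Enumerate (AV, AH, DI) settings for one roughness continuum.

    AV and AH are held constant while DI rises in 4% steps.
    """
    return [
        {"AV": av, "AH": ah, "DI": di}
        for di in range(di_start, di_stop + 1, di_step)
    ]


def roughness_label(params):
    """Build a label in the style used in Table 1, e.g. 'AV80AH80DI2'."""
    return f"AV{params['AV']}AH{params['AH']}DI{params['DI']}"


rough_labels = [roughness_label(p) for p in roughness_series()]
print(rough_labels)
```

Together with the prototype sentence, this yields the eight roughness stimuli listed for each gender in Table 1.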
When the AV parameter was set above 65% and the AH parameter was set above 75dB, amplitude clipping was noted in the waveforms of the signals. The gain controls of voicing (GV), aspiration (GH) and frication (GF) were adjusted accordingly from 60dB to eliminate the clipping. The values of GV, GH and GF were finally set at 57dB for breathiness in both gender sets, 42dB for the female roughness set and 43dB for the male roughness set.
All other synthesis parameters were held constant at the Klatt’s recommended default values. The synthesized stimulus thresholds of roughness and breathiness recommended by Yiu et al. (2002) were used as the threshold stimuli in the first pilot study (Pilot 1). Stimuli were synthesized at a sample rate of 11025Hz using the HLSyn. The anchors were equalized for peak amplitude and upsampled to 44100Hz using Cool Edit prior to playback by a computer. The number of stimuli synthesized for each gender quality was limited by the beginning threshold and the maximum synthesis value for a particular quality without producing distorted signals.
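The two post-processing steps, peak equalization and upsampling from 11025Hz to 44100Hz (an exact factor of four), can be illustrated with a small pure-Python sketch. This is not the procedure implemented in Cool Edit, whose resampling algorithm is not described here: the function names and the target peak are assumptions, and linear interpolation stands in for the band-limited filtering a real resampler would apply.

```python
def equalize_peak(samples, target_peak=0.9):
    """Scale a signal so that its largest absolute sample equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent signal: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def upsample_linear(samples, factor=4):
    """Upsample by an integer factor using linear interpolation.

    Simplification: a production resampler would low-pass filter the
    result to suppress imaging artefacts.
    """
    if len(samples) < 2:
        return list(samples)
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])
    return out


# Hypothetical toy signal; real stimuli would be read from the synthesizer output.
toy = [0.0, 0.25, -0.5, 0.125]
equalized = equalize_peak(toy, target_peak=1.0)
upsampled = upsample_linear(equalized, factor=44100 // 11025)
print(len(toy), len(upsampled))
```

Equalizing every anchor to the same peak keeps overall level from serving as a cue to severity, which is presumably why the step was applied before playback.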
Table 1. The synthesis values for the stimuli.
Female breathiness | Female roughness | Male breathiness | Male roughness
------------------ | ---------------- | ---------------- | --------------
PROTOTYPE          | PROTOTYPE        | PROTOTYPE        | PROTOTYPE
AH50               | AV80AH80DI2      | AH65             | AV80AH80DI2
AH55               | AV80AH80DI6      | AH70             | AV80AH80DI6
AH60               | AV80AH80DI10     | AH75             | AV80AH80DI10
AH65               | AV80AH80DI14     | AH80             | AV80AH80DI14
AH70               | AV80AH80DI18     | AH80TL20         | AV80AH80DI18
AH75               | AV80AH80DI22     | AH80TL40         | AV80AH80DI22
AH80               | AV80AH80DI26     | --               | AV80AH80DI26
AH80TL20           | --               | --               | --
AH80TL40           | --               | --               | --
Note. Default values for the prototype sentence: AH40, AV60, DI0, TL0
Pilot 1
Subjects
Ten naive listeners (mean age = 20.4, SD = 0.70, range = 20-22) were asked to serve as judges. They were undergraduate students from the University of Hong Kong and were native Cantonese speakers with reported normal hearing, normal voice and healthy condition. None of them had received any training in perceptual voice evaluation.
Procedures
Stimuli were presented through Microsoft PowerPoint 2000 at a comfortable listening level in a sound-treated room using a pair of professional-quality headphones (Sennheiser, HD 25) through a Creative Extigy Signal Processing unit. The synthesized stimuli in each of the roughness and breathiness series in Table 1 were paired with the next stimulus up in the series for presentation to the listeners. This resulted in seven stimulus pairs in the roughness series of each gender set, and nine and six stimulus pairs in the breathiness series of the female and male sets respectively. The stimulus pairs were presented in four blocks of trials, each block containing one set of stimuli: the female breathiness, female roughness, male breathiness or male roughness set. The judges listened to these stimulus pairs and decided whether the voice pairs were identical or not. The presentation order of the four blocks was randomized across the listeners. Four practice trials were given at the beginning of the test in order to familiarize the listeners with the task. The practice items were synthesized by varying the ‘flutter’ parameter, which was not a parameter of interest in the present study. Listeners completed all four blocks of trials within approximately 20 minutes.
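The pairing scheme and the pair counts reported above can be checked with a brief Python sketch. The stimulus labels are transcribed from Table 1; the function names and the use of a seeded random generator for block order are assumptions, since the actual presentation was prepared in PowerPoint.

```python
import random

# Stimulus series transcribed from Table 1 (prototype first, then rising severity).
ROUGHNESS = ["PROTOTYPE"] + [f"AV80AH80DI{di}" for di in range(2, 27, 4)]
SERIES = {
    "female breathiness": ["PROTOTYPE", "AH50", "AH55", "AH60", "AH65", "AH70",
                           "AH75", "AH80", "AH80TL20", "AH80TL40"],
    "female roughness": ROUGHNESS,
    "male breathiness": ["PROTOTYPE", "AH65", "AH70", "AH75", "AH80",
                         "AH80TL20", "AH80TL40"],
    "male roughness": ROUGHNESS,
}


def adjacent_pairs(series):
    """Pair each stimulus with the next one up in its series."""
    return list(zip(series, series[1:]))


def block_order(block_names, rng):
    """Return a shuffled copy of the block names for one listener."""
    order = list(block_names)
    rng.shuffle(order)
    return order


pair_counts = {name: len(adjacent_pairs(s)) for name, s in SERIES.items()}
print(pair_counts)

rng = random.Random(1)  # hypothetical seed, fixed only to make the sketch repeatable
print(block_order(SERIES, rng))
```

A series of n stimuli gives n - 1 adjacent pairs, which accounts for the nine, seven, six and seven pairs in the four blocks.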
Results for Pilot 1
For the breathiness series, all listeners perceived the male stimulus pairs as different in severity, while at least 80% of the listeners perceived the female stimulus pairs from AH60 to AH80TL40 as different. For the roughness series of both gender sets, the percentage of listeners who perceived the stimulus pairs as different in severity varied from 60% to 100% (see Table 2).
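The selection criterion stated for the pilot study, that adjacent stimuli be distinguishable by at least 80% of the naive listeners, amounts to a simple threshold check. The sketch below uses hypothetical function names and made-up response counts purely for illustration; the observed percentages are those reported in Table 2.

```python
def meets_criterion(n_different, n_listeners, threshold_pct=80):
    """True if at least threshold_pct percent of listeners judged the pair as different."""
    if n_listeners <= 0:
        raise ValueError("n_listeners must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return n_different * 100 >= threshold_pct * n_listeners


def pairs_below_criterion(counts, n_listeners, threshold_pct=80):
    """List the stimulus pairs that fail the distinguishability criterion."""
    return [
        pair for pair, n_different in counts.items()
        if not meets_criterion(n_different, n_listeners, threshold_pct)
    ]


# Hypothetical counts of 'different' responses out of 10 listeners (not study data).
example_counts = {"pair A": 10, "pair B": 8, "pair C": 6}
print(pairs_below_criterion(example_counts, n_listeners=10))
```

With ten listeners, a pair needs at least eight "different" responses to meet the criterion, so a pair judged different by 60% of listeners would fall short.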