Supplemental Materials

McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review. doi:10.1037/a0022325

Note 1: Phonetic Analysis

Jongman, Wayland, and Wong (2000) report extensive analyses on the measures from their database that we use here as the basis of our models. They show that, individually, each cue differed as a function of place, sibilance, and/or voicing, and most of these cues also differed as a function of the vowel context and/or the gender of the speaker. However, no attempt was made to compare the amount of variance in each cue due to each factor (although partial η² was reported for many comparisons). Moreover, such an analysis has not been conducted for any of the new cues we measured here. Thus, we evaluated 1) which cues contribute to each categorical distinction; and 2) the contributions of contextual factors (speaker and vowel). This was done with a series of regression analyses that provide a standard effect size measure that can be compared across cues and effects. Crucially, we also used these analyses to highlight and explore the contributions of the newly proposed cues.

In each analysis, a single cue was the dependent variable, and the independent variables were a combination of dummy codes reflecting a single factor of interest, such as fricative identity (7 variables), voicing (1 variable), sibilance (1 variable), or place of articulation (3 variables). In each regression, we first partialed out the effect of speaker (19 dummy codes) and vowel (5 dummy codes), before entering the effect of interest into the model. These regression analyses are necessarily exploratory, and we do not intend to draw broad conclusions from them. They are intended to provide an overall view of these cues and the factors that contribute to their variance.
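The R²change logic of these analyses can be sketched in a few lines of code: fit a baseline model containing only the context dummies (speaker and vowel), fit a second model that adds the dummy codes for the factor of interest, and take the difference in R². The sketch below does this in plain Python on simulated data; the dataset, effect sizes, and numbers of levels are invented for illustration (the real analyses used the Jongman et al. measurements, with 19 speaker and 5 vowel dummy codes).

```python
import random

def ols_r2(X, y):
    """Fit y = X @ b by ordinary least squares (normal equations solved
    with Gaussian elimination) and return the model R-squared."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][j] for i in range(n)) for j in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    A = [XtX[a][:] + [Xty[a]] for a in range(p)]
    for col in range(p):                      # forward elimination with pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):            # back substitution
        b[r] = (A[r][p] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    yhat = [sum(X[i][a] * b[a] for a in range(p)) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sse / sst

# Toy data: a fricative "cue" influenced by speaker, vowel, and voicing.
# All effect sizes here are invented for illustration.
random.seed(1)
rows = []
for speaker in range(4):
    for vowel in range(3):
        for voiced in (0, 1):
            for _ in range(5):
                cue = 100 + 8 * speaker + 5 * vowel + 20 * voiced + random.gauss(0, 4)
                rows.append((speaker, vowel, voiced, cue))

def dummies(value, n_levels):
    """n_levels -> n_levels - 1 dummy codes (level 0 is the reference)."""
    return [1.0 if value == lv else 0.0 for lv in range(1, n_levels)]

y = [r[3] for r in rows]
# Step 1: context-only model (speaker and vowel dummies, plus intercept)
X_context = [[1.0] + dummies(r[0], 4) + dummies(r[1], 3) for r in rows]
# Step 2: add the factor of interest (voicing) and measure the R2 change
X_full = [xc + [float(r[2])] for xc, r in zip(X_context, rows)]

r2_context = ols_r2(X_context, y)
r2_full = ols_r2(X_full, y)
print(f"R2 (speaker + vowel): {r2_context:.3f}")
print(f"R2 change (voicing):  {r2_full - r2_context:.3f}")
```

Because the factor of interest is entered after the context codes, its R²change reflects only the variance it explains beyond speaker and vowel, which is what makes the effect sizes comparable across cues.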

Results and Discussion

The results of the regression analyses are summarized in Table S1 (which shows the overall effects of fricative and context) and Table S2 (which shows the specific effects of each feature). There are a number of important results worth highlighting.

Fricative Identity.

Every cue was affected by fricative identity. While effect sizes ranged from very large (10 of 24 cues had R²change > .40) to very small (vowel RMS, the smallest: R²change = .011), all were highly significant. Even cues that were originally measured to compensate for variance in other cues (e.g., vowel duration was measured to normalize fricative duration) had significant effects. Spectral moments (especially the mean and variance) were particularly important, but, surprisingly, F2 (which has received substantial attention in the literature) had only a moderate effect (R²change = .119). Interestingly, its effect size was similar to those for F4 (R²change = .121) and F5 (R²change = .117), two cues that have not been previously examined.

Some cues could clearly be attributed to one feature more than to another, although no cue was associated with only a single feature. Duration cues were clearly related to voicing (DURF: R²change = .403; DURV: R²change = .055), not place of articulation (DURF: R²change = .052; DURV: R²change = .004) or sibilance (DURF: R²change = .052; DURV: R²change = .002). The same was true for low-frequency energy (voicing: R²change = .482), although this cue may also be involved in sibilance detection (R²change = .120).


Other cues were clearly about sibilance. RMSF and F5AMPF were highly correlated with sibilance (RMSF: R²change = .419; F5AMPF: R²change = .394). They were also correlated with place of articulation (RMSF: R²change = .425; F5AMPF: R²change = .401), but when place was considered separately within sibilants and within nonsibilants, little effect was seen in either, suggesting these cues primarily signal sibilance. However, several other cues that were strongly associated with sibilance were also related to place of articulation. MaxPF and F3AMPF, for example, were strongly associated with sibilance (MaxPF: R²change = .260; F3AMPF: R²change = .239), but within sibilants they were also useful for distinguishing alveolars from postalveolars (MaxPF: R²change = .504; F3AMPF: R²change = .444). Thus, these cues seem to be available to make two independent distinctions (sibilance in general, and place of articulation within sibilants).

Of the formant frequencies, F2, F4, and F5 had moderate effects that were primarily limited to place of articulation (F2: R²change = .114; F4: R²change = .119; F5: R²change = .116), and these cues appeared to be similarly useful for both sibilants (F2: R²change = .060; F4: R²change = .132; F5: R²change = .101) and nonsibilants (F2: R²change = .057; F4: R²change = .083; F5: R²change = .082).

Separate analyses examined place of articulation in sibilants (alveolar vs. postalveolar) and in nonsibilants (labiodental vs. interdental) (Table S2). While a wealth of cues were highly sensitive to place of articulation in sibilants, few were related to place in nonsibilants, and these showed only moderate to low effect sizes. Of these, the best were F4 (R²change = .083) and F5 (R²change = .082), two new cues for nonsibilants, and the skewness and kurtosis during the transition (M3trans: R²change = .061; M4trans: R²change = .062). As shown in Table S1, all of these cues are also highly context-dependent (F4: R²change = .478; F5: R²change = .339; M3trans: R² = .108; M4trans: R² = .10), suggesting that to take advantage of what little information there is for nonsibilants, listeners may need a compensatory mechanism.

Context Effects.

Contextual factors (speaker and vowel) accounted for a significant portion of the variance in every cue. Not surprisingly, speaker and vowel accounted for a massive amount of variance in cues like vowel duration (R²change = .792) and vowel amplitude (R² = .612), which were measured to capture some of the contextual variance. F0 and all five formants were also highly related to context, with average effect sizes in the 40–60% range. For F1 and F2, this was largely due to the vowel (F1: R²change = .603; F2: R²change = .514), while for the other formants it was largely due to the speaker. Most other cues showed much smaller effects, in the 8–10% range.

Unexplained Variance.

Finally, as Table S1 shows, there was a substantial amount of unexplained variance in each cue. For seven cues (F3AMPV, F5AMPV, F5, M3, M4, M2trans, and M4trans), there was more unexplained variance than the combined variance accounted for by speaker, vowel, and fricative. For the other cues, the unexplained variance was still substantial. Even some of the best (near-invariant) cues, such as MaxPF (42.3%), M2 (28.4%), M3 (54.8%), and M4 (70.4%), showed large amounts of unexplained variance. Of course, unaccounted-for variance is common in regression analyses, but it raises an interesting question here. This variance means that even fricatives spoken by the same person, in the same sentence context, and in the same recording session (e.g., repetitions 1, 2, and 3) differed substantially in their acoustic realization. Across this data set, the only factors that varied systematically were speaker, vowel, and fricative identity, and these effects have been accounted for. Thus, the unexplained variance suggests that a large portion of the variance in speech cues may actually be due to random factors, that is, noise (see Newman, Clouse, & Burnham, 2001), with which the listener must cope. It does not appear that, even if we knew all of the relevant factors, we could account for all of the variance in these cues.

Discussion.

There are several candidates for primary cues to place of articulation or sibilance, though none are completely invariant to context, and it is not clear that any are strongly related to place of articulation in nonsibilants. There were also no unique, contextually invariant cues to voicing.

Fricative identity, as well as vowel and speaker, affects virtually every cue we studied. This may present problems for models that conflate these sources of information, such as exemplar models. At the same time, some cues are likely to be more informative than others: peak frequency, the narrow-band amplitudes, and the spectral moments are strongly related to place of articulation; RMSF, the narrow-band amplitudes, and M2 are strongly related to sibilance; and DURF and LF are strongly related to voicing (though they were also strongly affected by context). Most of the place cues were helpful for sibilants, while nonsibilants showed only weak relationships with primarily context-dependent cues. Finally, we found only a handful of cues that come close to invariance: MaxPF, the narrow-band amplitudes in the fricative, and spectral moments 2–4 seemed somewhat context-independent and cued various aspects of sibilance and place of articulation. As a whole, this strongly reinforces the notion that fricative identification requires the integration of many cues, and that there are few, if any, that are invariant with respect to context and other factors.


Note 2: Analysis of Perceptual Data

The primary analysis used generalized estimating equations with a logit link function to approximate a mixed-design ANOVA with a binary dependent variable. The model included syllable-type as a between-subjects variable, along with vowel, speaker, place of articulation, and voicing as repeated measures. Vowel and speaker were included in the model as main effects only. Accuracy was the dependent variable.

This analysis was fully reported in the paper, but follow-up analyses were also run separating the data by syllable-type in order to understand the two-way interactions of place and voicing with syllable-type. Each analysis included place, voicing and their interaction as primary factors while also including independent (noninteracting) effects of vowel and speaker.
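As a dependency-free illustration of the kind of model these per-condition analyses fit, the sketch below estimates an ordinary logistic regression of accuracy on a single voicing predictor by Newton-Raphson. This is a simplification, not the actual analysis: a GEE with an independence working correlation yields these same point estimates, but the robust standard errors and Wald χ² tests reported above are omitted, and the subject structure, effect sizes, and trial counts are invented.

```python
import math
import random

def fit_logistic(X, y, iters=25, tol=1e-8):
    """Logistic regression fit by Newton-Raphson; returns coefficients."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        g = [0.0] * p                       # gradient: X'(y - mu)
        H = [[0.0] * p for _ in range(p)]   # Hessian:  X'WX
        for i in range(n):
            eta = sum(X[i][j] * b[j] for j in range(p))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                g[j] += (y[i] - mu) * X[i][j]
                for k in range(p):
                    H[j][k] += w * X[i][j] * X[i][k]
        A = [H[j][:] + [g[j]] for j in range(p)]   # solve H * step = g
        for col in range(p):
            piv = max(range(col, p), key=lambda r: abs(A[r][col]))
            A[col], A[piv] = A[piv], A[col]
            for r in range(col + 1, p):
                f = A[r][col] / A[col][col]
                for c in range(col, p + 1):
                    A[r][c] -= f * A[col][c]
        step = [0.0] * p
        for r in range(p - 1, -1, -1):
            step[r] = (A[r][p] - sum(A[r][c] * step[c]
                                     for c in range(r + 1, p))) / A[r][r]
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < tol:
            break
    return b

# Simulated trials: accuracy is higher for voiceless fricatives, with a
# small per-subject ability difference (all numbers invented).
random.seed(2)
X, y = [], []
for subj in range(20):
    ability = random.gauss(0, 0.3)
    for voiced in (0, 1):
        for _ in range(30):
            p_correct = 1.0 / (1.0 + math.exp(-(1.5 - 0.8 * voiced + ability)))
            X.append([1.0, float(voiced)])
            y.append(1 if random.random() < p_correct else 0)

b0, b_voiced = fit_logistic(X, y)
print(f"intercept = {b0:.2f}, voicing coefficient = {b_voiced:.2f}")
```

The negative voicing coefficient recovered here corresponds to the direction of the voicing effect reported below (better performance on voiceless than on voiced sounds).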

In the complete-syllable condition we found a significant main effect of speaker (Wald χ²(9) = 135.5, p < .0001). Z was not significant individually (Wald χ²(1) = 2.0, p = .156; Figure S2C).

In the fricative-noise condition, speaker was still significant (Wald χ²(9) = 196.9, p < .0001), but vowel was no longer significant (Wald χ²(2) = .9, p = .6; Figure S2B). This implies that the vowel effect seen in the complete-syllable condition was not due to the particular frication produced before an /i/ (or any other vowel) being more (or less) ambiguous: heard alone, there was no effect of vowel. Rather, the vowel contributes something beyond simply altering the cues in the frication. As before, place of articulation was significant (Wald χ²(3) = 189.8, p < .0001; Figure S1A), with all three places differing significantly from postalveolars (labiodentals: Wald χ²(1) = 38.1, p < .0001; interdentals: Wald χ²(3) = 65.0, p < .0001; alveolars: Wald χ²(1) = 10.9, p = .001). This time, voicing was significant (Wald χ²(1) = 15.0, p < .0001), with better performance on voiceless than on voiced sounds. The voicing × place interaction was also significant (Wald χ²(1) = 34.5, p < .0001; Figure S1D), due to a significant effect of voicing in interdentals (Wald χ²(1) = 12.0, p = .001) but not in labiodentals (Wald χ²(1) = .4, p = .5) or alveolars (Wald χ²(1) = .5, p = .48).

To summarize our findings, we found that 1) performance without the vocalic portion was substantially worse than with it; 2) performance varied substantially across speakers; 3) sibilants were easier to identify than nonsibilants but there were place differences even within the sibilants; 4) voicing effects were largely restricted to the interdental fricatives; and 5) the identity of the vowel affected performance, but only in the complete-syllable condition. Thus, either particular vowels alter the secondary cues in the vocalic portion that mislead (or help) listeners, or the identity of the vowel causes subjects to treat the cues in the frication noise differently. Most likely it is the latter—the lip rounding created by /u/ has a particularly strong effect on the frication, and listeners’ ability to identify the vowel (and thus account for these effects) may thus offer a large benefit for fricatives preceding a /u/ that is not seen for the unrounded vowels.


Note 3: Confusion Matrices

While our primary analysis of the empirical data (and the model) focused on overall accuracy as a function of fricative, vowel, and speaker, listeners' (and models') responses were not dichotomous. Rather, listeners (and models) selected which of the eight fricatives was their response for each stimulus. In this section, we present the confusion matrices (the likelihood of responding with a given fricative given the one that was heard) as an alternative metric for evaluating the experiment and models. This necessarily ignores the context effects, but it paints a picture parallel to the analysis presented in the manuscript: the compensation / C-CuRE model performs like listeners in the complete-syllable condition, while the cue-integration model succeeds in the frication-only condition.
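Computing such a confusion matrix is simply a matter of tallying responses by stimulus and normalizing each row. The sketch below does this in plain Python; the trial counts are invented to echo the /f/ pattern described in the listener data (more place than voicing confusions), and ASCII stand-ins such as "th" and "dh" are used for /θ/ and /ð/.

```python
from collections import Counter, defaultdict

# ASCII stand-ins for the eight English fricatives
FRICATIVES = ["f", "v", "th", "dh", "s", "z", "sh", "zh"]

def confusion_matrix(trials):
    """trials: list of (stimulus, response) pairs.
    Returns {stimulus: {response: proportion}}; each row sums to 1."""
    counts = defaultdict(Counter)
    for stim, resp in trials:
        counts[stim][resp] += 1
    return {stim: {r: c[r] / sum(c.values()) for r in FRICATIVES}
            for stim, c in counts.items()}

# Hypothetical trials: /f/ is mostly correct and confused with
# same-voicing /th/ more often than with /v/
trials = [("f", "f")] * 85 + [("f", "th")] * 12 + [("f", "v")] * 3
cm = confusion_matrix(trials)
print({r: round(p, 2) for r, p in cm["f"].items() if p > 0})
# → {'f': 0.85, 'v': 0.03, 'th': 0.12}
```

Each row of the resulting matrix is a conditional response distribution, which is what Tables S3 and S4 report for listeners and the models, respectively.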

3.1 Listener Data


Table S3 shows confusion matrices for each condition in the perceptual experiment. In the complete-syllable condition, listeners were accurate overall (M = 91.2%), particularly on the sibilants (M = 97.4%). The only systematic confusions were for nonsibilants, and within these, participants typically chose the wrong place of articulation but maintained voicing. For example, when /f/ was miscategorized, /v/ was selected 1% of the time, but /θ/ 10.4% of the time. Similarly, /v/ was confused with /ð/ 8.5% of the time, but with /f/ only 0.8% of the time. The only exception was /ð/, the most difficult fricative (M = 74.4%), which showed confusions in both voicing (/θ/: 7.5%) and place (/v/: 17.0%).

In the frication-only condition, performance dropped substantially (M = 76.3%), but the overall pattern remained. For nonsibilants, the majority of confusions involved place of articulation (M = 27.0% across all four), though confusions in voicing were also more common (M = 8.1%). The error rate for sibilants was higher than in the complete-syllable condition, but errors were still few, and they slightly favored voicing (M = 3.9%) over place (M = 1.0%).

Across both conditions, confusions respected sibilance. Nonsibilants tended to be confused with other nonsibilants (complete-syllable: M = 14.5%; frication-only: M = 38.4%) rather than with sibilants (complete-syllable: M = 0.6%; frication-only: M = 2.2%); and while sibilants were rarely confused in the complete-syllable condition, in the frication-only condition they tended to be confused with other sibilants (M = 5%) and were rarely labeled as nonsibilants (M = 1.7%).

3.2  Invariance Model

The confusion matrix for this model (Table S4) shows some similarities to listeners but also major differences. Like listeners, the model was more likely to confuse place of articulation than voicing. However, unlike listeners, it classified /f/ as /v/ 4.8% of the time (listeners: M = 1.0%). Also like listeners, the model's errors tended to respect sibilance, although /f/ and /ð/ were exceptions. More surprisingly, all the sibilants showed a small but noticeable rate of confusion with nonsibilants.


Finally, a number of confusions did not resemble listeners at all. /ð/ was classified as /f/ and /z/ at very high rates (5.2% and 5.4%, respectively) compared with listeners (.3% and .5%). Conversely, /ð/ was rarely classified as /θ/ (1.7%), while that confusion was common for listeners (7.5%). And when listeners heard /s/, virtually all errors were to /z/, while the model's errors were evenly distributed across the three other sibilants. Thus, the pattern of errors in the invariance model does not seem well correlated with that of listeners.