Periodic Progress Report
Research Training Network
Hearing, Organisation and Recognition of Speech in Europe
HOARSE
Contract N°: HPRN-CT-2002-00276
Commencement date of contract: 1/9/2002
Duration of contract (months): 48
Period covered by the report: 1/9/2003 to 31/8/2004
Coordinator:
Professor Phil Green
Department of Computer Science, University of Sheffield
Regent Court, Portobello St.,
Sheffield S1 4DP, UK
Phone: +44 114 222 1828; Fax: +44 114 222 1810; e-mail
HOARSE Partners
1. The University of Sheffield [USFD] (coordinator)
2. Ruhr-Universität Bochum [RUB]
3. DaimlerChrysler AG [DCAG]
4. Helsinki University of Technology [HUT]
5. Institut Dalle Molle d’Intelligence Artificielle Perceptive [IDIAP]
6. Liverpool University [UNILIV]
7. University of Patras [PATRAS]
Part A. Research Results
A.1 Scientific Highlights
At least one young researcher is now in post at each HOARSE lab. Here are the highlights of their work so far.
At Sheffield, doctoral researcher JANA EGGINK continues her work on automatic instrument recognition in polyphonic music under Task 1.5, Auditory scene analysis in music. A system for recognising the solo instrument in accompanied sonatas and concertos has been developed. Compared with the previous system, which used a missing-feature approach for instrument recognition in music with only a small number of concurrent tones, the focus has shifted from the background to the foreground. Instead of identifying regions dominated by interfering sound sources, only the harmonic series belonging to the dominant instrument is identified and used for recognition. Test material is taken from commercially available classical music CDs, with no restrictions placed on the background accompaniment. The recognition accuracies achieved are comparable to those of systems developed to deal with monophonic music only. In an additional step, knowledge about the solo instrument is used to extract the F0s of the main melodic line played by that instrument. Combining different knowledge sources in a probabilistic framework led to a significant improvement in F0 estimation compared with a baseline system using only bottom-up processing.
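Purely as an illustration of the foreground-oriented idea described above (this is a minimal sketch, not the Sheffield system itself; the function names and the simple magnitude-summation score are our assumptions):

```python
import numpy as np

def harmonic_score(spectrum, sr, n_fft, f0, n_harmonics=10):
    """Score a candidate F0 by summing spectral magnitude at its harmonics."""
    bin_hz = sr / n_fft
    bins = [int(round(k * f0 / bin_hz)) for k in range(1, n_harmonics + 1)]
    return sum(spectrum[b] for b in bins if b < len(spectrum))

def dominant_f0(spectrum, sr, n_fft, f0_grid):
    """Pick the candidate F0 whose harmonic series carries the most energy."""
    scores = [harmonic_score(spectrum, sr, n_fft, f0) for f0 in f0_grid]
    return f0_grid[int(np.argmax(scores))]
```

Applied frame by frame, the F0 with the strongest harmonic series gives a crude estimate of the dominant instrument's melodic line; the actual system combines such bottom-up evidence with knowledge about the solo instrument in a probabilistic framework.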
At Bochum, doctoral researcher JUHA MERIMAA has concentrated on HOARSE Tasks 2.1 (Researching the precedence effect), 2.2 (Reliability of auditory cues in multi-source scenarios) and 2.3 (Perceptual models of room reverberation with application to speech recognition). A novel auditory modelling mechanism has been proposed that predicts localisation under precedence effect, multi-source and reverberant conditions. The model is currently being investigated further by gathering new experimental data on the precedence effect. The perception of room reverberation has also been investigated in a study of spatial impression. The first part of this work involved developing the experimental methods and finding and training suitable test subjects for the listening experiments. The ongoing work concentrates on the effect of conflicting binaural cues on perception, and on grouping the cues into those related to sound sources and those related to acoustical environments. Furthermore, a novel method for multi-channel loudspeaker reproduction of room reverberation has been developed in collaboration with HUT, leading to several joint papers.
At DCAG, doctoral researcher JULIEN BOURGEOIS is working on Task 4.2, Informing speech recognition. This year he concentrated on comparing linear blind source separation (BSS) methods with minimum-variance (beamforming) techniques for separating driver and co-driver speech in cars. We observed experimentally that BSS performs poorly when the microphones are placed on the roof, as close as possible to the mouth of each speaker. We examined this theoretically and showed that when the input signal-to-interference ratio (SIR) at the microphone is above a certain threshold, BSS cannot provide any crosstalk reduction, whereas beamforming still improves the SIR. Another limitation of BSS methods is their slower convergence on so-called non-causal mixtures, which arise, for example, when the two speakers are in the same half-plane defined by the microphone positions. As a consequence, it is not advantageous to incorporate spatial prior information (available in cars) as hard constraints on the separation filters. This finding is confirmed in other experimental settings. Therefore, classical beamforming methods are preferable whenever reasonable speaker activity detection can be achieved. In Task 5.1, Speech recognition evaluation in multi-speaker conditions, DCAG made additional recordings using the commercial S-Klasse mirror beamformer and close-talk microphones. In further multi-speaker recognition evaluations on these recordings, BSS methods produced a smaller word error rate reduction than beamforming.
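For illustration, a minimal delay-and-sum beamformer of the kind compared against BSS above can be sketched as follows; this is a generic textbook formulation, not the DCAG mirror-beamformer implementation, and the integer-sample steering delays are a simplifying assumption:

```python
import numpy as np

def delay_and_sum(channels, steering_delays):
    """Generic delay-and-sum beamformer: time-align each microphone
    channel on the target direction and average. Delays in samples."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, steering_delays)]
    return np.mean(aligned, axis=0)

def sir_db(target, interference):
    """Signal-to-interference ratio in dB."""
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))
```

Steering the array at the driver attenuates the co-driver by spatial filtering alone, which is why, unlike BSS, the SIR gain does not vanish when the input SIR is already high.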
At HUT Helsinki, working on HOARSE Tasks 3.1, Glottal excitation estimation, and 3.2, Voice production studies, doctoral researcher EVA BJÖRKNER has studied physiological differences between chest and head register in the female singing voice. Oral airflow was inverse filtered for a sequence of /pae/ syllables sung at constant pitch and decreasing vocal loudness in each register by seven female musical theatre singers. Ten equidistantly spaced subglottal pressure (Ps) values were selected and the relationships between Ps and several parameters were examined. The normalised amplitude quotient (NAQ) was used to measure glottal adduction. Development and evaluation of inverse filtering have been studied using physiological modelling of voice production as well as high-speed digital imaging of vocal fold vibration. The experiment thus combines several Ps values with NAQ as a measure of glottal adduction.
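The NAQ itself is a simple time-domain measure: the AC flow amplitude divided by the product of the negative peak of the flow derivative and the period length. A minimal sketch, assuming one isolated glottal flow cycle as a NumPy array (our illustration, not the HUT analysis code):

```python
import numpy as np

def naq(flow_cycle, fs):
    """Normalised amplitude quotient for one glottal flow cycle:
    NAQ = f_ac / (d_peak * T), where f_ac is the peak-to-peak flow
    amplitude, d_peak the magnitude of the negative peak of the flow
    derivative, and T the cycle length in seconds."""
    T = len(flow_cycle) / fs
    f_ac = flow_cycle.max() - flow_cycle.min()
    d = np.diff(flow_cycle) * fs   # flow derivative
    d_peak = -d.min()              # magnitude of the negative peak
    return f_ac / (d_peak * T)
```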
At IDIAP, researcher VIKTORIA MAIER has studied contextual and temporal information in speech and its use in ASR. The relevant HOARSE Task is 4.2, Informing speech recognition.
- The classic experiment of Liberman et al. (1952) was redesigned and a perceptual test was run on a group of 37 listeners. Results were broadly consistent with Liberman et al. (1952). The implications for HMM-based speech recognition systems were discussed.
- The relationship between the number of emitting states in a model and phoneme duration has been analysed (see the illustrative sketch below).
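The underlying constraint is elementary: a strict left-to-right HMM with no skip transitions must emit at least one frame per state, so the number of emitting states sets a hard minimum phone duration, while the self-loop probability sets the expected duration. A minimal sketch (the function names are ours, for illustration):

```python
def min_duration_frames(n_states):
    """A strict left-to-right HMM with no skips must emit at least one
    frame per state, so the minimum duration is n_states frames."""
    return n_states

def expected_duration_frames(n_states, self_loop_prob):
    """Each state's occupancy is geometric with mean 1 / (1 - a),
    so the expected duration is n_states / (1 - a) frames."""
    return n_states / (1.0 - self_loop_prob)
```

For example, a 3-state model with self-loop probability 0.6 enforces a minimum of 3 frames (30 ms at a 10 ms frame rate) and an expected duration of 7.5 frames.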
Viktoria Maier is about to leave the HOARSE network and continue her doctoral work at Sheffield, under different funding.
Also at IDIAP, doctoral researcher PETR SVOJANOVSKY is extending the TRAP-TANDEM model proposed by IDIAP. The main effort is directed towards universal classifiers of frequency-localized patterns, extending (Hermansky and Jain, Eurospeech 2003). Recently, an interesting and apparently effective method has emerged from Svojanovsky’s work: training a classifier on a particular frequency band and applying it at all other frequencies. The HOARSE task involved here is 4.3, Advanced ASR algorithms.
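In TRAP-style processing, the classifier input is not a spectral slice but the long temporal trajectory of a single band's energy. A minimal sketch of such a pattern extractor follows (the window length and normalisation are typical choices, not necessarily those used at IDIAP); the band-universal idea then amounts to applying a classifier trained on patterns from one band to the normalised patterns of any other band:

```python
import numpy as np

def trap_pattern(log_band_energy, center, half_span=50):
    """TRAP-style temporal pattern: the trajectory of one frequency
    band's log energy around the current frame (101 frames, roughly 1 s
    at a 10 ms frame rate), mean/variance normalised before it is
    passed to the band classifier."""
    assert half_span <= center < len(log_band_energy) - half_span
    traj = log_band_energy[center - half_span : center + half_span + 1]
    return (traj - traj.mean()) / (traj.std() + 1e-9)
```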
Svojanovsky was also involved in ASR experiments with nonsense syllables. This database in principle allows the evaluation of an automatic recogniser independently of any language-level constraints. This work contributes to HOARSE Task 5.1, Speech recognition evaluation in multi-speaker conditions.
Doctoral researcher GUILLAUME LATHOUD is working under Task 5.2 (Signal and speech detection in sound mixtures) on overlaps between speakers. Previously proposed microphone array-based speaker segmentation methods were extended into a generic short-term segmentation/tracking framework [Lathoud et al. 04] that successfully copes with an unknown number of speakers and unknown speaker locations. An audiovisual database called AV16.3 is now accessible online [Lathoud et al. 04], including a variety of multi-speaker cases, 3D location annotation and some speech/silence segmentation annotation. Recent work has focused on sector-based detection and localization of multiple sources [Lathoud et al. 04].
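Such array-based localization work typically builds on time-delay estimates between microphone pairs. Purely as an illustration of that building block (not Lathoud's sector-based algorithm itself), a GCC-PHAT time-difference-of-arrival estimator can be sketched as:

```python
import numpy as np

def gcc_phat(x, y, fs, max_tdoa=None):
    """Estimate the time difference of arrival between two microphone
    signals using PHAT-weighted generalised cross-correlation."""
    n = 2 * max(len(x), len(y))
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12                     # PHAT weighting
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))   # centre zero lag
    lags = np.arange(-n // 2, n // 2)
    if max_tdoa is not None:                           # physical limit
        keep = np.abs(lags) <= int(max_tdoa * fs)
        lags, cc = lags[keep], cc[keep]
    return lags[int(np.argmax(cc))] / fs
```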
At Liverpool, the work of doctoral researcher ELVIRA PEREZ has concentrated on Task 1.3, Active/passive speech perception. After a year on a Fulbright fellowship, she has now returned to Liverpool. Two sets of experiments were conducted to test whether listeners actively predict the temporal or spectral nature of masking sounds. The experiments evaluated speech intelligibility in two contexts:
- regularly spaced and randomly spaced noise bursts (to test temporal prediction)
- a predictable or an unpredictable frequency-modulated sinewave that could be integrated into the speech percept or heard as a separate sound (to test spectral prediction).
Both experiments confirm that our ability to segregate signals from maskers does not exploit (or rely on) regularity of the masker. A paper on this work is in preparation.
Also at Liverpool, post-doctoral researcher PATTI ADANK has worked on Task 1.4, Envelope information and binaural processing. Adank concentrated on the use of voice characteristics to help the segregation of simultaneous speakers. Previous work has shown that listeners are able to segregate spatially disparate signals much better when they are spoken by different speakers (Darwin and Hukin, 2000). We hypothesised that a two-stage process may first segregate the signals on F0 and then bind components together using cues such as speaker location or voice characteristics (cf. Darwin et al., 2003). Important voice characteristics are local amplitude modulation (flutter) and random F0 variation (jitter). We tested whether jitter can be used as a primary or secondary segregation cue, because previous modelling work (Ellis, 1993) has shown that jitter can be extracted by computational models and used for grouping. In a first experiment, synthetic vowel pairs were generated with a range of jitter and F0 values. We show that while F0 differences lead to improved recognition, manipulating F0 jitter does not, and we therefore conclude that jitter is not a primary grouping cue. In a second set of experiments, listeners were presented with sentences synthesised with pitch and jitter differences, to test whether jitter might aid stream formation. Again, our results show that introducing jitter does not aid the segregation of sentences. This leaves open the intriguing question of how speaker-specific information aids stream formation. A technical report on this work is available.
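For illustration, jitter of the kind manipulated in these stimuli can be introduced into a source-filter synthesiser by perturbing the length of each pitch period. The sketch below is our own, with an assumed Gaussian perturbation of a few percent; it generates a jittered excitation pulse train:

```python
import numpy as np

def jittered_pulse_train(f0, jitter_pct, duration, fs, seed=0):
    """Impulse train whose period is randomly perturbed cycle by cycle,
    a simple way to introduce F0 jitter into a source-filter synthesiser."""
    rng = np.random.default_rng(seed)
    out = np.zeros(int(duration * fs))
    t = 0.0
    while t < duration:
        out[int(t * fs)] = 1.0
        period = (1.0 / f0) * (1.0 + jitter_pct / 100.0 * rng.standard_normal())
        t += max(period, 0.25 / f0)   # guard against extreme draws
    return out
```

Filtering such a train through vowel formant resonators yields the kind of jittered synthetic vowels used in the first experiment.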
Other work at Liverpool has addressed Task 2.2, Reliability of auditory cues in multi-source scenarios. A key question for systems that have to integrate multiple cues is how to combine and weight the different cues that are available. At Liverpool, a range of experiments examining combination rules for low-level auditory and visual motion signals was carried out. Three models of cue integration were formalised: independent decisions, probability summation (i.e. independent local decisions) and linear summation (i.e. direct integration of the signals before decisions are made). Results show that human observers use probability summation for signals that are not ecologically plausible and linear summation for signals that are ecologically plausible. The work was presented at ICA 2004, Kyoto. A paper on this topic has been accepted for publication.
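In their standard signal-detection forms (which may differ in detail from the Liverpool formalisation), the two main combination rules can be written down directly; the sketch below assumes equal-variance Gaussian channels:

```python
from math import sqrt
from statistics import NormalDist

def p_detect(dprime, criterion=0.0):
    """Hit probability for one channel under equal-variance SDT."""
    return 1.0 - NormalDist().cdf(criterion - dprime)

def probability_summation(p1, p2):
    """Independent local decisions: detect if either channel detects."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

def linear_summation_dprime(d1, d2):
    """Signals added before a single decision; with independent
    equal-variance noise the combined sensitivity is (d1 + d2) / sqrt(2)."""
    return (d1 + d2) / sqrt(2)
```

With two equally sensitive channels, linear summation yields a combined d' of sqrt(2) times the single-channel value, generally exceeding the gain available from probability summation.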
Liverpool are collaborating with Bochum and Sheffield in Task 4.2, Informing speech recognition. Liverpool carried out initial studies that use linear prediction of the energy in 32 channels of an auditory filterbank to predict noise spectra from past data. The results, based on the AURORA noises, show that short-term prediction should lead to much better noise estimates than measures such as the long-term average. The gains are larger for non-stationary noises than for stationary noises, because for stationary noises the long-term average is already a good predictor. The current aim is to record a database of typical environmental noises so as to evaluate the system on a reasonable sample of sounds. With help from Bochum, Liverpool built a set of in-ear microphones that can be used with a DAT recorder to record binaural environmental sounds, and they are now collaborating with Sheffield to make the recordings.
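A minimal sketch of such per-channel short-term prediction (our illustration; the AR order and least-squares fit are assumed choices) fits an autoregressive model to each channel's recent energy trajectory and extrapolates one frame ahead:

```python
import numpy as np

def fit_ar(energy, order):
    """Least-squares AR fit to one channel's energy trajectory
    (1-D array): e[t] ~ a1*e[t-1] + ... + ap*e[t-p]."""
    X = np.column_stack([energy[order - k - 1 : len(energy) - k - 1]
                         for k in range(order)])
    y = energy[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(energy, coeffs):
    """One-step-ahead prediction from the most recent samples."""
    p = len(coeffs)
    return float(np.dot(coeffs, energy[-1 : -p - 1 : -1]))
```

Running this independently in each of the 32 filterbank channels gives a short-term noise spectrum estimate to compare against the long-term average.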
The team at Patras is engaged on several HOARSE tasks. The post-doctoral researcher involved is JOHN WORLEY (previously at Bochum).
Task 2.3: Perceptual models of room reverberation with application to speech recognition: Work has been performed based on smoothed room response measurements. The tests have illustrated some novel aspects of response measurements when employed for real-time room acoustics compensation, as well as the robustness of the method based on smoothed room responses. This work forms the starting point for the further tests described under Task 2.4.
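The smoothing operation referred to here reduces a measured response's fine structure in a frequency-dependent way before a compensation filter is derived. Purely as an illustration of the idea (not the Patras complex-smoothing algorithm itself), a fractional-octave style moving average over a complex frequency response might look like:

```python
import numpy as np

def complex_smooth(h, frac=1/3):
    """Fractional-octave style smoothing of a complex frequency response:
    each bin is averaged over a window whose width grows with frequency."""
    H = np.fft.rfft(h)
    out = np.empty_like(H)
    for k in range(len(H)):
        # half-width of a frac-octave band centred on bin k
        half = max(1, int(k * (2 ** (frac / 2) - 2 ** (-frac / 2)) / 2))
        lo, hi = max(0, k - half), min(len(H), k + half + 1)
        out[k] = H[lo:hi].mean()
    return out
```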
Task 2.4: Speech enhancement for reverberant environments: John Worley has designed an experiment that tests the spatial quality and sound efficacy of a complex smoothed room response filter. The initial stage of the experiment has been completed; this involved building two graphical user interfaces to obtain subjective data on various aspects of spatial quality (source width, envelopment and room size) and sound quality (phase clarity, spectral balance, loudness and overall sound quality). The testing will reveal the factors that listeners consider important when assessing the reverberation characteristics of a room. Some work is also in progress on the use of beamforming arrays for speech enhancement and ASR tasks.
Pursuing Task 2.1, Researching the precedence effect, Worley travelled to Bochum to test subjects on the Franssen illusion in different sized rooms and with a range of onset transitions. He completed three experiments in Bochum, which show that the traditional illusion breaks down when it is performed in a large hall. The preliminary conclusion from this work is that, for the precedence effect to operate, the secondary signal in the Franssen illusion must not become active until the listener has received the reflections within the room. Then, congruent with the ‘plausibility hypothesis’, the secondary signal will be perceived as a reflection and the illusion will operate.
A.2 Joint Publications and Patents
Publications
IDIAP and USFD
- Andrew C. Morris, Viktoria Maier and Phil Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition”, in International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, 2004
HUT and USFD
- Palomäki, K., Brown, G., and Barker, J., “Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition”, Speech Communication, vol. 43, no. 1-2, pp. 123-142, 2004
- Palomäki, K., Brown, G., and Wang, D., “A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation”, Speech Communication, 2004, in press
HUT and RUB
- Merimaa, J. and Pulkki, V., “Perceptually-based processing of directional room responses for multichannel loudspeaker reproduction”, Proc. IEEE WASPAA, New Paltz, NY, USA, 2003, pp. 51-54
- Pulkki, V., Merimaa, J., and Lokki, T., “Multi-channel reproduction of measured room responses”, Proc. 18th International Congress on Acoustics, Kyoto, Japan, 2004, pp. II 1273-1276
- Pulkki, V., Merimaa, J., and Lokki, T., “Reproduction of reverberation with Spatial Impulse Response Rendering”, AES 116th Convention, Berlin, Germany, 2004, Preprint 6057
- Merimaa, J. and Pulkki, V., “Spatial Impulse Response Rendering”, Proc. 7th International Conference on Digital Audio Effects (DAFx'04), Naples, Italy, 2004 (invited paper)
Patents
HUT and RUB
- Pulkki, V., Merimaa, J., and Lokki, T., “A Method for Reproducing Natural or Modified Spatial Impression in Multichannel Listening”, international patent application, filed March 2004
Part B - Comparison with the Joint Programme of Work (Annex I of the contract)
B.1 Research Objectives
The research objectives, as set down in Annex I of the contract, are still relevant and achievable. There are inevitable shifts in perspective and emphasis, to reflect scientific progress and the expertise and interests of the young researchers we have recruited.
B.2 Research Method
There were no additions to our methodological toolkit during the reporting period.
B.3 Work Plan
B3.1 Breakdown of tasks
We have made no changes to the task structure since year 1, though we recognise that the following table is looking somewhat dated.
B3.2 Schedule and milestones: Table 1
Note that here we are reporting on the work of the HOARSE teams, rather than the work of the young researchers alone.
Task / Title / Lead Partner / 12 Month Milestone / 24 Month Milestone / Comments
1.1 / Neural oscillators for auditory scene analysis / USFD / Multiple F0s using harmonic cancellation; initial implementation of binaural grouping / F0 tracking using continuity constraints / Multiple F0 work published [Wu, Wang & Brown 03]. Work on auditory selective attention published in IEEE Trans. Neural Networks.
1.2 / Modelling grouping integration by multisource decoding / USFD / Incorporation of noise estimation into oscillator-based grouping / Mask-level integration / Multisource decoding theory journal article published in Speech Communication.
1.3 / Active/passive speech perception / Liverpool / Planning experiments / Experiments conducted / Experiments conducted.
1.4 / Envelope information and binaural processing / Liverpool / Preliminary experiments / Experiments and analysis / Experiments conducted.
1.5 / Auditory scene analysis in music / USFD / F0 estimation / Development of a two-stage (lower and cognitive) precedence effect model / Second system completed.
2.1 / Researching the precedence effect / RUB / Psychoacoustic experiments on the precedence effect in realistic scenarios / Development of a localisation model using an automatic weighting function for binaural cues / Model completed [Faller & Merimaa 04]. Further psychoacoustical experiments being conducted. Some work at Patras, in conjunction with Bochum, on the relationship between the precedence effect and the Franssen illusion.
2.2 / Reliability of auditory cues in multi-source scenarios / RUB / The importance of single binaural cues in various multi-source environments determined in psychoacoustic experiments / Extension to multiple sources and practical room conditions / Completed [Braasch 03], [Braasch et al. 03], [Braasch & Blauert 03]. Research at RUB extended to spatial impression and the separation of binaural cues into source-related and room-related.
2.3 / Perceptual models of room reverberation with application to speech recognition / Patras / Integrated response/signal perceptual model for a single source in reverberant environments / Extension to multiple sources / Significant part of the work completed.
2.4 / Speech enhancement for reverberant environments / Patras / Research into auto-directive arrays, controlled from the perceptual directivity module / Development of new parameterisation techniques for the voice source / Some work completed (test interfaces ready), to be supplemented by subjective tests. Missing data techniques for handling reverberation developed at Sheffield.
3.1 / Glottal excitation estimation / HUT / Research on combining new AR (autoregressive) models with inverse filtering / Inverse filtering experiments on intensity regulation of speech with soft and extremely loud voices / On schedule.
3.2 / Voice production studies / HUT / Inverse filtering experiments on high-pitched voices / Research on the relationship between the main effects of the glottal flow (fundamental frequency, phonation type, etc.) and brain functions using MEG / On schedule.
3.3 / Voice production and cortical speech processing / HUT / Development of DSP algorithms for parameterisation of the voice source; familiarisation with MEG / - / Ongoing.
4.1 / Developments in multisource decoding / USFD / Probabilistic decoding constraints / Design of predictive noise estimation algorithms; known BSS algorithms adopted as a common base for evaluation / Probabilistic decoding implemented in current software. Adaptive noise estimation implemented in multisource models.
4.2 / Informing speech recognition / Liverpool / Design of predictive noise estimation algorithms; known BSS algorithms adopted as a common base for evaluation / HMM2 & DBM adaptation / Work at DCAG and IDIAP.
4.3 / Advanced ASR algorithms / IDIAP / Multistream adaptation / Assessment report 1; targets for assessment report 2 / Work reported on this task in Eurospeech 03, IEEE ASRU 03.
5.1 / Speech recognition evaluation in multi-speaker conditions / DCAG / Database specification; targets for assessment report 1 / First recognition test in multi-speaker environment using separation algorithms (BSS and beamforming) / -
5.2 / Signal and speech detection in sound mixtures / IDIAP / Analysis of auditory cues / ASR performance for simulated deteriorated speech tested / Work reported: [Ajmera et al. 2003], [Lathoud et al. 2003].
5.3 / Speech technology assessment by simulated acoustic environments / RUB / Simulation environment for hands-free communication developed / - / Completed and integrated into IKA telephone line simulation tool. ASR, speaker recognition, and speech synthesis assessment experiments carried out.
B3.3 Research effort in the reporting period: Table 2
Participant / Young researchers financed by the contract (person-months) / Researchers financed from other sources (person-months) / Researchers contributing to the project (number of individuals)
1. USFD / 12 / 48 / 1 YR + 5 others = 6
2. RUB / 15.5 / 30 / 2 YRs + 4 others = 6
3. DCAG / 12 / 24 / 1 YR + 2 others = 3
4. HUT / 12 / 12 / 1 YR + 2 others = 3
5. IDIAP / 23.5 / 24 / 3 YRs + 3 others = 6
6. LIVERPOOL / 12 / 24 / 2 YRs + 2 others = 4
7. PATRAS / 7 / 6 / 1 YR + 1 other = 2
Totals / 94 / 168 / 11 YRs + 19 others = 30
B.4 Organisation and Management
B4.1 Organisation and management
HOARSE is being managed in the way described in Annex 1 of the contract. The non-executive director is Dr. Jordan Cohen of VoiceSignal Inc., Boston, MA. Administration is being handled at USFD by Gillian Callaghan ().