Deliverable 2.2: Report on Technical Requirements

0  Introduction

This report is the result of work carried out as part of Work Package 2 as described in the proposal for the EU Lifelong Learning Project DigLin (the Digital Literacy Instructor).

1  AUTOMATIC SPEECH RECOGNITION SYSTEM ARCHITECTURE

Feedback on reading aloud will be integrated into the existing FC-Sprint2 system (See deliverable X) using the architecture depicted in Figure 1.

Figure 1. Integrating feedback on reading aloud in the FC-Sprint platform.

1.1  CLIENT-SIDE

The user interacts with the existing FC-Sprint2 platform through the mouse and keyboard (see Figure 1). For starting/stopping audio recording and passing data relevant to the current exercise, the FC-Sprint2 program communicates with the ASR (Automatic Speech Recognition) Client Module (implemented in JavaScript). This ASR Client is also responsible for recording audio and communicating with a server, on which the actual speech processing is performed. The API (Application Programming Interface) of the client module (offered to FC-Sprint2) has yet to be defined.
In FC-Sprint2, the recording of audio input is started by clicking a button and stopped automatically after the word has been spoken. Because most web browsers do not yet offer native audio recording, we will use Flash to implement the audio recorder. For this reason, the ASR functionality in the application cannot currently be used on a mobile device. Once native audio recording is supported in mobile browsers, the only component of this architecture that will need to be changed to make the application compatible with mobile devices is the audio recorder. For connectivity to the (proxy) server, we will use WebSockets, as this will almost certainly become the standard for client-server communication in web applications.

1.2  SERVER-SIDE

The proxy server receives the ASR tasks and the audio. As shown in Figure 1, these data are sent to an instance of a speech processor. During a session, a speech processor is assigned to each user. Once the audio has been processed, the resulting feedback and links to an MP3 and an OGG version of the audio are sent back to FC-Sprint2 (through the proxy server and the ASR Client Module).
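As a rough illustration of this flow, the sketch below shows a proxy server that accepts a WebSocket connection, reads the exercise data and the audio, hands them to a per-user speech processor, and returns the feedback. It assumes Python with the websockets package on the server side; the module speech_processor and its get_instance() and process() functions are hypothetical placeholders, since the actual server implementation has yet to be decided.

# Minimal sketch of the proxy server: it accepts a WebSocket connection from
# the ASR Client Module, reads the task description and the audio, forwards
# them to a per-user speech processor, and returns the feedback.
# The module `speech_processor` and its functions are hypothetical placeholders.

import asyncio
import json
import websockets

import speech_processor  # hypothetical wrapper around the ASR back end


async def handle_client(websocket):
    # First message: JSON with the exercise data (target word, language, ...).
    task = json.loads(await websocket.recv())

    # Second message: the recorded audio as a binary blob.
    audio = await websocket.recv()

    # One speech-processor instance is assigned to the user for the session.
    processor = speech_processor.get_instance(task["user_id"])
    feedback = processor.process(task, audio)

    # Send the feedback (score, most serious error, links to the MP3/OGG
    # recordings) back to the ASR Client Module, which passes it on to FC-Sprint2.
    await websocket.send(json.dumps(feedback))


async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run until interrupted


if __name__ == "__main__":
    asyncio.run(main())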

2  SPEECH PROCESSOR

The speech processor is the system component that calculates the feedback for the user, based on the recorded audio output from the user. Currently, we envisage implementing two types of feedback:

·  A score between 0 and 100 reflecting the ‘accentedness’ or ‘intelligibility’ of the word. The meaning of this score has yet to be defined by the consortium (see Section 2.1).

·  The most serious reading error, mapped to the corresponding grapheme. Within the consortium, we still have to define what ‘most serious’ means (see Section 2.2).

We will now describe possibilities for implementing these types of feedback.
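Purely as an illustration, the feedback returned to FC-Sprint2 could be represented by a small structure such as the one sketched below; all field names are assumptions and still have to be agreed on within the consortium.

# Illustrative sketch of a feedback message; all field names are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadingFeedback:
    score: int                           # 0-100 'accentedness'/'intelligibility' score
    worst_error_grapheme: Optional[str]  # grapheme linked to the most serious error, if any
    worst_error_realized: Optional[str]  # phone(s) the learner actually produced
    audio_mp3_url: str                   # link to the MP3 version of the recording
    audio_ogg_url: str                   # link to the OGG version of the recording


example = ReadingFeedback(
    score=72,
    worst_error_grapheme="a",
    worst_error_realized="ɛ",
    audio_mp3_url="https://example.org/rec/123.mp3",
    audio_ogg_url="https://example.org/rec/123.ogg",
)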

2.1  READING ALOUD SCORE

There are several ways in which a reading score for a spoken word or sentence can be calculated. We have broken this down into two steps:

  1. Determine the sequence of phones that was uttered by the learner.
  2. Calculate the distance between the canonical sequence of phones and the realized sequence of phones.

For example, if the user has to read aloud the English word ‘map’, which has the canonical phonetic transcription /mæp/, but the learner realizes the word as [mɛpə] (as determined in step 1), we calculate the distance between /mæp/ and [mɛpə] to obtain a quality score for how the word was read aloud (this can easily be generalized to sentences).
We will now discuss the two steps in more detail.

Step 1: Phone Recognition

A phone recognition system uses two main resources:

Acoustic Models: Statistical representations of the sounds that make up words. In our case, these models are language-dependent and trained on a speech database. For example, we will have a model that represents how the /tʃ/ ‘sounds’ in English.

Language Model: A statistical model that represents the probability of a sequence of words or sounds. For example, the sequence ‘I ate a cherry’ is more likely than ‘Eye eight uh cherry’ (given approximately the same acoustic observations), and this can be encoded in the language model. This kind of information can also help in recognizing the sequence of phones (instead of words) that was spoken. For example, the sequence /tʃ æ/ is more likely to occur in English than /tʃ ʒ/. The probabilities of these phone sequences can be interpreted as phonotactic constraints imposed on the recognition process.
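As a small illustration of how such phonotactic constraints could be encoded, the sketch below estimates bigram phone probabilities from counts; the counts are invented for the example, and any real model would be trained on a transcribed corpus.

# Toy bigram phone model: P(next_phone | previous_phone) estimated from counts.
# The counts are invented purely for illustration.
from collections import defaultdict

bigram_counts = {
    ("tʃ", "æ"): 120,
    ("tʃ", "ʒ"): 1,
    ("m", "æ"): 300,
    ("æ", "p"): 250,
}

totals = defaultdict(int)
for (prev, _), count in bigram_counts.items():
    totals[prev] += count


def bigram_prob(prev, nxt):
    """P(nxt | prev) with a small floor so unseen pairs are unlikely but not impossible."""
    return max(bigram_counts.get((prev, nxt), 0), 0.1) / (totals[prev] + 0.1)


# /tʃ æ/ comes out far more probable than /tʃ ʒ/, as expected for English.
print(bigram_prob("tʃ", "æ"), bigram_prob("tʃ", "ʒ"))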

The output of the phone recognizer is a sequence of phones. To determine how much this sequence differs from the canonical form of the word/sentence, a distance measure needs to be defined. We will have to experiment with (parameters of) different acoustic and language models to optimize this step.

Step 2: Distance calculation
A distance measure between the two phone strings can be implemented using the Levenshtein distance. The Levenshtein distance between two strings is the minimum number of edit operations (insertions, deletions and substitutions) required to change one string into the other.
In our example, alignment of the phone strings gives:
TARGET m æ p -
REALIZED m ɛ p ə

The corresponding edit operations are:
- Substitution of /æ/ with /ɛ/
- Insertion of /ə/
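A minimal sketch of this alignment, assuming plain (unweighted) edit costs, is given below; it reproduces the substitution and insertion of the /mæp/ versus [mɛpə] example. The weighted penalties discussed next can later replace the fixed cost of 1.

# Unweighted Levenshtein distance between two phone sequences, with a
# traceback that lists the edit operations (minimal sketch).

def levenshtein(target, realized):
    n, m = len(target), len(realized)
    # dist[i][j] = edit distance between target[:i] and realized[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if target[i - 1] == realized[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match

    # Trace back one optimal alignment and collect the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        diag_cost = 0 if (i > 0 and j > 0 and target[i - 1] == realized[j - 1]) else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diag_cost:
            if target[i - 1] != realized[j - 1]:
                ops.append(f"substitute /{target[i - 1]}/ with /{realized[j - 1]}/")
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(f"insert /{realized[j - 1]}/")
            j -= 1
        else:
            ops.append(f"delete /{target[i - 1]}/")
            i -= 1
    return dist[n][m], list(reversed(ops))


print(levenshtein(["m", "æ", "p"], ["m", "ɛ", "p", "ə"]))
# (2, ['substitute /æ/ with /ɛ/', 'insert /ə/'])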
Now the question arises how we should quantify the penalties for these ‘errors’. Penalties will have to be defined for all possible substitutions, insertions and deletions in all four languages: Dutch, English, Finnish and German. An automatic way of defining penalties is to calculate the Kullback-Leibler divergence between the acoustic models. The Kullback-Leibler divergence is a measure of the difference between two probability distributions. Applied to acoustic models, this measure can be interpreted as the ‘acoustic distance’ between the different phones. For example, Figure 2A depicts these distances in two dimensions using multidimensional scaling (for Dutch vowels). As can be seen, the relations between the vowels are roughly similar to those observed in charts based on formant measurements (see Figure 2B).


Figure 2. (A) 2-Dimensional projection of Kullback-Leibler divergences between acoustic models of Dutch vowels. The Kullback-Leibler divergences are estimated through Monte-Carlo sampling. (B) Vowel chart based on formant measurements in [Adank].
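To illustrate the principle behind Figure 2A, the sketch below estimates the Kullback-Leibler divergence between two acoustic models by Monte-Carlo sampling. For simplicity each model is reduced to a single multivariate Gaussian (real acoustic models typically consist of mixtures of Gaussians per HMM state), and the example distributions are invented.

# Monte-Carlo estimate of KL(p || q) between two (simplified) acoustic models,
# here modelled as single multivariate Gaussians over, e.g., MFCC vectors.
# This is only a sketch of the principle, not the actual acoustic models.
import numpy as np
from scipy.stats import multivariate_normal


def kl_monte_carlo(p, q, n_samples=100_000, seed=None):
    """Estimate KL(p || q) = E_p[log p(x) - log q(x)] by sampling from p."""
    rng = np.random.default_rng(seed)
    samples = p.rvs(size=n_samples, random_state=rng)
    return np.mean(p.logpdf(samples) - q.logpdf(samples))


# Two invented 2-dimensional 'acoustic models' standing in for two vowels.
vowel_a = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.2], [0.2, 1.0]])
vowel_b = multivariate_normal(mean=[1.5, 0.5], cov=[[1.2, 0.0], [0.0, 0.8]])

# A symmetrised version can serve as the 'acoustic distance' between the phones.
distance = 0.5 * (kl_monte_carlo(vowel_a, vowel_b) + kl_monte_carlo(vowel_b, vowel_a))
print(distance)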
Such an acoustic distance can also be defined in other ways, for example based on the results of perception experiments or by using phonological features. An example of a phonological feature table is shown in Figure 3.

Figure 3. Phonological feature table from Cucchiarini (1993)


Using this feature table, several heuristics could be applied to determine the difference between two phonemes. One approach is to simply calculate the proportion of features that differ between the phonemes. For example, for /æ/ and /ɛ/ this would be 2/8 = 0.25. It is not obvious which approach would lead to the best or most ‘intuitive’ results. The proposal in this section is still under discussion within the consortium.
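A sketch of this heuristic is given below. The feature values are a small invented subset, chosen so that /æ/ and /ɛ/ differ on 2 of 8 features as in the example above; the actual table (cf. Figure 3) would cover all phones of each language.

# Proportion of differing phonological features as a phone distance.
# The feature values below are an invented subset for illustration only.

FEATURES = ["high", "mid", "low", "front", "back", "round", "long", "nasal"]

PHONE_FEATURES = {
    "æ":  {"high": 0, "mid": 0, "low": 1, "front": 1, "back": 0, "round": 0, "long": 0, "nasal": 0},
    "ɛ":  {"high": 0, "mid": 1, "low": 0, "front": 1, "back": 0, "round": 0, "long": 0, "nasal": 0},
    "ɑː": {"high": 0, "mid": 0, "low": 1, "front": 0, "back": 1, "round": 0, "long": 1, "nasal": 0},
}


def feature_distance(phone_a, phone_b):
    """Fraction of features on which the two phones differ (0 = identical)."""
    a, b = PHONE_FEATURES[phone_a], PHONE_FEATURES[phone_b]
    differing = sum(1 for f in FEATURES if a[f] != b[f])
    return differing / len(FEATURES)


print(feature_distance("æ", "ɛ"))  # 0.25, i.e. 2 of the 8 features differ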

2.2  MOST SERIOUS READING ALOUD ERROR

To give feedback on the most serious reading aloud error, we need to carry out two tasks:

1.  detect all relevant reading errors and

2.  select the most serious one.

The reading errors will be detected by employing Finite State Transducer (FST) language models that encode the errors. For example, for the word ‘map’ (/mæp/), where we want to give specific feedback on errors involving the /æ/ sound, which learners often mispronounce as /ɛ/ or /ɑː/, the resulting FST would be:

Figure 4. Finite State Transducer for the word "map"

With this type of recognition setup, the recognizer is forced to choose a path through the language model. The reason for not detecting errors with free phone recognition is that phone recognition is too error-prone to provide feedback on specific sounds. Note that in this FST approach, reading errors need to be predicted beforehand. The most serious error can then be selected by choosing the one with the largest penalty, as defined in Section 2.1.
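The sketch below mimics this forced choice with a toy lattice for ‘map’: each position lists the canonical phone plus the predicted errors, mapped to the grapheme they belong to. In the real system this would be an FST used directly by the recognizer, and the acoustic scores in the example are invented placeholders for the likelihoods produced by the acoustic models.

# Toy stand-in for the error-encoding FST for the word 'map' (/mæp/):
# at each position the lattice lists the canonical phone plus the predicted
# reading errors, mapped to the grapheme they belong to.

LATTICE_MAP = [
    {"grapheme": "m", "alternatives": ["m"]},
    {"grapheme": "a", "alternatives": ["æ", "ɛ", "ɑː"]},  # /æ/ canonical; /ɛ/, /ɑː/ predicted errors
    {"grapheme": "p", "alternatives": ["p"]},
]

CANONICAL = ["m", "æ", "p"]


def decode(lattice, acoustic_score):
    """Force a choice of one alternative per position, as the FST forces a path.

    `acoustic_score(position, phone)` is a hypothetical placeholder for the
    likelihood the acoustic models assign to that phone at that position.
    """
    path = []
    for position, slot in enumerate(lattice):
        best = max(slot["alternatives"],
                   key=lambda phone: acoustic_score(position, phone))
        path.append(best)
    return path


def detected_errors(path, lattice, canonical):
    """Map every non-canonical choice to the grapheme it belongs to."""
    return [(slot["grapheme"], realized)
            for slot, target, realized in zip(lattice, canonical, path)
            if realized != target]


# Invented scores in which the learner's /æ/ sounds most like /ɛ/.
fake_scores = {(1, "æ"): 0.2, (1, "ɛ"): 0.7, (1, "ɑː"): 0.1}
path = decode(LATTICE_MAP, lambda pos, ph: fake_scores.get((pos, ph), 1.0))
print(detected_errors(path, LATTICE_MAP, CANONICAL))
# [('a', 'ɛ')] -> feedback: the grapheme 'a' was read as /ɛ/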
