Speechdat(M) Spanish Database

Authors: Asunción Moreno / Richard Winski
Institute: Universitat Politècnica de Catalunya / Vocalis Ltd
Address (Universitat Politècnica de Catalunya): Dept Teoria de la Señal y Comunicaciones, Grupo de procesado de voz, c/ Gran Capitán s/n, Módulo D-5, 08034 Barcelona
Address (Vocalis Ltd): Chaston House, Mill Court, Great Shelford, Cambridge, UK, CB2 5LD
Date: July 1996
Version: Final

CONTENTS

1. Speechdat Spanish Database

1.1 Introduction

1.1.1 Speech File Format

1.1.2 File nomenclature

1.1.3 Directory structure

1.2 Database design and collection

1.2.1 Recording site and platform

1.2.2 Speaker recruitment

1.2.3 Design of prompting and prompt-sheet

1.2.4 Transcription

1.3 Database contents definition

1.3.1 Isolated digits

1.3.2 Connected digits

1.3.3 Telephone numbers

1.3.4 Number strings

1.3.5 Money amounts

1.3.6 Transcription of Digits, Numbers and Amounts

1.3.7 Spellings

1.3.8 Times

1.3.9 Dates

1.3.10 Yes/no questions

1.3.11 Province names

1.3.12 Phonetically rich sentences

1.3.13 Application words

1.3.14 Additional application words

1.3.15 Application phrases

1.4 Deviations from Speechdat specification

1.5 Demographic information

1.5.1 Regions of Spain

1.5.2 Speaker demographics

1.6 Sample Prompt sheets

1.6.1 Sample instruction sheet and prompt sheet

1.6.2 Sample information and instruction sheet (English translation)

1.6.3 Call server prompt scripts

1.6.4 Algorithm to generate the sheet table

1. Speechdat Spanish Database

1.1 Introduction

The SpeechDat(M) Spanish database comprises telephone recordings from 1002 speakers recorded directly over the fixed PSTN using an E-1 interface at the recording site. There is also a pronunciation dictionary for the correctly spoken items. It was produced by a collaboration involving Vocalis Ltd and Universitat Politècnica de Catalunya (UPC) during the EU MLAP project LRE-63314 SpeechDat. Vocalis had responsibilities for the general Speechdat specification, for the recording site, platform and tools, and overall database production and coordination. UPC was responsible for the detailed contents design, speaker selection and coordination, pronunciation dictionary and orthographic transcription of the utterances, and documentation.

Unless otherwise specified the database conforms closely to the project specification document (deliverable D1.4.1) of which the relevant general specification has been extracted and included in the distribution (as file DESIGN.DOC). In addition this document supplements the general specification with material specific to the Spanish design, and also includes some notes related to the actual collection. A number of minor deviations from the general specification are described separately in section 1.4.

The database is available on 3 CD-ROM disks in ISO 9660 format. The three CD-ROM volumes are structured as follows:

CD00 - female speakers, all recorded material

CD01 - male speakers, all recorded material

CD02 - all speakers, phonetically rich sentence material only

The precise details of the distribution disks and directories are contained in the README.TXT file stored in each CD-ROM. Further details regarding the database contents, files and directories are provided in the documentation files and in the DOC, TABLE, INDEX and PROMPT directories.

1.1.1 Speech File Format

Speech files follow the ESPRIT Project SAM standards for speech file storage. Each file is a plain sequence of 8-bit, 8 kHz A-law speech samples (before compression), and each prompted utterance is stored in a separate file. Speech signal files have no header; instead, each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information. Full details of file formats are contained in DESIGN.DOC.

Version 1.2.4 of the GNU gzip compression and gunzip decompression software has been used to perform lossless compression of the speech signal files. This has been shown in tests to give reductions of about 45% when applied to the Spanish data. This software is available from ftp://src.doc.ic.ac.uk/packages/gnu/gzip-1.2.4.tar or from the Free Software Foundation Inc., 675 Mass Ave, Cambridge, MA 02139, USA. There is also an MS-DOS version of this software at this site, and it is available to Mac users also. Gzip can also be used for compression of general-purpose files and images.

Software routines to access speech files and perform decompression have been developed in the project. Current versions of these libraries can be accessed at ftp://pitch.phon.ucl.ac.uk/pub/sam/samlib04.tar.gz (for UNIX) and .../samlib04.zip (for DOS).

1.1.2 File nomenclature

File names follow the ISO 9660 file name convention (8+3 characters), in line with the main CD-ROM standard. The following template is used:

DD NNNN CC. LL F

where:

DD / Database identification code (00-ZZ). For SpeechDat(M): A0=fixed network recordings
NNNN / Recording session progressive number (0000-9999)
CC / Corpus code (00-ZZ) obtained by collating the corpus and the item identifiers
LL / Two letter ISO 639 language code (ES=Spanish)
F / File type code: O=Orthographic label file, Z=compressed speech file

Table A - SpeechDat filename convention

Because it is useful to be able to identify the contents of a speech file from its filename, the corpus code is specified as a one-letter corpus identifier followed by a one-character item identifier, as listed in the following table. All items are read, unless marked as spontaneous.

Corpus identifier / Item identifier / Corpus contents
I / 1 / 1 isolated digit
C / 1-4 / 4 connected digits and numbers:
C / 1 / 4 digit id/sheet number
C / 2 / 9 digit telephone number
C / 3 / 16 digit credit card number
C / 4 / home telephone number (spont.)
N / 1-2 / 2 natural numbers
N / 3 / 1 natural number with decimal point
M / 1-2 / 2 money amounts:
M / 1 / 1 large amount
M / 2 / 1 small amount
L / 1-3 / 3 spelled words (7 letter sequences)
T / 1 / 1 time of day (spontaneous)
T / 2 / 1 time phrase (prompted, word style)
D / 1 / 1 date (spontaneous, the person's birthday)
D / 2-3 / 2 dates (prompted, word style)
Q / 1-3 / 3 yes/no questions:
Q / 1 / Are you calling from the same province? (as P1)
Q / 2 / Do you speak another language fluently?
Q / 3 / Are you calling from a public phonebox?
P / 1 / 1 place; province of longest residence
A / 1-6 / 6 application keywords (out of a vocabulary of 54 words)
A / 7-8 / 2 additional application keywords (out of a vocabulary of 18)
E / 1-3 / 3 embedded application word phrases (from A1-6 vocabulary)
S / 1-9 / 9 read sentences for phonetic coverage

Table B - SpeechDat corpus code convention
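The filename template and corpus code convention can be illustrated with a short Python sketch that splits a file name into its fields. The function and class names are hypothetical, and the example file name is constructed from the template for illustration only; it is not taken from the index files.

```python
import re
from typing import NamedTuple


class SpeechDatName(NamedTuple):
    database: str    # DD, e.g. "A0" for fixed network recordings
    session: int     # NNNN, recording session number
    corpus: str      # first character of CC, e.g. "S" for sentences
    item: str        # second character of CC, e.g. "9"
    language: str    # LL, ISO 639 code, e.g. "ES"
    file_type: str   # F, "O" = orthographic label file, "Z" = compressed speech


# DD NNNN CC . LL F, written without the spaces used in the template.
_NAME_PATTERN = re.compile(
    r"^([0-9A-Z]{2})(\d{4})([A-Z])([0-9A-Z])\.([A-Z]{2})([A-Z])$"
)


def parse_name(filename: str) -> SpeechDatName:
    """Split a SpeechDat file name into its template fields."""
    match = _NAME_PATTERN.match(filename.upper())
    if match is None:
        raise ValueError(f"not a SpeechDat file name: {filename!r}")
    database, session, corpus, item, language, file_type = match.groups()
    return SpeechDatName(database, int(session), corpus, item, language, file_type)
```

For example, `parse_name("A00002S9.ESZ")` yields database `A0`, session 2, corpus `S`, item `9`, language `ES` and file type `Z`, i.e. the compressed speech file of the ninth read sentence of session 0002.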

The proposed format uses mnemonic values. It permits selection of all files belonging to one of the twelve corpora with a single command (e.g. in DOS “dir /s/b ??????C*”, in UNIX “find . -name '??????C*' -print”).

1.1.3 Directory structure

The directory structure uses shallow nesting, with contiguous numbers identifying the individual sub-directories and call directories. The following four-level directory structure is defined:

\<database>\<volume>\<block>\<session>

Where:

<database> / Defined as: <name><#><language code>, i.e. FIXED0ES
where:
<name> is FIXED, indicating a fixed network database
<#> is 0 for SpeechDat(M)
<language code> is the two-letter ISO code, ES for Spanish
<volume> / Defined as: CD<nn>
where <nn> is a progressive number from 00 to 02 for SpeechDat(M), specifying the physical CD-ROM containing the material.
<block> / Defined as: BLOCK<nn>
where <nn> is a progressive number from 00 to 14. These numbers are the same as the first 2 digits of <nnnn> described below.
<session> / Defined as: SES<nnnn>
where <nnnn> is a progressive number in the range 0002 to 1419, being the numeric call identification number also encoded in each filename. As there are no more than 50 utterances per call, the total number of speech files and associated transcription files does not exceed the CD-ROM recommended limit of approximately 100 files in a directory.

Table C - SpeechDat directory structure
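A minimal Python sketch of how a session directory can be derived from this structure is given below. The function name is hypothetical. The volume cannot be derived from the session number, because sessions are split across volumes by speaker sex and material, so it is passed in as a parameter; the volume numbers in the usage example are arbitrary.

```python
from pathlib import PurePosixPath


def session_dir(session: int, volume: int, database: str = "FIXED0ES") -> PurePosixPath:
    """Build <database>/<volume>/<block>/<session> for one recording session."""
    if not 0 <= session <= 9999:
        raise ValueError("session number must fit in four digits")
    if not 0 <= volume <= 99:
        raise ValueError("volume number must fit in two digits")
    ses = f"{session:04d}"
    # The block number repeats the first two digits of the session number.
    return PurePosixPath(database, f"CD{volume:02d}", f"BLOCK{ses[:2]}", f"SES{ses}")
```

For instance, `session_dir(1419, 1)` gives `FIXED0ES/CD01/BLOCK14/SES1419`. On DOS the same components would be joined with backslashes.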

Both signal files and label files are put in the same directory.

All sessions have complete recordings for all prompted items with the following exceptions:

SES0066 has no utterance file or label file for Q3

SES0292 has no utterance file or label file for additional application words A7 and A8

In addition to the previous structure the following directories are used to store some other files:

\<database>\DOC / documentation files, including subword occurrence files
\<database>\TABLE / speaker and lexicon tables
\<database>\INDEX / index files - contents file
\<database>\PROMPT / prompt sheet tables
\<database>\SOURCE / source code for SAMLIB DOS/Unix file access routines

Table A - Non-speech-related directory structure

Finally the root directory contains three files:

a “README.TXT” ASCII file describing all files in the CD-ROM; signal and label files are reported by specifying their templates;
a “DISK.ID” ASCII file containing the volume name (11 characters long), e.g. “FIXED0ES_00”; it supplies the volume label to UNIX systems that are unable to read the physical volume label;
a “COPYRIGH.TXT” ASCII file to protect the authors’ rights.

All these support files are duplicated on each CD-ROM, except for the file contents index and summary files, which are stored separately on each disk.

1.2 Database design and collection

1.2.1 Recording site and platform

The recordings were made at the offices of Vocalis Ltd in Cambridge, UK. A primary rate E-1 DASS-2 circuit was installed at these premises for this task, and an international toll-free number from Spain was acquired which permitted callers to access the system transparently, without encountering English PSTN messages at any stage. For a 1000 speaker database, this method was more cost-effective than local recording. British Telecom provided full assurance that the integrity of the digital transmission was guaranteed from the point of presence in Spain and over the UK PSTN to the Cambridge site.

The recording platform was based on the company’s CallServer product. The host machine was a 486DX2 PC running SCO Unix with 16 MB RAM and 4 GB of hard disk storage. A Dialogic D81 telephony card and Aculab 1 E-1 interface card were used. Software was adapted from an existing analogue collection tool, based on a custom application generator. The Spanish speaker prompting scripts were provided by UPC and recorded in Cambridge. Maximum recording durations were set for each item according to expected durations, with a closing silence interval of 2-3 seconds otherwise. Eight channels were provided initially, and this was found to be adequate at all times. A secondary set of channels was used in the final stage of the collection to collect the last 50 or so calls required to complete the full 1000 speaker coverage. These calls were needed because the call completion rate (80%) was slightly lower than originally anticipated, which necessitated some additional speaker recruitment work.

1.2.2 Speaker recruitment

UPC recruited speakers from five dialectal areas. The expected number of calls from each dialectal area was defined taking linguistic and economic considerations into account. Speakers were recruited from several universities, and were mainly students and their relatives. One person at each site was in charge of the recruitment. The call was free. A prize was awarded to one of the callers, overseen by a lawyer. All speakers had to complete and sign a form which provided some identification details, and specifically provided for full ownership of the data for research and commercial purposes to UPC and their associates.

A participation rate of 47% was achieved (99% in some areas). In future it will be best to avoid collecting during holiday periods when using this method. The participation rate also shows that a 4 digit identification number is too small for a collection of 5000 calls: at 47% participation about 10,600 sheets would have to be distributed, more than the 10,000 numbers that a 4 digit identifier can provide.

1.2.3 Design of prompting and prompt-sheet

Prompt sheets are generated automatically from text files, as described in section 1.6.4.

A set of 8 spontaneous questions was included in the database. The name was an open question: it was considered inappropriate to prompt specifically for the full name for anonymity reasons, but callers were free to respond with their Christian name, family name(s) or both. A total of 2890 correct Christian names and family names were recorded. These are however withheld from the public distribution of the database.

The phonetic database was designed using sentences from texts slightly modified to get sentences that are meaningful, easy to pronounce and without foreign words. Interrogative sentences were not included because the intonation (their main characteristic in Spanish) is very poor when people read them. Each speaker utters a set of 9 sentences. The repetition of each set in the final collection is however not uniform owing to actual sheet distribution variation:

mean: 9 repetitions

median: 8 repetitions

maximum: 19 repetitions

minimum: 4 repetitions

Phonemic forms are obtained following the document:

'Spanish adaptation of SAMPA and automatic phonetic transcription'
SAM-A/UPC/001/V1 ESPRIT PROJECT 6819 (SAM-A)

The set of phonetically balanced sentences was automatically transcribed and manually checked by the Department de Filologia Espanyola of the Universitat Autonoma de Barcelona. Standard Castilian transcription was used. No dialectal variations were considered. The frequency of occurrence of monophones, biphones and triphones for the phonetically balanced sentences is provided in files in the DOC directory.

Spelling is rarely used in Spanish, and it will be best to avoid spontaneous spelling in Spanish. Spelling letters were generated at random to avoid syllabic pronunciation. Nevertheless, some readers tried to pronounce consonants and vowels together when possible.

In practice the identification number and the credit card number were pronounced either as natural numbers or digit by digit. Very often the credit card number was uttered as four natural numbers instead of 16 digits.

1.2.4 Transcription

Transcription was performed by native Spanish speakers under the supervision of UPC. A 2 day training period was given first, using material selected in advance because it contained many problematic transcriptions, during which transcribers were closely supervised. Once transcribers could transcribe this material without errors, they were permitted to continue without further supervision. Transcription followed the mandatory SpeechDat project conventions without variation; no optional transcription markings were used.

The annotation package was developed entirely at UPC. The package automatically converts numbers into literal form and speeds up the typing of texts, especially spontaneous telephone numbers. A number of problems encountered in the annotation process are described briefly here. In future it will be useful to define additional transcription rules to address these points.

Names and city names from bilingual areas are very often pronounced in another language (Basque, Catalan). This led to some problems in the orthographic transcription and particularly in the phonetic transcription, because some phonemes of these languages are not contained in the Spanish phoneme set.

Some transcription rules concerning mispronounced words, unintelligible phonemes in words or noises in words (especially those associated with the phoneme /s/) created ambiguities among transcribers. Additional validation work was done to eliminate such ambiguities.

Prosodic marks are optional and were not marked in this database. We consider, however, that pauses are important in the phonetic transcription in contexts where coarticulation normally occurs between words, and we intend to include optional pause marks in our future work.

Some prosodic information is contained in the pronunciation lexicon, which includes the use of the single quote symbol to indicate stressing and period to indicate syllable boundaries. This is according to the Speechdat convention.

The rate of phonetically balanced, non-truncated sentences without marked noises ([Speaker_other], [Non_speaker_other] or [Filled_pause]) is 72%.

1.3 Database contents definition

The final specification for the Spanish recordings is as follows:

1 isolated digit (prompted, word style)

2 connected digits: 4 digit sheet number; 16 digit credit card number

2 telephone numbers: 1 spontaneous; 1 prompted (9 digit)

2 natural numbers

1 number with decimal point

2 money amounts: 1 large amount; 1 small amount

3 spelled strings of letters

2 times: 1 time of day (spontaneous); 1 time phrase (prompted, word style)

3 dates: 1 date of birth (spontaneous); 2 dates (prompted, word style)

3 yes/no questions

1 city (province) name (spontaneous)

6 application words

2 additional application words

3 application phrases

9 sentences

1 spontaneous name

1.3.1 Isolated digits

The speaker is prompted for 1 isolated digit. The set of isolated digits is {uno, una, un, dos, tres, cuatro, cinco, seis, siete, ocho, nueve, cero}.

1.3.2 Connected digits

Two sets of connected digits are included:

i) The sheet identification number (a number in the range 0-9999)

ii) A 16 digit credit card number

The first 10 digits of the credit card number are generated at random, digits 11 and 12 contain their sum, and digits 13-16 are the sheet identification number. Each speaker therefore pronounces the identification number twice, for security purposes.

Example: 7168 3853 57 53 XXXX

Here 53 is the sum of the preceding 10 digits and XXXX is the sheet identification number.
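The construction of the credit card number can be sketched in Python as follows. This is an illustration of the rule stated above and not the generator actually used for the prompt sheets; the function names are hypothetical, and zero-padding of sums below 10 and of short sheet numbers is an assumption.

```python
import random

DIGITS = "0123456789"


def make_card_number(sheet_id: int, rng: random.Random) -> str:
    """Build a 16 digit number: 10 random digits, their sum, the sheet number."""
    if not 0 <= sheet_id <= 9999:
        raise ValueError("sheet number must be in the range 0-9999")
    head = [rng.randrange(10) for _ in range(10)]
    # The sum of 10 digits is at most 90, so it always fits in two digits.
    return "".join(map(str, head)) + f"{sum(head):02d}" + f"{sheet_id:04d}"


def split_card_number(text: str) -> tuple[bool, int]:
    """Return (checksum is consistent, sheet number) for a 16 digit number."""
    digits = [int(ch) for ch in text if ch in DIGITS]  # ignore spacing
    if len(digits) != 16:
        raise ValueError("expected 16 digits")
    checksum = digits[10] * 10 + digits[11]
    sheet_id = int("".join(map(str, digits[12:])))
    return sum(digits[:10]) == checksum, sheet_id
```

With the example above and a hypothetical sheet number of 0042, `split_card_number("7168 3853 57 53 0042")` returns `(True, 42)`, since 7+1+6+8+3+8+5+3+5+7 = 53.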

Digits are included in the following items: isolated digit, identification number, credit card number and telephone number. Most speakers pronounced natural numbers rather than strings of digits, so there is not always an example of every digit from each speaker.

1.3.3 Telephone numbers

Two telephone numbers are included:

i) Spontaneous

ii) Prompted 9 digit telephone number in typical Spanish style.

1.3.4 Number strings

Each speaker pronounces 2 natural numbers and a number with decimal point.

The set of numbers has been designed satisfying the following constraints:

- all the words that form part of numbers are included;

- for every word, all the phonetically relevant contexts are considered;

- it is intended to balance the appearances of both words and contexts.

1.3.5 Money amounts

There is one small money amount, chosen from 222 different quantities randomly distributed in the range 1-50000 pesetas.

There is one large money amount, chosen from 222 different quantities (multiples of 1000) randomly distributed in the range 100000-8000000 pesetas.

1.3.6 Transcription of Digits, Numbers and Amounts

Spanish numbers N, 0 ≤ N < 10^12, are formed by concatenating the words listed in Table N-1, in a similar way as in English. Tens and units are joined by the particle y, except when the tens digit is 1 or 2.

Group / Words
million / millón, millones
thousand / mil
hundred / cien, ciento, cientos, cientas
30...90 / treinta, cuarenta, cincuenta, sesenta, setenta, ochenta, noventa
20-29 / veinte, veintiuno, veintiún, veintiuna, veintidós, veintitrés, veinticuatro, veinticinco, veintiséis, veintisiete, veintiocho, veintinueve
10-19 / diez, once, doce, trece, catorce, quince, dieciséis, diecisiete, dieciocho, diecinueve
units / cero, uno, un, una, dos, tres, cuatro, cinco, seis, siete, ocho, nueve
particle / y

Table N-1

Example: 56 896 785

cincuenta y seis millones ochocientos noventa y seis mil setecientos ochenta y cinco
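The word formation rule can be illustrated with a minimal Python sketch that writes a number below 10^12 in words. It is not the UPC annotation package. The function names are hypothetical, the output uses standard masculine forms written as single words (e.g. ochocientos), and feminine agreement (una, veintiuna, -cientas) is not handled.

```python
UNDER_30 = [
    "cero", "uno", "dos", "tres", "cuatro", "cinco", "seis", "siete", "ocho",
    "nueve", "diez", "once", "doce", "trece", "catorce", "quince", "dieciséis",
    "diecisiete", "dieciocho", "diecinueve", "veinte", "veintiuno", "veintidós",
    "veintitrés", "veinticuatro", "veinticinco", "veintiséis", "veintisiete",
    "veintiocho", "veintinueve",
]
TENS = {3: "treinta", 4: "cuarenta", 5: "cincuenta", 6: "sesenta",
        7: "setenta", 8: "ochenta", 9: "noventa"}
HUNDREDS = {1: "ciento", 2: "doscientos", 3: "trescientos", 4: "cuatrocientos",
            5: "quinientos", 6: "seiscientos", 7: "setecientos",
            8: "ochocientos", 9: "novecientos"}


def below_1000(n: int) -> str:
    if n < 30:
        return UNDER_30[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        # Tens from 30 upwards take the particle "y" before the unit.
        return TENS[tens] if unit == 0 else f"{TENS[tens]} y {UNDER_30[unit]}"
    if n == 100:
        return "cien"
    hundreds, rest = divmod(n, 100)
    return HUNDREDS[hundreds] if rest == 0 else f"{HUNDREDS[hundreds]} {below_1000(rest)}"


def shorten_uno(words: str) -> str:
    """uno -> un and veintiuno -> veintiún before mil / millones."""
    if words.endswith("veintiuno"):
        return words[:-len("veintiuno")] + "veintiún"
    if words.endswith("uno"):
        return words[:-1]
    return words


def below_million(n: int) -> str:
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands == 1:
        parts.append("mil")
    elif thousands > 1:
        parts.append(shorten_uno(below_1000(thousands)) + " mil")
    if rest or not parts:
        parts.append(below_1000(rest))
    return " ".join(parts)


def to_spanish(n: int) -> str:
    """Write 0 <= n < 10**12 in Spanish words."""
    if not 0 <= n < 10**12:
        raise ValueError("number out of range")
    millions, rest = divmod(n, 10**6)
    parts = []
    if millions == 1:
        parts.append("un millón")
    elif millions > 1:
        parts.append(shorten_uno(below_million(millions)) + " millones")
    if rest or not parts:
        parts.append(below_million(rest))
    return " ".join(parts)
```

Calling `to_spanish(56896785)` reproduces the wording of the example above.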