
Design of a Multimedia Corpus of Austronesian Linguistics

Zhemin Lin, Li-May Sung, I-wen Su

Graduate Institute of Linguistics, National Taiwan University

Abstract. In this paper, we introduce the design of an integrated platform of multimedia online corpora that aims to serve both linguists and the public, along with its database schema and programming details. Compared with the Formosan language archive of Academia Sinica, our design places greater emphasis on the normalization, accessibility and interoperability of the system. The design of an automatically generated dictionary with cross-references and the capability of searching the entire database in various ways are also described.

1 Introduction

The development of natural language processing techniques and dynamic web pages has generated wide interest in the construction of integrated platforms that enable people to submit, browse and search collected texts in corpora. However, most online corpora are built specially for experts; they are sentence-based and do not provide multimedia content. The NTU corpus of Austronesian languages[1] introduced in this paper is an attempt to construct a multilingual online corpus with multimedia content that meets the needs of both linguists and the public. In the following sections, we briefly review previous work and then focus on the features of our current work.

2 Formosan Language Archive of Academia Sinica

Zeitoun et al. (2003) discussed some of the problems in the conservation of the Formosan Austronesian languages. The continuous enhancement of their work with many newly designed tools is further described in Zeitoun and Yu (2005). As discussed in the two articles, fieldwork data are rarely shared in the linguistic community. Collected materials are sometimes inaccessible even in the office where they are stored, owing to changes in storage media or to data damage. One of the most serious problems is that, although there are elicited sentences and recordings, few of them are rearranged and published. In response to these problems, researchers at the Academia Sinica have built a Formosan language archive: an online corpus with texts, translations, word glosses and sounds from native speakers of 14 languages and dialects.[2]

Despite their labour, there are insufficiencies in their system, one of them being theoretical: the Sinica corpora are sentence-based, so pauses, pause fillers, repetitions, intonation contours, IU boundaries and other discoursal clues are either discarded or missing. A sentence-based corpus excludes important linguistic information that is only present in discourse. Words in the system are written in an ad hoc style mixed with the International Phonetic Alphabet (IPA), a transcription style that prevents the respective native speakers from using the data directly. Nearly every word is altered to some extent. Example (1) is a Saisiyat example extracted from the Sinica archive.

(1)

(a)yao noka maʔiiæh ... hayðaʔʔæhæʔ maʔiiæh la m-waaiʔ, yao mina-ŋaʔŋaʔ nak hini mina-ʃaaəŋ.

(b)ʔinʔalay hikor may nak hini yakin, ʃβət yakin ho.

(c)ʔok-ik ʃəβət, m-waaiʔ nak hini pa-paʃœʃ, yao h<œm>ʃœʃ atomalan.

(Extracted from 05.002a -- 05.002c of “5. 我的故事 (My Story)” of the Sinica archive.)

There is so far no dictionary with a cross-referencing function in the Sinica corpus, even though cross-referencing in an online corpus is essential for researchers dealing with elicited or authentic data. Like KWIC (keyword-in-context, cf. Luhn (1960)), it allows a user to trace a word back to the context where it occurs and browse its surrounding IUs. Zeitoun et al. (2003) planned a data schema that ran on Microsoft Access. Their design, however, cannot take advantage of the SQL92 query language. Moreover, they designed their own XML dialect to improve interoperability, which does not encourage researchers to share their collected data in a convenient way. The Sinica archive, though primitive in design, is the first attempt to provide public access to these nearly extinct linguistic data, an effort highly commendable in itself.

3 NTU Corpus of Austronesian Languages

The system designed in this paper is based on the NTU corpus of Austronesian languages. The NTU corpus, first described in Huang, Su, and Sung (2003), is composed of spoken texts in various languages. Currently the NTU Saisiyat corpus contains 22 texts, 3081 intonation units (IUs) and approximately 10635 words, whose transcription follows the conventions of Du Bois (1993). The corpus comprises one conversation, eight narratives of indigenous legends, and thirteen elicited narratives based on the “Pear Stories” (5 narratives based on a six-minute colour mute film made by Wallace Chafe, see Chafe (1980)) and the “Frog Stories” (8 narratives based on a sketch book by Mayer (1980)). An example of an original data segment follows:

(2)

  1. ... (1.7) m-wa:i' 'aehae' ka
     af-come one nom
  2. ... (1.1) ma'iaeh ima h<oem>oehoe' ka siri'
     person asp <af>pull acc goat
  3. ... may hiza
     pass.by[af] there
  4. ... (1.9) ilahiza kabih
     move.to.that.place side

“(The man pulling a goat) passed by this way and went that way.” (Pear 3:9-12)

A spoken corpus, in contrast to a written one, is composed of utterances shorter than or equal to sentences, which are transcribed according to certain criteria, such as turn-taking, pauses, and ruptures in the intonation contours of a monologue (Tao 1996:35). Fig. 1 shows a unified intonation contour in a Praat[3] window.


Fig. 1. A unified intonation contour

When a corpus has been transcribed, tagged and analyzed, one needs a means to make it accessible to the public. An integrated platform that stores and presents the collected data while lowering the technological barriers to its further use is thus necessary. With the insufficiencies of the Sinica archive in mind, normalization, accessibility and interoperability are emphasized in the design of our system. The following guidelines are thus proposed.

(3) Guidelines of the integrated platform

(a) Easy to customize for most Austronesian languages

(b) Standardized procedures of transcription, annotation and processing

(c) Automatic extraction of morphosyntactic information to reduce repetitive human labor

(d) Web-based, unified input/output interface

(e) Searchable corpus that fits the needs of both linguists and the public

(f) Multimedia representation of collected texts

(g) Interoperable with other systems

(h) Cross-platform, operating-system independent

Below is a description of the input, processing and output of our system design.

3.1 Standardization of text commitment and standards of committed texts

The standardization comprises the procedure of handling transcribed texts, the transcription itself, and the morphosyntactic and discoursal codes used in the transcription. The procedure for handling collected texts is designed with low coupling in order to reduce complexity. Therefore, the flow of human manipulation in the system is almost unidirectional, as can be seen in Figure 2. Whenever a spoken text is collected, a worker transcribes it. Once the transcription is complete, it is given to the database maintainer for processing and storage. The web interface presents the corpus in the database, so that people on the other end of the Internet can browse and search it.


Fig. 2. Use cases of the system

The transcription follows Du Bois (1993), a de facto standard in the linguistic community. Word glosses and annotations follow a standardized coding list inherited from conventional mark-ups (cf. Appendix A) and the Leipzig glossing rules[4]. A standard operation is also set for the database maintainer to handle fieldwork collections as shown in Figure 3.


Fig.3. Standard operation of text commitment

The corpus in our system is stored in Unicode (UTF-8 encoding) to accommodate the potential need for IPA, Japanese, and annotations in other languages. If some of the tribes decide to adopt non-ASCII letters, such as “ɖ ʈ ɼ ɫ ʔ”, into their writing systems, the programs can process them correctly with no need for modification. As a Unicode BOM (byte-order mark, U+FEFF) is prepended at the beginning of a file in Microsoft Windows but is absent in Unix-based systems, the mark may cause problems when reading files edited under different operating systems. It is properly dealt with in order to fulfil the criterion of platform independence.
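The BOM handling described above can be sketched as follows; the function name is illustrative, not part of the system:

```python
import codecs

def read_committed_text(path):
    """Read a committed UTF-8 text file, stripping a leading BOM if present.

    Files edited under Microsoft Windows may begin with U+FEFF, while
    Unix-edited files do not; removing it keeps later processing uniform.
    """
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")
```

With this guard in place, files from either operating system decode to the same string.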

A set of metadata is defined in the head of committed files. An example of the head of a committed text is given in (4) and the description of the fields is shown in Table 1.

(4)

Topic: Pear story

Type: Narrative

Language: Kavalan

Dialect: Xinshe

Speaker: Imui,潘金妹,F,1952

Time: 00:01:15

Total IUs: 31

Collected: 2003-05-30

Revised: 2003-11-11

Transcribed by: 葉俞廷,王以勤

Double checked: 鍾曉芳,沈嘉琪,葉俞廷

Table 1. Metadata of committed data

Field name / Description / Format
Topic / Topic of text / String (e.g., Pear Story)
Type / Style of text / Narrative|Conversation|...
Language / Language of text / String, first letter in capital
Dialect / Dialect or district / String
Speaker / Base data of the informant / Native/Chinese name, gender, year of birth
Time / Length of recording / hh:mm:ss
Total IUs / Number of IUs in text / Numeric
Collected / Date of recording / yyyy-mm-dd
Revised / Date of latest revision / yyyy-mm-dd
Transcribed by / Transcribers and annotators / Comma-separated string
Double checked / Inspectors of text / Comma separated string

The text following the metadata is described below.

(5)

5.[IU #, with a period in the end]

.. qay- .. qay-byabas 'nay ,_ [words separated by spaces]

QAY-guava that [English gloss separated by spaces]

QAY-芭樂 那 [Chinese gloss separated by spaces]

6.

... razat 'nay nani.\

person that DM

人 那 DM

#e That person picked guavas. Then,

#c 那個人採芭樂。然後,

#n Elicitation notes

#n (More elicitation notes)

Lines beginning with a sharp (#) are processor instructions (PIs). “#e” indicates a line of English translation of the paragraph composed of the IUs from the last translation up to the current one. “#c” marks a Chinese translation, and “#n” an elicitation note. It is possible to have more than one note. The alignment of native words and glosses is done automatically. Morpheme boundaries, morphological information and word senses are extracted using the techniques introduced in Lin (2005: Chapters 2 and 4.2).
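The PI handling just described can be sketched as follows; the function name and the returned structure are illustrative assumptions:

```python
def parse_pi_lines(lines):
    """Collect translations and notes from processor-instruction lines.

    "#e" carries an English translation, "#c" a Chinese one, and "#n"
    an elicitation note; several "#n" lines may follow one paragraph.
    """
    result = {"english": None, "chinese": None, "notes": []}
    for line in lines:
        if line.startswith("#e "):
            result["english"] = line[3:].strip()
        elif line.startswith("#c "):
            result["chinese"] = line[3:].strip()
        elif line.startswith("#n "):
            result["notes"].append(line[3:].strip())
    return result
```

For the paragraph in (5), the parser would collect one English translation, one Chinese translation and the list of elicitation notes.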

As the transcription is supposed to reflect, more or less, the actual pronunciation of an informant, spelling may vary slightly from word to word. For the system not to be confused by these variations, a feature vector is configured for each Formosan language. A vector describes how to reduce the variants to a simpler form. For example, the pronunciations of a and ae are quite similar in Saisiyat, and glottal stops are sometimes omitted: 'aehae' “one” is often spelled 'ahae or aehae. Below are the feature vectors of Saisiyat and Kavalan.[5]

Saisiyat: ae → a, oe → o, S → s, ' → ∅

Kavalan: th → l, d → l, ' → ∅
Kavalan:th → l,d → l,' → ∅

A string substitution is executed before any operation on the database in order to prevent possible duplicate entries; otherwise, full-text search may fail to work.
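A minimal sketch of this reduction step follows (the function name and dictionary layout are assumptions; cf. the program simplify.py in Section 3.3):

```python
# Each language's feature vector maps variant graphemes to a canonical
# form; the substitutions are applied in order.
FEATURE_VECTORS = {
    "saisiyat": [("ae", "a"), ("oe", "o"), ("S", "s"), ("'", "")],
    "kavalan":  [("th", "l"), ("d", "l"), ("'", "")],
}

def simplify(word, language):
    """Reduce spelling variants of a word to its canonical simplified form."""
    for variant, canonical in FEATURE_VECTORS[language]:
        word = word.replace(variant, canonical)
    return word
```

Under these vectors, the Saisiyat variants 'aehae', 'ahae and aehae all reduce to the same simplified string, so a search on any variant hits the same entry.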

3.2 Database design

Database design affects the efficiency of search and storage. To simplify programming logic and allow high-speed queries, we propose a schema that differs from the Sinica archive's. Any relational database engine that follows the SQL92 standard can be used to implement the schema. Among relational database systems, SQLite[6] is recommended for the following reasons:

  1. It is light-weight, fast and platform independent.
  2. A database is stored in a single file, thus is easy to maintain.
  3. It supports UTF-8 encoding.
  4. It is free software.

One Formosan language is placed in one database and is thus stored in a single file. The schema of every language is identical, so that cross-linguistic searches can be executed from a single page. It is often argued that a database has to be normalized to third normal form.[7] For the sake of efficiency, however, our design relaxes this requirement. The relational diagram of tables in the database is shown in Figure 4. A full list of the database schema is given in Appendix B.


Fig.4. Relational diagram of tables in the database

The text is mainly stored in Table “iu”. In contrast to the word-based design in the Sinica archive, every intonation unit is stored in one row. For example,

article: pear3

nat: ...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw

sim: . ima homangaw kasnaitol ray kahoy babaw

eng: . Asp set_a_ladder-AF move_up-AF Loc tree above

For a full-text search, a simple “%keyword%” query on every field listed above returns the correct results. The simplified spelling is stored for searching among spelling variants. Words in the database are separated by a single space, so that they are easily processed by a single function (explode() in PHP or split() in Python). Positions where no gloss is available are filled with a period (“.”); thus, words and glosses are always aligned across the fields.
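Assuming SQLite as recommended above, the full-text query can be sketched as follows; the column names follow the example row, while the function itself is illustrative:

```python
import sqlite3

def search_ius(db_path, keyword):
    """Return IUs whose original, simplified or glossed text contains keyword.

    A simple LIKE query over the nat, sim and eng fields of Table "iu"
    implements the "%keyword%" full-text search described above.
    """
    pattern = "%" + keyword + "%"
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT article, nat FROM iu "
            "WHERE nat LIKE ? OR sim LIKE ? OR eng LIKE ?",
            (pattern, pattern, pattern),
        )
        return cur.fetchall()
    finally:
        conn.close()
```

Because the simplified spelling (sim) is among the searched fields, a query on any spelling variant of a word still finds the IU.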

Another specialized data structure is designed in Table “lemma”. In order to properly search an affix, the stem is marked for every word in the dictionary. The morpheme before the stem is a prefix and the one after it a suffix. For example, Saisiyat kapapama'an 'bicycle' is stored as ka-#papama'#-an in the table. If one looks for a prefix ka- or a suffix -an, one can always obtain the right answer by taking the elements before the first sharp (#) or after the second sharp. Since infixation is simple in the two languages, it is currently analyzed on the fly by external programs.[8]
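The affix lookup described above can be sketched as follows; the function names are illustrative:

```python
def split_lemma(entry):
    """Return (prefix, stem, suffix) from a stem-marked lemma entry.

    The stem is delimited by two sharps, e.g. Saisiyat kapapama'an
    'bicycle' is stored as ka-#papama'#-an.
    """
    prefix, stem, suffix = entry.split("#")
    return prefix, stem, suffix

def has_prefix(entry, prefix):
    """True if the lemma entry carries the given prefix."""
    return split_lemma(entry)[0] == prefix

def has_suffix(entry, suffix):
    """True if the lemma entry carries the given suffix."""
    return split_lemma(entry)[2] == suffix
```

A word without affixes is stored with empty strings around the sharps, so the same three-way split applies uniformly.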

3.3 Back-end programs and the POS tagger

The database maintainer commits a pre-processed transcription into the database through a batch of back-end programs. Commitment is preferably done at the command line, so that mismatches in alignment or failures of the automated morphological analysis can be corrected immediately and interactively. A prototype has been implemented to demonstrate the feasibility of the system. Here is a list of the programs.

features.py

defines language-specific feature-vectors and provides connection DSN.

simplify.py

is the common library for reducing spelling variants.

canon.py

checks input validity, including metadata and text format. It writes the data into the database when the check passes.

extractmorph.py

defines morphological and discoursal codes and extracts them from the texts.

makedict.py

extracts information from imported texts and updates the dictionary.

mp3splt.py/mpgsplt.py

splits .mp3 / .mpg files according to the time-file (see below).

tidy.py

a utility that converts Chinese punctuation into ASCII and removes unnecessary Microsoft Word mark-up.

The coupling of the modules is fairly low. “features.py” and “simplify.py” provide the necessary functions for all programs.

Once texts have been put into the database, they are tagged by a TBL tagger (cf. Lin 2005: Chapter 2), and the dictionary is updated at the same time. When a user looks up a word, its part-of-speech information can be obtained along with its frequency in the corpus. Whenever the database maintainer finds an error in the tagged corpus, it can be corrected online as immediate feedback to the tagger. The tagger can later be retrained with a single click.

3.4 Unified output interface

For the corpus to be accessible to the public, a unified, user-friendly interface has been built. The system follows HTML 4.01 (loose) as proposed by the World Wide Web Consortium[9] and is designed to be browsed with a web browser, since this is one of the major means of accessing data on the Internet. For a dynamic and interactive representation, the Document Object Model (DOM)[10] and JavaScript 1.2 are used. Popular browsers, such as Internet Explorer 5.0, Mozilla 1.7, Firefox 0.9 and Opera 4, are compliant with these standards. Supporting the major browsers is important for accessibility. Figure 5 is a screen dump of the web site under construction.


Fig.5. Screen dump of the web site (under construction)

The interface is composed of the following parts: a window with the informant's photo where movie clips are played, a list of metadata in the upper-left corner, several switches to adjust browsing effects, and a frame at the bottom of the screen that renders the selected article in a format following linguistic convention. A dictionary pops up whenever a user clicks on an unknown word (see Figure 6). The pages are being revised for better visual effect.

Ethnological notes and examples are given in the dictionary with cross-references. The search interface is simple in design, yet complicated and specialized linguistic queries remain possible. For example, by typing tabatathan a user can find the occurrences of the Kavalan word ta-batad-an; typing 'ahae or aehae turns up 'aehae' in Saisiyat, and so on. Interfaces to user-defined functions (UDFs) are also provided for further improvement.


Fig. 6. Pop-up dictionary with cross-references

As bandwidth is quite limited, it is suggested that multimedia data be stored and transferred as 16 Kbps, 11 kHz MPEG-1 Layer 3 for audio and as MPEG-1 for video.

3.5 Interoperability

It is important to share the corpus with the linguistic community. The Extensible Markup Language (XML)[11] is a simple and flexible language used to exchange data between different systems; it is now a de facto standard on the web. For researchers in natural language processing to profit easily from our collected data, the corpus should be exportable in XML. The morphological information, gloss and part of speech of every word may be output in a uniform manner. An example of the exported format is given below.

<?xml version="1.0" encoding="utf-8" ?>

<article id="pear_imui">

<topic>Pear Story</topic>

<language>Kavalan</language>

<dialect>Xinshe</dialect>

<speaker>

<natname>imui</natname>

<chnname>潘金妹</chnname>

<gender>F</gender>

<age-of-record>51</age-of-record>

</speaker>

<duration>00:01:15</duration>

<total-iu>31</total-iu>

<collected>2003-05-30</collected>

<revised>2003-11-11</revised>

<transcriber>葉俞廷</transcriber>

<transcriber>王以勤</transcriber>

<doublecheck>鍾曉芳</doublecheck>

<doublecheck>沈嘉琪</doublecheck>

<doublecheck>葉俞廷</doublecheck>

<text>

<iu id="iu_1">

<word>

<nat>tangi</nat>

<sim>tangi</sim>

<eng>today</eng>

<chn>今天</chn>

<pos>RB</pos>

</word>

<word>

...

</word>

</iu>

<iu id="iu_2"> ... </iu>

...

<para von="1" bis="4">