WriCLE:

A learner corpus for Second Language Acquisition Research

Amaya Mendikoetxea

Department of English Philology

Universidad Autónoma de Madrid

Michael O’Donnell

Department of English Philology

Universidad Autónoma de Madrid

Paul Rollinson

Department of English Philology

Universidad Autónoma de Madrid

Abstract

The validity and reliability of Second Language Acquisition (SLA) hypotheses rest heavily on adequate methods of data gathering. In this paper, we analyse the reasons why many SLA researchers are still reluctant to use corpora and how good corpus design and adequate tools to annotate and search corpora could help overcome some of the problems observed. We do so by describing the key design principles of a learner corpus we are compiling (WriCLE: Written Corpus of Learner English), the way data has been collected and how it is being annotated using UAM CorpusTool. We then present some studies based on the WriCLE corpus and show how our corpus (and software) can be used to test current hypotheses in SLA. Our paper concludes with a brief analysis of future challenges for corpus-based SLA research.

1. Introduction

Much current SLA research relies on elicited experimental data and disfavours natural language use data. This situation, however, is beginning to change thanks to the availability of computerized linguistic databases and learner corpora. The area of linguistic inquiry known as ‘learner corpus research’ has recently come into being as a result of the confluence of two previously disparate fields: corpus linguistics and Second Language Acquisition (see Granger 2002, 2004; Barlow 2005). Despite the interest in what learner corpora can tell us about learner language, studies in corpus-based SLA research are mostly descriptive, focusing on differences between the language of native speakers and the learners’ interlanguage(s), as observed in the written performance of advanced learners from a variety of L1 backgrounds (see e.g. Granger 2002, 2004 and Myles 2005, 2007 for a similar evaluation). The aims of this paper are: (i) to evaluate the impact of learner corpus research on recent SLA theory; and (ii) to describe the WriCLE corpus and its contribution to this area of inquiry, as a corpus designed for SLA purposes.

We first present the case for the use of corpora in SLA research and evaluate the impact of learner corpora within this field (section 2). We then focus on the WriCLE corpus: key design principles, data collection and annotation (section 3). In section 4, we briefly describe two research projects that use WriCLE for SLA hypothesis-testing purposes and another project which uses WriCLE for pedagogical purposes, in particular to profile the grammatical competence of each learner proficiency level in order to facilitate curriculum design within Foreign Language Teaching (FLT). Finally, in the concluding section (section 5), we point out what we believe is the way forward for learner corpus research.

2. Corpora in SLA: why we need them

2.1. Gathering learner data

As pointed out by Granger (2002: 5), “much current SLA research favours experimental, metalinguistic and introspective data, and tends to be dismissive of natural language use data.” In order to understand why this is the case, we have to look at the purpose of SLA research. The main aim of SLA research is to build models of the underlying representations of learners at a particular stage in the process of L2 learning and of the developmental constraints that shape L2 production. The central source of evidence for these mental processes is the language produced by learners, whether spontaneously or through a variety of data elicitation procedures, as Myles (2005: 374) states. Findings in SLA depend heavily on these procedures, and the success of SLA research relies crucially on access to good quality data.

Several reasons can be given for why elicitation techniques are favoured in SLA research. For instance, Mackey & Gass (2005) provide the following reasons why metalinguistic data may be used in SLA research, as opposed to natural language use data: (i) the particular structure you want to investigate may not occur in natural production: it may be absent or there may not be enough instances, and, conversely, (ii) to answer your research question you may need to know what learners rule out as a possible L2 sentence: (a) presence of a particular structure/feature in the learners’ natural output does not necessarily indicate that the learners ‘know’ the structure, and (b) absence of a particular structure/feature in natural language use data does not necessarily indicate that learners do not ‘know’ the structure. An additional reason is provided by Granger (2002: 6): it is difficult to control the variables that affect learner production in a non-experimental context. The consequence of all this is that the empirical base of SLA research tends to be relatively narrow, based on the language produced by a very limited number of subjects, which, as pointed out by Granger (2002: 6), raises questions about whether results can be generalised.

Case studies and small-scale studies have greatly served the hypothesis-building endeavour in SLA research, but there are now many researchers who feel that the time has come to test hypotheses on larger and better constructed databases to see whether findings can be generalised (see Myles 2005) and to discover sets of data not normally found in small studies, which can become crucial in order to inform current debates (e.g. what aspects of grammar are more vulnerable to transfer or cross-linguistic influence). These are the main reasons for the use of corpora in SLA research, to which we can add another two which are common to corpus linguistics in general as a field of inquiry: to discover patterns of use and to carry out quantitative studies (e.g. of frequency).

2.2. Learner corpora in SLA research

Studies using learner corpora in SLA fall within two categories: (i) hypothesis-driven/corpus-based studies and (ii) hypothesis-finding/corpus-driven studies (see Granger 1998 and Tognini-Bonelli 2001). According to Barlow (2005: 344), the former involve using learner corpus data to test specific hypotheses or research questions about the nature of learner language generated through introspection, SLA theories, or as a result of the analysis of experimental or other sources of data. The latter involve investigating learner corpus data in a more exploratory way to discover patterns of data, which may then be used to generate hypotheses about learner language. The majority of studies within the area of learner corpus research fall within category (ii), as revealed by an analysis of the papers collected in recent edited volumes within the field (e.g. Granger et al. 2002 and Aston et al. 2004). For instance, Altenberg (2002) and Aijmer (2002) are good examples of the kinds of studies that corpus linguistics has made easier to carry out, according to Myles (2005): they rely on large written corpora and focus on the use of discrete items in different cross-sectional populations, typically one or more L1s compared with native use. Hypothesis-driven, corpus-based studies are hard to find (two examples from the volumes mentioned above are Housen 2002 and Tono 2004).

As Barlow (2005) points out, there are biases in practice, with the experimental/generative tradition favouring category (i) studies and corpus linguists favouring category (ii) studies, but on the whole it can be said that the contribution of learner corpus research so far has been much more substantial in the description than in the interpretation of SLA data (Granger 2004, Myles 2005). According to Granger (2004: 134-135), this is because learner corpus research has been mainly conducted by corpus linguists, rather than SLA specialists (Hasselgard 1999), and the type of learner language that researchers have been most interested in (intermediate to advanced) was so poorly described in the literature that they felt the need to establish the facts before launching into theoretical generalisations. That is, most work is descriptive, documenting differences between native and non-native English rather than explaining them. Additionally, there is very little or no reference to current debates and hypotheses in the SLA literature (Myles 2005). However, as pointed out by Myles (2005: 381): “such research is useful nonetheless, as we need to have good descriptions of learner language in order to inform our understanding of what shapes its development, but it is now time that corpus linguists and SLA specialists work more closely together in order to advance both their agendas”.

3. WriCLE: A Written Corpus of Learner English

The WriCLE corpus was created as part of the WOSLAC project (Word Order in Second Language Acquisition Corpora), whose purpose was to investigate the lexical, syntactic and discourse properties affecting word order in L2 English and L2 Spanish (see section 4 below for details).[1] In this section, we first describe the key design principles of WriCLE, as a corpus primarily intended for SLA research purposes. We then comment briefly on data collection and annotation.

3.1. Key design principles

3.1.1. Principle 1: Focus on written language

Though the majority of available learner corpora are written, there is some debate among learner corpus researchers as to whether written data, as opposed to spoken data, is a good representation of the learners’ underlying mental grammar, i.e. there are questions concerning the suitability of written production data for the testing of SLA hypotheses. Mitchell et al. (2008), for instance, advocate the use of spoken learner corpora, as spontaneous speech in naturalistic or semi-naturalistic settings is likely to provide more direct evidence of the L2 learner’s underlying interlanguage system. However, written corpora are often used to study native grammars and are considered to be a good reflection of the linguistic competence of native speakers, so why should that not be the case for L2 learners?

Without understating the validity of spoken data, there are several reasons why we believe that written data is adequate for SLA research. First, there are likely to be fewer performance errors in written language, and the errors found are those that escape monitoring, indicating grammatical or lexical gaps in the learners’ mental grammar. Second, learners tend to use more complex structures when they are writing, which could be more revealing in terms of their linguistic competence than the simplified language often found in oral production. Finally, written corpora are particularly suitable for studying the features of the interlanguage of advanced learners, especially in comparison with similar corpora of native speakers: (i) learner corpus research in the ICLE tradition (Granger et al. 2002) shows that advanced learner texts are a valuable source of data to study aspects such as modality, degree adverbs, tenses, collocations, phraseology, the expression of causativity, information structure, clefts, anaphora, etc. (see also section 2.2 above), and (ii) written corpora can also be used in hypothesis-testing studies: e.g. passivised structures and expletives (Oshita 2000, 2004) and the study of subject inversion in L2 English (Lozano & Mendikoetxea 2008, 2009, in press) (see section 4.1 below).

3.1.2. Principle 2: Authenticity

Following Sinclair’s (1996) definition of a corpus, Granger (2002: 7) defines learner corpora as: “electronic collections of authentic FL/SL textual data according to explicit design criteria for a particular SLA/FLT purpose” (FL: Foreign Language; SL: Second Language). But as Granger (2002) herself points out, learner data is rarely fully natural, especially in the case of EFL learners, who learn English in a classroom. Some researchers talk about a scale of naturalness: fully natural – product of teaching process – controlled task – scripted (Nesselhauf 2004: 128). In general, the more intervention by the researcher, the further away we are from ‘authentic’ data.

The texts compiled in WriCLE are argumentative or discussion essays written by learners of L2 English (L1 Spanish) outside the classroom environment for the Academic Writing component of English Language I and English Language III courses of a degree in English Studies in a Spanish university (see section 3.2 below for details). As such, they constitute ‘authentic’ (written) learner data: “data resulting from authentic classroom activity” (Granger 2002: 8).

3.1.3. Principle 3: Variety of learner levels

The corpus makes provision for learners at six different proficiency levels so as to maximize its usefulness for studying development in L2 English. A standardised proficiency measure is used: the levels correspond to those of the Common European Framework of Reference for Languages[2] and are determined by the score obtained in the Quick Placement Test (UCLES 2001) taken by each of the learners and available to researchers in a database that accompanies the corpus.[3] This is essential for SLA research: experimental and metalinguistic studies in SLA always provide a proficiency measure, as hypotheses and generalisations depend greatly on the learners’ proficiency level. Any corpus designed for SLA purposes must include a formal measure of proficiency, crucially for cross-sectional or longitudinal studies, but also when the analysis involves learners at a particular developmental stage.

3.1.4. Principle 4: Documentation

Information about learners and task is crucial for SLA research. As part of the WriCLE project, each learner fills in a Learner Profile questionnaire and for each essay there is an Essay Profile questionnaire (one profile per essay as a learner may contribute more than one essay) (see section 3.2 below). All this information, together with information about proficiency level (see section 3.1.3 above), year of collection, year of study, etc. is stored in a spreadsheet (Microsoft Excel 2007) and will be available to researchers together with the corpus.
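The relation between the two questionnaires (one Learner Profile per learner, one Essay Profile per essay) can be illustrated with a minimal Python sketch. It is an illustration only, not the actual WriCLE database: the class names, the field names (learner_id, cefr_level, year_of_study, essay_id, year_of_collection) and the function essays_at_level are hypothetical, introduced here to show how essays could be selected by the proficiency level of their author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerProfile:
    # Hypothetical fields; the real questionnaire records more information.
    learner_id: str
    cefr_level: str       # e.g. "B1", as assigned by the placement test
    year_of_study: int


@dataclass(frozen=True)
class EssayProfile:
    # One record per essay; several essays may share the same learner_id.
    essay_id: str
    learner_id: str
    year_of_collection: int


def essays_at_level(learners, essays, level):
    """Return the ids of all essays written by learners at the given level."""
    by_id = {learner.learner_id: learner for learner in learners}
    return [
        essay.essay_id
        for essay in essays
        if by_id[essay.learner_id].cefr_level == level
    ]
```

Keeping learner information separate from essay information means that a learner who contributes more than one essay is described only once, and every essay inherits the proficiency level through the learner identifier.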

3.1.5. Principle 5: Homogeneity

One of the criteria for corpus design in Sinclair (2005) is homogeneity: “a corpus should aim for homogeneity in its components, while maintaining adequate coverage, and rogue texts should be avoided” (Sinclair 2005, Criterion 9). In WriCLE, internal criteria guarantee homogeneity (essay topics, learner types, context of instruction, etc.). Rogue texts are manually eliminated by researchers (e.g. instances of plagiarism).

Homogeneity facilitates comparisons with similar corpora. This is an important design principle, as SLA research often involves comparisons between different groups of learners (different proficiency levels, different L1s, etc.), as well as between learners and natives. As part of our research project, we have created a comparable ‘mirror image’ corpus: CEDEL2: Corpus Escrito del Español como Segunda Lengua (Written Corpus of L2 Spanish), which contains a variety of texts written by learners of L2 Spanish (L1 English), and which has been designed along the same principles as WriCLE (see Lozano 2009, forthcoming). This allows researchers to explore issues to do with transfer: we can see, for instance, whether Spanish null subjects are transferred into L2 English, and conversely we can check whether English overt subjects are transferred into L2 Spanish. CEDEL2 also contains a subcorpus of native Spanish texts, which is our ‘control’ corpus for L2 Spanish, but can also be used to see if features of L1 Spanish are present in the L2 English of L1 Spanish speakers.

The two corpora designed by Granger and her team, ICLE (International Corpus of Learner English) and LOCNESS (Louvain Corpus of Native English Essays), are also comparable to WriCLE. Since ICLE contains several subcorpora depending on the learners’ L1, we can establish comparisons between different groups of learners according to their L1, and we can also compare the learners’ L2 English production with the natives’ L1 production in LOCNESS, following the methodological approach known within the field of learner corpus research as Contrastive Interlanguage Analysis (Granger 1996; Gilquin 2001).
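A comparison of this kind requires frequencies that are normalised for corpus size, since learner and native corpora rarely contain the same number of words. The following Python sketch shows one simple way of doing this, a rate per 1,000 tokens. It is an illustration under stated assumptions: the tokeniser is deliberately crude, the function names tokens and per_thousand are hypothetical, and it does not reproduce the procedures of ICLE, LOCNESS or UAM CorpusTool.

```python
import re


def tokens(text):
    """Crude tokeniser (an assumption): lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def per_thousand(texts, item):
    """Frequency of a word form per 1,000 tokens across a list of texts."""
    all_tokens = [token for text in texts for token in tokens(text)]
    if not all_tokens:
        return 0.0  # avoid division by zero for an empty (sub)corpus
    return 1000 * all_tokens.count(item) / len(all_tokens)


def compare(learner_texts, native_texts, item):
    """Return the normalised frequencies of an item in two (sub)corpora."""
    return {
        "learner": per_thousand(learner_texts, item),
        "native": per_thousand(native_texts, item),
    }
```

Calling compare with the essays of one learner group and a native control set returns two rates that can be set side by side regardless of the size of each corpus; whether a difference between them is significant would still need to be established with an appropriate statistical test.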

3.1.6. Principle 6: Annotation

Most learner corpora are ‘raw’, because of the difficulty of annotating learner language, which shows a high degree of variability. It is this high degree of variability that makes it difficult to use automatic annotation software, which has often been designed on the basis of the native language, so manual annotation is the norm for learner corpora. Ideally, standardised corpus annotation should be employed in order to ensure comparability between different annotated corpora (see Granger 2002: 10). Researchers in L1 acquisition have been employing standardised annotation for a few decades under the CHILDES project (MacWhinney 2000a, 2000b). This, together with the availability of the annotated texts for the use of other researchers, is in part responsible for the remarkable developments in L1 acquisition research in recent decades. Though some argue that CHILDES can clearly be adapted to the needs of SLA researchers (see Rutherford & Thomas 2001, Myles 2005 and Mitchell et al. 2008), where such research demands dealing with large quantities of L2 data and sophisticated coding schemes, questions arise concerning, for instance, its suitability for complex written data and its user-friendliness.

In the absence of a standard coding scheme for learner data, SLA researchers tend to develop their own annotation schemes, adapted to their specific projects. Annotation of WriCLE has been done using UAM CorpusTool (O’Donnell 2008): it is flexible (it can be adapted for each research project), it allows for semi-automatic annotation, it produces annotations in an exchangeable XML format, and it incorporates a statistical package, among other features (more details in section 3.3 below).

3.1.7. Principle 7: Accessibility

As discussed above, progress in L1 acquisition research is partly due to data sharing through the CHILDES project. Myles (2005) points out that if SLA research is going to experience a similar development, it is essential that data is shared in a similar fashion. This is why our intention is to make the complete corpus and database available to the academic community for research purposes.[4] Figure 1 shows the WriCLE homepage (in progress) and Figure 2 shows the online search interface, which is currently being developed (see note 4 below). It will incorporate the database described in section 3.1.4 above and will allow for the searching of subcorpora within WriCLE (e.g. according to proficiency level):

Figure 1: Homepage for WriCLE