Constructing and evaluating Web corpora: ukWaC

Adriano Ferraresi

SITLeC – University of Bologna (Forlì)

Corso Diaz 64, 47100 Forlì – Italy

ABSTRACT

This paper reports on the construction and evaluation of a very large Web corpus of English. The corpus, called ukWaC, was obtained through a crawl of Web pages in the .uk domain, and in its final version contains around two billion words. It aims to provide a general-purpose linguistic resource for the study of (British) English. Ideally, this new resource would be comparable to the widely used British National Corpus, while containing up-to-date and substantially larger quantities of language data. As with all corpora built using semi-automated procedures, the possibility of controlling the materials that end up in the corpus is limited, and post hoc evaluation is needed to appraise actual corpus composition. An evaluation method along the lines of Sharoff (2006) is proposed and applied, which involves a comparison between ukWaC and the BNC. Different wordlists are created for the main part-of-speech categories (i.e. nouns, verbs, adjectives, -ly adverbs and function words), which are then compared via the log-likelihood measure, thus grouping words that are relatively more typical of one corpus with respect to the other (Rayson and Garside 2000). Results suggest that the two corpora differ insofar as ukWaC contains a higher proportion of texts related to the Web, education and “public sphere” issues, while the BNC contains more fiction and spoken texts. The paper concludes by discussing some of the issues and challenges raised by research on the construction and evaluation of Web corpora.

  1. INTRODUCTION

As an immense, free, and easily accessible resource, the World Wide Web has increasingly been used as a source of linguistic data, especially when existing resources prove inadequate to answer certain research questions (Kilgarriff and Grefenstette, 2003). This is the case, for instance, when less common or relatively new linguistic phenomena are the object of study, and well-established, but somewhat small (or “old”), collections of texts provide insufficient evidence for analysis. In other cases, such as the study of specialized linguistic sub-domains or of minority languages, no resource exists (Scannell, 2007). To indicate such uses of the Web within language studies, the expression “Web as corpus” (or “WaC”) is often used (Baroni and Bernardini, 2006).

This paper reports on a “WaC” construction effort, and presents a preliminary evaluation of the new resource created, called ukWaC (because it is a Web-derived Corpus of English constructed by sampling .uk sites). ukWaC, whose linguistic materials were retrieved by Web crawling, contains around two billion tokens, and features basic linguistic annotation (part-of-speech tagging and lemmatization). It was built with the intention of providing a very large and up-to-date resource that would be comparable, in terms of “balancedness” and variety of linguistic materials, to traditional general-purpose corpora (in particular, the British National Corpus, a well-established standard for British English). As is the case for all corpora built with semi-automated procedures, however, the possibility of controlling the materials that end up in the final corpus is limited. This makes post hoc evaluation a crucial task for the purpose of appraising actual corpus composition. A corpus evaluation method is therefore proposed and applied to the task of comparing ukWaC and the BNC.

The paper is structured as follows: Section 2 briefly outlines the procedure that was followed to construct ukWaC. In Section 3, a vocabulary-based comparison between ukWaC and the BNC is carried out, which sheds light on the main differences between the two corpora. Section 4 concludes by discussing some of the issues and challenges raised by research on the construction and evaluation of Web corpora, such as the need for better data cleaning techniques and more thorough corpus evaluation methods.

  2. BUILDING A VERY LARGE WEB-DERIVED CORPUS: ukWaC

This Section briefly describes how the ukWaC corpus was built. It should be noted that the construction of ukWaC follows that of two similar corpora of German (deWaC) and Italian (itWaC), and that these three resources are among the achievements of an international research project called WaCky (Web as Corpus kool ynitiative).[1] Since the procedure developed within this project, which is described in detail in Baroni and Kilgarriff (2006), and Baroni and Ueyama (2006), is largely language-independent, in this Section attention will be paid only to those aspects peculiar to the construction of ukWaC. We will focus in particular on the first steps of the procedure, i.e. “seed” URL selection and crawling (cf. Section 2.1), during which critical decisions regarding the document sampling strategy are taken.

2.1 Seed selection and crawling

Our aim was to set up a resource comparable to more traditional general language corpora, containing a wide range of text types and topics. These should include both ‘pre-Web’ texts of a varied nature that can also be found in electronic format on the Web (spanning from sermons to recipes, from technical manuals to short stories, and ideally including transcripts of spoken language as well), and texts representing Web-based genres (Santini and Sharoff, 2007), like personal pages, blogs, or postings in forums. It should be noted that the goal here was for the corpus to be representative of the language of interest, i.e. contemporary British English, rather than being representative of the language of the Web.

The first step consisted in identifying sets of seed URLs which ensured variety in terms of content and genre. In order to find these, 2,000 pairs of randomly selected content words were submitted to Google. Previous research on the effects of seed selection upon the resulting Web corpus (Ueyama, 2006) suggested that automatic queries to Google which include words sampled from traditional written sources such as newspapers and reference corpus materials tend to yield ‘public sphere’ documents, such as academic and journalistic texts addressing socio-political issues and the like. Issuing queries with words sampled from a basic vocabulary list, on the contrary, tends to produce corpora featuring ‘personal interest’ pages, like blogs or bulletin boards. Since it is desirable that both kinds of documents are included in the corpus, different seed sources were sampled.

Three sets of queries were generated: the first set (1,000 word pairs) was obtained by combining mid-frequency content words from the BNC. The second list of queries (500 word pairs) was generated from a vocabulary list for foreign learners of English,[2] which (however counter-intuitively) contains rather formal and low-frequency vocabulary, possibly required for academic study in English. In order to obtain more basic, informal words, the third list (500 word pairs) was created by randomly combining words sampled from the spoken section of the BNC.
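As an illustration of this pairing step, the following sketch shows how two-word queries could be generated from plain word lists; the file names and the sampling routine are assumptions for the example, not the project’s actual scripts.

    import random

    def make_query_pairs(words, n_pairs, seed=0):
        """Combine randomly selected content words into two-word search queries."""
        rng = random.Random(seed)
        pairs = set()
        while len(pairs) < n_pairs:
            w1, w2 = rng.sample(words, 2)   # two distinct words per query
            pairs.add(f"{w1} {w2}")
        return sorted(pairs)

    # Hypothetical word lists standing in for the three seed sources described above
    bnc_mid_freq  = open("bnc_mid_frequency_words.txt").read().split()
    learner_vocab = open("learner_vocabulary_list.txt").read().split()
    bnc_spoken    = open("bnc_spoken_words.txt").read().split()

    queries = (make_query_pairs(bnc_mid_freq, 1000)
               + make_query_pairs(learner_vocab, 500)
               + make_query_pairs(bnc_spoken, 500))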

The URLs obtained from Google were fed to the Heritrix crawler[3] in random order, and the crawl was limited to pages in the .uk Web domain whose URL does not end in a suffix cuing non-html data (.wav, .jpg, etc.). While this does not ensure that all the retrieved pages represent British English, the restriction was nonetheless used as a simple heuristic to retrieve the largest possible number of pages which are (supposedly) published in the United Kingdom.
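In practice these constraints are enforced through the crawler’s own scope settings; the sketch below merely restates them in Python, with an assumed (non-exhaustive) list of suffixes.

    import random
    from urllib.parse import urlparse

    # Illustrative, non-exhaustive list of suffixes cuing non-html content
    NON_HTML_SUFFIXES = (".wav", ".jpg", ".jpeg", ".gif", ".png", ".pdf", ".zip", ".mp3")

    def keep_url(url):
        """Keep only pages hosted in the .uk domain whose URL does not cue non-html data."""
        parsed = urlparse(url)
        return (parsed.netloc.lower().endswith(".uk")
                and not parsed.path.lower().endswith(NON_HTML_SUFFIXES))

    def prepare_seeds(urls, seed=0):
        """Filter the URLs returned by the search engine and shuffle them for the crawler."""
        kept = [u for u in urls if keep_url(u)]
        random.Random(seed).shuffle(kept)
        return kept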

2.2 Post-crawl cleaning and annotation

The documents retrieved by crawling underwent various cleaning steps, meant to drastically reduce noise in the data. First, only documents that were of mime type text/html and between 5 and 200KB in size were kept for further processing. As observed by Fletcher (2004), very small documents tend to contain little genuine text (5KB counts as ‘very small’ because of the html code overhead) and very large documents tend to be lists of various sorts, such as library indices, store catalogs, etc. We also identified and removed all documents that had perfect duplicates in the collection, since these turned out to be mainly repeated instances of warning messages, copyright statements and the like. While in this way we might also have discarded relevant content, the guiding principle in our Web-as-corpus construction approach is that of privileging precision over recall, given the vastness of the data source.
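A minimal sketch of these two filters follows, assuming one retrieved document per file; hashing the raw content is just one convenient way of spotting perfect duplicates.

    import hashlib
    import os
    from collections import Counter

    MIN_SIZE = 5 * 1024     # below ~5 KB a page is mostly html overhead
    MAX_SIZE = 200 * 1024   # above ~200 KB pages tend to be lists and catalogues

    def size_ok(path):
        return MIN_SIZE <= os.path.getsize(path) <= MAX_SIZE

    def drop_perfect_duplicates(paths):
        """Remove every document whose content occurs more than once in the collection."""
        digests = {p: hashlib.md5(open(p, "rb").read()).hexdigest() for p in paths}
        counts = Counter(digests.values())
        return [p for p in paths if counts[digests[p]] == 1]

    # usage (assuming 'crawled_files' lists the documents kept after mime-type filtering):
    # candidates = [p for p in crawled_files if size_ok(p)]
    # survivors  = drop_perfect_duplicates(candidates)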

All the documents that passed this pre-filtering stage underwent further cleaning based on their contents. First, we had to remove code (html and javascript), together with the so-called ‘boilerplate’, i.e., following Fletcher (2004), all those parts of Web documents which tend to be the same across many pages (for instance disclaimers, navigation bars, etc.), and which are poor in human-produced connected text. From the point of view of our target user, boilerplate identification is critical, since too much boilerplate will invalidate statistics collected from the corpus and impair attempts to analyze the text by looking at KWiC concordances.
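The boilerplate-removal step itself is not described in detail here; purely as a loose illustration, the sketch below applies a generic tag-density heuristic (keep long, markup-poor blocks), which is one common way of approaching the problem rather than the method actually used for ukWaC.

    import re

    TAG_RE    = re.compile(r"<[^>]+>")
    SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)
    BLOCK_RE  = re.compile(r"</?(?:div|p|td|table)[^>]*>", re.I)

    def text_density(block):
        """Proportion of characters left once markup is stripped from an html block."""
        return len(TAG_RE.sub(" ", block).strip()) / max(len(block), 1)

    def strip_boilerplate(html, min_density=0.5, min_length=200):
        """Keep only blocks that are long and markup-poor; navigation bars, disclaimers
        and similar boilerplate are typically short and tag-dense, so they are dropped."""
        html = SCRIPT_RE.sub(" ", html)   # remove embedded code before measuring density
        blocks = BLOCK_RE.split(html)
        kept = [TAG_RE.sub(" ", b) for b in blocks
                if len(b) >= min_length and text_density(b) >= min_density]
        return "\n".join(" ".join(b.split()) for b in kept)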

Relatively simple language filters were then applied to the remaining documents, so as to discard documents in foreign languages and machine-generated text, such as that used in pornographic pages to “trick” search engines. Finally, near-duplicate documents, i.e. documents sharing considerable portions of text, were identified and discarded through a re-implementation of the “shingling” algorithm proposed by Broder et al. (1997).
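A simplified sketch of the shingling idea is given below: each document is reduced to a set of word n-grams (“shingles”), compacted by keeping the k smallest hash values, and two documents are treated as near-duplicates when the estimated overlap between their sketches exceeds a threshold. The parameter values (n, k, threshold) are illustrative assumptions, not those used for ukWaC.

    import hashlib

    def shingles(text, n=5):
        """Set of word n-grams ('shingles') representing a document."""
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def sketch(text, n=5, k=25):
        """Compact fingerprint: the k smallest hash values of the document's shingles."""
        hashes = sorted(int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
                        for s in shingles(text, n))
        return frozenset(hashes[:k])

    def resemblance(sketch_a, sketch_b):
        """Rough estimate of the Jaccard overlap between two documents' shingle sets."""
        if not sketch_a or not sketch_b:
            return 0.0
        return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)

    # e.g. treat two documents as near-duplicates when resemblance(...) > 0.5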

At this point, we enriched the surviving text with part-of-speech and lemma information, using the TreeTagger.[4]
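The tagging step amounts to running every cleaned document through TreeTagger and storing the token/tag/lemma triples it outputs. A minimal wrapper might look as follows; the script name and its location depend on the local TreeTagger installation, so treat them as placeholders.

    import subprocess

    def tag_with_treetagger(text, command="tree-tagger-english"):
        """Feed plain text to a TreeTagger wrapper script and parse its
        one-token-per-line output (token, part-of-speech tag, lemma)."""
        result = subprocess.run([command], input=text, capture_output=True,
                                text=True, check=True)
        return [tuple(line.split("\t"))
                for line in result.stdout.splitlines() if line.strip()]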

  3. EVALUATING ukWaC THROUGH A VOCABULARY-BASED COMPARISON

When corpora are built through automated procedures, as is the case for ukWaC, there is limited control over the contents that make up the final version of the corpus. Post hoc evaluation is therefore needed to appraise actual corpus composition. Along the lines of Sharoff (2006), here we provide a qualitative evaluation of our Web corpus based on a vocabulary comparison with the widely used BNC, which was re-tagged for the purposes of this analysis using the same tools as ukWaC (cf. Section 2.2).

Separate lists of nouns were created for ukWaC and the BNC, which were then compared via the log-likelihood association measure (Dunning, 1993).[5] Relying on the tagger’s output, the procedure makes it possible to identify the word items tagged as nouns that are most typical of either corpus when compared to the other.
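The comparison can be reproduced along the following lines. Given a word’s frequency in each corpus and the two corpus sizes, its log-likelihood score is computed from observed and expected frequencies as in Rayson and Garside (2000); the frequency dictionaries are assumed to have been extracted from the tagged corpora beforehand.

    import math

    def log_likelihood(freq1, freq2, size1, size2):
        """Log-likelihood score for one word (Rayson and Garside 2000):
        freq1/freq2 are its frequencies in the two corpora,
        size1/size2 the corpora's total token counts."""
        expected1 = size1 * (freq1 + freq2) / (size1 + size2)
        expected2 = size2 * (freq1 + freq2) / (size1 + size2)
        ll = 0.0
        if freq1 > 0:
            ll += freq1 * math.log(freq1 / expected1)
        if freq2 > 0:
            ll += freq2 * math.log(freq2 / expected2)
        return 2 * ll

    def most_typical(freqs1, freqs2, size1, size2, top=50):
        """Words over-represented in corpus 1 relative to corpus 2, ranked by log-likelihood."""
        scored = []
        for word, f1 in freqs1.items():
            f2 = freqs2.get(word, 0)
            if f1 / size1 > f2 / size2:   # keep only words relatively more frequent in corpus 1
                scored.append((log_likelihood(f1, f2, size1, size2), word))
        return [word for score, word in sorted(scored, reverse=True)[:top]]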

For each of the 50 words with the highest log-likelihood ratio, 250 randomly selected concordances were retrieved and analyzed, and the associated URL was also checked to gather additional contextual information on the page.
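The concordance step can be sketched as a simple KWiC extraction over the tokenized corpus, with a random sample of hits per node word; the context size and the exact-match criterion are simplifications for the sake of illustration.

    import random

    def kwic_sample(tokens, node, n=250, context=8, seed=0):
        """Return up to n randomly sampled KWiC concordance lines for a node word."""
        hits = [i for i, tok in enumerate(tokens) if tok.lower() == node.lower()]
        chosen = random.Random(seed).sample(hits, min(n, len(hits)))
        lines = []
        for i in sorted(chosen):
            left  = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>60}  [{tokens[i]}]  {right}")
        return lines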

Based on their contexts of use, the nouns that turn out to be the most typical of ukWaC when compared to the BNC belong to three main semantic fields[6] (see Table 1 for some examples), i.e., (a) computers and the Web, (b) education, and (c) what may be called ‘public sphere’ issues. In category (a) we find words like website, link, and browser. These nouns are distributed across a wide variety of text types, ranging from online tutorials to promotional texts introducing, e.g., a Web-based service. Unsurprisingly, a word which has become part of everyday language like website does not appear at all in a corpus like the BNC, constructed in the early nineties.

The analysis of the concordances and associated URLs for nouns belonging to categories (b) (e.g., students, research) and (c) (e.g., organisations, nhs, health) suggests that their (relatively) high frequency can be explained by the considerable presence in ukWaC of certain entities responsible for the publishing of Web contents. These are either universities – in the case of (b) – or non-governmental organizations or departments of the government – in the case of (c). Typical topics dealt with in these texts are, on the one hand, education and training and, on the other, public interest issues, such as assistance for citizens in need. What is most remarkable is the variety of the text genres which are featured. As pointed out by Thelwall (2005), academic sites may contain very different types of texts, whose communicative intention and register can differ substantially. We find ‘traditional’ texts, like online prospectuses for students and academic papers, as well as ‘new’ Web-related genres like homepages of research groups. In the same way, the concordances of a word like nhs reveal that the acronym is distributed across text types as diverse as newspaper articles regarding quality issues in the services for patients and forum postings on the treatment of diseases.

ukWaC
  Web and computers      Education     Public sphere
  website, link          students      services
  site, data             skills        organisations
  click, download        projects      nhs
  web, file              research      health
  email, browser         projects      support

BNC
  Imaginative            Spoken        Politics and economy
  eyes, door             er            government
  man, house             cos           recession
  face, hair             sort          plaintiff
  mother, smile          mhm           party

Table 1. Examples of nouns typical of ukWaC and the BNC by semantic field.

The nouns most typical of the BNC compared to ukWaC can also be grouped into three macro-categories (examples are provided in Table 1), i.e., (a) nouns related to the description of people or objects, (b) expressions which are frequent in the spoken language (or, more precisely, typical transcriptions of such expressions), and (c) words related to politics, economy and public institutions. The words included in category (a) are names of body parts, like eyes and face; words used to refer to people, such as man and mother; and names of objects and places, like door and house. All of these share the common feature of appearing in the clear majority of cases in texts classified by Lee (2001) as ‘imaginative’ or ‘fiction/prose’. As an example, eyes appears 74% of the time in BNC ‘fiction/prose’ texts, and man appears in this type of text almost 41% of the time. In general, what can be inferred from the data is that, compared to ukWaC, the BNC seems to contain a higher proportion of narrative fiction texts, confirming that “texts aimed at recreation [such as fiction] are treated as an important category in traditional corpora” (Sharoff, 2006: 85), whereas they are rarer in Web corpora. This may be due to the nature of the Web itself, since copyright restrictions often prevent published fiction texts from being freely available online.

Category (b) includes expressions which are typically associated with the spoken language, including graphical transcriptions of hesitations, backchannels and reduced forms. Among these we find er, cos, mhm, which appear most frequently in the spoken part of the BNC. These words are clearly not nouns. However, since the same tagging method was applied to the two corpora, it is likely that they really are more typical of the BNC, inasmuch as their relatively higher frequency cannot be accounted for by differences in tagger behavior. A noun like sort is also frequently featured in the spoken section of the BNC, being often found in the expression sort of. Spoken language is obviously less well represented in ukWaC than in the BNC, which was designed to contain 10% transcribed speech. This does not mean, however, that spoken-like genres are absent from ukWaC: blogs and online discussion forums, for instance, reproduce informal, interactive, “spoken-like” language.

The last group of words (c) which share important common traits in terms of their distribution across text genres and domains is that of words associated with politics, economy and public institutions. Examples of these nouns are government, recession and plaintiff. All of these are mainly featured in BNC texts that are classified as belonging to the domains ‘world affairs’, ‘social sciences’ or ‘commerce’, and occur both in academic and non-academic texts. As a category, this seems to overlap with the group of words related to public sphere issues which are typical of ukWaC. However, the specific vocabulary differs because the texts dealing with politics and economy in ukWaC seem to share a broad operative function, e.g. offering guidance or promoting a certain governmental program, as in the following examples:

OGC offers advice, guidance and support;

Local business support services include the recently established Sussex Business;

use Choice Advisers to provide practical support targeted at those parents.

Concordances reveal instead that, in the BNC, words like government or recession are more frequently featured in texts which comment on a given political or economic situation, as newspaper editorials would do, for example:

is urging the government to release all remaining prisoners of conscience;

Despite assurances from government officials that an investigation is underway;

a crucial challenge to the cornerstone of his government’s economic policy.

Lastly, a note of caution. The analysis in this Section has highlighted several lexical differences between ukWaC and the BNC. One should keep in mind, however, that a high log-likelihood score is an indicator of relative typicality in one corpus or the other. The noun eyes, for instance, appears as the 4th most typical noun of the BNC, even though its absolute frequency is nearly 15 times smaller than in ukWaC. Thus, the fact that a word is typical of the BNC does not imply that it is not equally well represented in ukWaC – note that ukWaC covers around 99% of the BNC vocabulary. Moreover, the method is well suited to highlighting strong asymmetries between the two corpora, but it conceals those features that make them similar. In future work, we intend to determine what kinds of text types or domains do not turn up as typical of either ukWaC or the BNC, and assess whether there is ground to conclude that they are similarly represented in both corpora.
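For reference, the coverage figure mentioned above corresponds to a simple set computation over the two corpora’s word lists; since the normalisation applied to word forms is not detailed here, the sketch simply takes the lists as given.

    def vocabulary_coverage(reference_vocab, other_vocab):
        """Share of the reference vocabulary (e.g. the BNC's) also attested in the other corpus."""
        reference = set(reference_vocab)
        return len(reference & set(other_vocab)) / len(reference)

    # e.g. vocabulary_coverage(bnc_wordlist, ukwac_wordlist) is around 0.99 per the figure above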