(NOTE: This is the pre-release draft of an article to appear in the summer issue of the Cambridge University Press publication Cambridge Connections. The final version of the article will be posted later.)
The New General Service List: A Core Vocabulary for EFL Students & Teachers
Dr. Charles Browne, Meiji Gakuin University
Dr. Brent Culligan, Aoyama Gakuin Women’s Junior College
Joseph Phillips, Aoyama Gakuin Women’s Junior College
The English language has a surprisingly large number of words. Even if we count words like ACCEPT, ACCEPTS, ACCEPTING and ACCEPTABLE as part of the same “word family”, there are still more than half a million words in English! Fortunately for teachers and students, language has built-in redundancy, with certain words occurring much more frequently than others (the word THE, for example, makes up 6-7% of all the words in any book, magazine or newspaper). Because of this, the average native speaker of English knows only a small percentage of these half million words (about 22,000 words for a recent college graduate).
Although 22,000 words may sound like a daunting number, there is more good news. Corpus linguistics, the science of analyzing large collections of texts, has shown that knowledge of just a few thousand of the most important words can give an astonishing degree of coverage of English used in daily life. In 1953, Michael West published a list of about 2000 important vocabulary words known as the General Service List (GSL). Based on more than two decades of pre-computer corpus research and a corpus size of 2.5 million to 5 million words, the GSL gives about 84% coverage of general English. However, as useful and helpful as this list has been to us over the decades, it has been criticized for (1) being based on a corpus that is both dated and small by modern standards and (2) not clearly defining what constitutes a “word”.
On the 60th anniversary of West’s publication of the GSL, we would like to announce the creation of a New General Service List (NGSL) that is based on a carefully selected 273-million-word subsection of the 1.6-billion-word Cambridge English Corpus (CEC). Following many of the same steps as West and his colleagues (as well as the suggestions of Professor Paul Nation, project advisor and a leading figure in modern second language vocabulary acquisition), we have tried to combine the strong objective scientific principles of corpus and vocabulary list creation with useful pedagogic insights to create a list of approximately 2800 high-frequency words, with the following goals:
- to update and expand the size of the corpus used (273 million words) compared to the limited corpus behind the original GSL (about 5 million words), with the hope of increasing the generalizability and validity of the list
- to create a NGSL of the most important high-frequency words for second language learners of English which gives the highest possible coverage of English texts with the fewest words
- to make a NGSL that is based on a clearer definition of what constitutes a word
- to be a starting point for discussion among interested scholars and teachers around the world, with the goal of updating and revising the list based on this input (in much the same way that West did with the original Interim version of the GSL)
The NGSL: A word list based on a large, modern corpus
Utilizing a range of computer-based corpus tools, we began developing the NGSL with an analysis of the Cambridge English Corpus (formerly known as the Cambridge International Corpus). The CEC is a 1.6-billion-word corpus of the English language that contains both written and spoken data of British and American English. The initial corpus was created using a subset of the 1.6-billion-word CEC that was queried and analyzed using the SketchEngine (2006). The size of each sub-corpus that was initially included is outlined in Table 1:
Table 1. CEC corpora used for preliminary analysis of NGSL
Corpus / Running Words
Newspaper / 748,391,436
Academic / 260,904,352
Learner / 38,219,480
Fiction / 37,792,168
Journals / 37,478,577
Magazines / 37,329,846
Non-Fiction / 35,443,408
Radio / 28,882,717
Spoken / 27,934,806
Documents / 19,017,236
TV / 11,515,296
Total / 1,282,909,322
Upon revision, the Newspaper and Academic corpora were removed from the compilation. The Newspaper corpus was removed because its enormous size (748,391,436 running words) dominated the total frequencies and it also showed a marked bias towards financial terms. The Academic corpus (260,904,352 running words) was removed because it represents a specific genre not directly related to general English. The final 273-million-word corpus is far more balanced as a result.
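The size of the final corpus can be checked directly from the figures in Table 1. The short Python sketch below simply sums the running-word counts and then leaves out the two removed sub-corpora; the variable names are chosen here for illustration only.

```python
# Running-word counts per sub-corpus, copied from Table 1.
sub_corpora = {
    "Newspaper": 748_391_436,
    "Academic": 260_904_352,
    "Learner": 38_219_480,
    "Fiction": 37_792_168,
    "Journals": 37_478_577,
    "Magazines": 37_329_846,
    "Non-Fiction": 35_443_408,
    "Radio": 28_882_717,
    "Spoken": 27_934_806,
    "Documents": 19_017_236,
    "TV": 11_515_296,
}

# Sub-corpora dropped from the final compilation.
removed = {"Newspaper", "Academic"}

initial_total = sum(sub_corpora.values())
final_total = sum(n for name, n in sub_corpora.items() if name not in removed)

print(f"{initial_total:,}")  # 1,282,909,322
print(f"{final_total:,}")    # 273,613,534
```

The remaining nine sub-corpora add up to roughly 273.6 million running words, which is the 273-million-word figure quoted throughout this article.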
The resulting word lists were then cleaned up by removing proper nouns, abbreviations, slang and other noise, and excluding certain word sets such as days of the week, months of the year and numbers. Then we used a series of computations to combine the frequencies from the various sub-corpora while adjusting for differences in their relative sizes. Based on a series of meetings and discussions with Paul Nation about how to improve the list, the combined list was then compared to other important lists such as the original GSL, the BNC and COCA to make sure important words were included/excluded as necessary.
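One common way to adjust for differences in sub-corpus size is to convert each raw count into a rate per million running words and then average those rates, so that every sub-corpus has an equal say regardless of how large it is. The sketch below illustrates only that general idea. The function name and the toy counts are invented for this example, and it should not be read as the exact computations used for the NGSL.

```python
def combined_per_million(counts, sizes):
    """Average each word's per-million frequency across sub-corpora.

    counts: {corpus_name: {word: raw_count}}
    sizes:  {corpus_name: running_words}
    Hypothetical helper for illustration; not the NGSL procedure itself.
    """
    words = set()
    for word_counts in counts.values():
        words.update(word_counts)

    combined = {}
    for word in words:
        # Normalize each raw count to occurrences per million running words.
        rates = [
            counts[name].get(word, 0) * 1_000_000 / sizes[name]
            for name in sizes
        ]
        combined[word] = sum(rates) / len(rates)
    return combined


# Toy data (invented): one large and one small sub-corpus.
sizes = {"big": 10_000_000, "small": 1_000_000}
counts = {
    "big": {"market": 3_000, "hello": 1_000},
    "small": {"market": 50, "hello": 400},
}

scores = combined_per_million(counts, sizes)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 1))
# hello 250.0
# market 175.0
```

In this toy data "market" has the higher raw total (3,050 against 1,400) only because it is common in the large sub-corpus. Once the counts are normalized, "hello" ranks higher, which is the kind of distortion the size adjustment is meant to avoid.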
The NGSL: More coverage for your money!
One of the important goals of this project was to develop a NGSL that would be more efficient and useful to language learners and teachers by providing more coverage with fewer words than the original GSL. For a meaningful comparison between the GSL and NGSL to be done, the words on each list need to be counted in the same way. A comparison of the number of “word families” in the GSL and NGSL reveals that there are 1964 word families in the former and 2368 in the latter (using level 6 of Bauer and Nation’s 1993 word family taxonomy). Coverage within the 273-million-word subsection of the CEC is summarized in Chart 1, showing that the 2368 word families in the NGSL provide 90.34% coverage while the 1964 word families in the original GSL provide only 84.24%. That the NGSL, with approximately 400 more word families, provides more coverage than the original GSL may not seem a surprising result. When the lists are lemmatized, however, the usefulness of the NGSL becomes more apparent: with about 800 fewer lemmas (2818 versus 3623), the NGSL provides 6.1 percentage points more coverage than West’s original GSL.
Chart 1. Coverage of the GSL and NGSL in the CEC
Vocabulary List / Number of “Word Families” / Number of “Lemmas” / Coverage in CEC Corpus
GSL / 1964 / 3623 / 84.24%
NGSL / 2368 / 2818 / 90.34%
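Coverage in this sense is the share of running words (tokens) in a text that belong to a given list. The following is a minimal sketch of that calculation. It assumes a toy word list and a toy sentence rather than the NGSL or CEC data, and it uses a deliberately naive tokenizer with no lemmatization or word-family grouping, so it only illustrates the arithmetic behind the percentages above.

```python
import re


def coverage(text, word_list):
    """Return the percentage of tokens in `text` that appear in `word_list`.

    Hypothetical helper for illustration: tokens are lowercase runs of
    letters and apostrophes, with no lemmatization.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in word_list)
    return 100 * known / len(tokens)


# Toy data (invented for this example).
toy_list = {"the", "cat", "sat", "on", "mat"}
toy_text = "The cat sat on the mat near the piano."

print(f"{coverage(toy_text, toy_list):.1f}%")  # 77.8%
```

Seven of the nine tokens in the toy sentence are on the toy list, giving 77.8% coverage. A real comparison would first map every token to its lemma or word family so that both lists are counted in the same way.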
Where to find the NGSL:
The list of 2818 words is now available for download, comments and debate from a new website we’ve dedicated to the development of this list:
It is our hope that this list will be of use to you and your students. Please join the discussion on the NGSL as we begin to present on it at academic conferences throughout the year, such as KOTESOL and the World Congress on Extensive Reading in Korea; JALT-CALL and JALT National in Japan; the Vocab@Vic Conference in New Zealand; and the AILA Conference in Australia in mid-2014. Later this year you will also be able to find the NGSL taught in a new course from Cambridge University Press, In Focus.
Bibliography
West, M. (1953). A General Service List of English Words. London: Longman, Green & Co.
Bauer, L., & Nation, I. S. P. (1993). Word Families. International Journal of Lexicography, 6(4), 253–279.
(This paper is a modified version of the article titled “The New General Service List: Celebrating 60 years of Vocabulary Learning”, published by Browne, C. in the July 2013 issue of JALT’s The Language Teacher.)