A word list (or lexicon) is a list of a language's words (generally sorted by frequency of occurrence, either by levels or as a ranked list) drawn from some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, and gigantic analyses were still being done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated the research field.
In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.
Type | Occurrences | Rank
---|---|---
the | 3,789,654 | 1st
he | 2,098,762 | 2nd
[...] | |
king | 57,897 | 1,356th
boy | 56,975 | 1,357th
[...] | |
stringyfy | 5 | 34,589th
[...] | |
transducionalify | 1 | 123,567th
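As an illustration of how such a list can be derived, the following minimal sketch counts word types in a tokenized corpus and assigns ranks by position. The function name `frequency_list` and the toy corpus are assumptions made for this example; ties are broken alphabetically, which is only one of several possible conventions.

```python
from collections import Counter

def frequency_list(tokens):
    """Return (type, occurrences, rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n, rank) for rank, (word, n) in enumerate(ordered, start=1)]

# Toy corpus, invented for this example.
toy_corpus = "the king saw the boy and the boy saw the king and he left".split()
rows = frequency_list(toy_corpus)
for word, n, rank in rows:
    print(f"{word}\t{n}\t{rank}")
```

On a real corpus the tokens would come from a tokenizer suited to the language, since the counts depend directly on what is treated as a word.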
Nation noted the incredible help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues which influence the construction of frequency lists.
Most currently available studies are based on written text corpora, which are more easily available and easier to process.
However, it has been proposed to tap into the large number of subtitles available online in order to analyse large amounts of speech. A long critical evaluation of the traditional textual analysis approach also supported a move toward speech analysis and the analysis of film subtitles available online. This has recently been followed by a handful of follow-up studies,[1] providing valuable frequency count analysis for various languages. Indeed, within five years the SUBTLEX movement completed full studies for French, American English, Dutch, Chinese, Spanish, Greek, Vietnamese, Brazilian Portuguese and European Portuguese, Albanian, Polish and Catalan (2019[2]). SUBTLEX-IT (2015) provides raw data only.[3]
In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated either by spaces or punctuation. But exceptions can arise, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under its base word. Thus, possible, impossible and possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, the occurrences of all these words are summed under the base word form *possib*, allowing both the concept and its individual forms to be ranked. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given chain of several characters can be interpreted either as a phrase of single-character words or as a multi-character word.
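The effect of these choices can be sketched in a few lines. The example below is a minimal illustration, not a reference tokenizer: the regular expression `TOKEN_RE` keeps internal apostrophes and hyphens so that "can't" and "aujourd'hui" remain single tokens, and `WORD_FAMILIES` is a hypothetical hand-made mapping standing in for a real word-family resource.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by internal apostrophes or hyphens,
# so that "can't" and "aujourd'hui" stay single tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")

# Hypothetical mapping from word forms to the base word of their family.
WORD_FAMILIES = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def family_counts(text):
    """Count occurrences, summing family members under their base word."""
    return Counter(WORD_FAMILIES.get(tok, tok) for tok in tokenize(text))

sample = "It can't be impossible: every possibility is possible aujourd'hui."
tokens = tokenize(sample)
counts = family_counts(sample)
print(tokens)
print(counts["possib"])
```

This approach does not address idioms or scripts without word separators such as Chinese, which require a dedicated segmentation step.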
It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
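Zipf's law states that the frequency of an item is roughly inversely proportional to its rank, so a plot of log frequency against log rank should be close to a straight line with slope near -1. The sketch below estimates that slope by ordinary least squares; the function name `zipf_slope` and the synthetic counts are assumptions for illustration, and real corpora only approximate the ideal value.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic counts that follow f(r) = 1,000,000 / r exactly (not real data).
ideal = [1_000_000 / rank for rank in range(1, 1001)]
slope = zipf_slope(ideal)
print(round(slope, 6))
```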
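German linguists define the Häufigkeitsklasse (frequency class) $N$ of an item in the list as

$$N=\left\lfloor 0.5-\log_2\left(\frac{\text{frequency of this item}}{\text{frequency of the most frequent item}}\right)\right\rfloor$$

where $\lfloor\ldots\rfloor$ is the floor function.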
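As a worked example, the frequency class can be computed directly, assuming the usual definition: the floor of 0.5 minus the base-2 logarithm of the ratio between the item's frequency and that of the most frequent item. The function name `frequency_class` is chosen for this sketch; the counts are those of the example table above.

```python
import math

def frequency_class(item_count, top_count):
    """Frequency class N = floor(0.5 - log2(item_count / top_count))."""
    return math.floor(0.5 - math.log2(item_count / top_count))

TOP = 3_789_654  # occurrences of "the", the most frequent item in the table

for word, count in [("the", 3_789_654), ("he", 2_098_762),
                    ("king", 57_897), ("transducionalify", 1)]:
    print(word, frequency_class(count, TOP))
```

Under this definition the most frequent item falls in class 0, and each further class corresponds to roughly a halving of frequency.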
Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
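The idea can be sketched as follows, under strong simplifying assumptions: `RANK` is a toy frequency list, `HYPERNYM` is a toy single-parent semantic network, and `max_rank` is an arbitrary cut-off, all invented for this example. A real system would rely on a full semantic network and a corpus-derived frequency list.

```python
def compress(tokens, rank_of, hypernym_of, max_rank):
    """Replace tokens rarer than max_rank with a more common hypernym."""
    def is_rare(word):
        rank = rank_of.get(word)
        return rank is None or rank > max_rank

    result = []
    for tok in tokens:
        seen = set()
        # Climb the hypernym chain until a common enough term is reached,
        # the chain ends, or a cycle is detected.
        while is_rare(tok) and tok in hypernym_of and tok not in seen:
            seen.add(tok)
            tok = hypernym_of[tok]
        result.append(tok)
    return result

# Hypothetical toy data, not taken from any real corpus or network.
RANK = {"the": 1, "dog": 900, "animal": 1200, "barked": 3000, "spaniel": 40000}
HYPERNYM = {"spaniel": "dog", "dog": "animal"}

compressed = compress("the spaniel barked".split(), RANK, HYPERNYM, max_rank=5000)
print(compressed)
```

Words that are rare but have no known hypernym are left unchanged in this sketch.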
These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors. Paul Nation's modern language teaching summary encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion".
Word frequency is known to have various effects. Memorization is positively affected by higher word frequency, likely because the learner is subject to more exposures. Lexical access is positively influenced by high word frequency, a phenomenon called the word frequency effect. The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.
Below is a review of available resources.
Word counting is an ancient field,[4] with known discussions dating back to Hellenistic times. In 1944, Edward Thorndike, Irvin Lorge and colleagues[5] hand-counted 18,000,000 running words to provide the first large-scale English language frequency list, before modern computers made such projects far easier. All 20th-century works suffer from their age, in particular for words relating to technology. For example, "blog", which in 2014 ranked #7665 in frequency[6] in the Corpus of Contemporary American English,[7] was first attested in 1999,[8] [9] [10] and does not appear in any of these three lists.
The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[16] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".
More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open CC BY-SA 4.0 license.[17]
See main article: Most common words in Spanish.
There have been several studies of Spanish word frequency.[18]
Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency. American sinologist John DeFrancis mentioned its importance for the learning and teaching of Chinese as a foreign language in Why Johnny Can't Read Chinese. As frequency toolkits, Da and the Taiwanese Ministry of Education provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words, are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, a rich study of Chinese word and character frequencies was recently made.
Wiktionary:Frequency lists contains frequency lists in more languages.
Most frequently used words in different languages based on Wikipedia or combined corpora.