Chinese character frequency is the frequency with which characters are used in written Chinese. It is calculated on a corpus, i.e., a collection of texts representing one or more languages. The frequency of a character is the ratio of the number of its occurrences to the total number of characters in the corpus:
Fi = ni / N
where ni is the number of times character i appears in the corpus, and N is the total number of (occurrences of) characters in the corpus.
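As a minimal sketch of this calculation, the following Python snippet counts the characters of a tiny made-up corpus. The function names and the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are assumptions of this example, not the method of any particular survey.

```python
from collections import Counter


def is_han(ch: str) -> bool:
    """Rough test for a Chinese character (assumption: basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(texts):
    """Return {character: Fi}, where Fi = ni / N over the Han characters in texts."""
    counts = Counter(ch for text in texts for ch in text if is_han(ch))
    total = sum(counts.values())  # N: total character occurrences
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


# Tiny illustrative corpus, made up for this example.
corpus = ["我的书", "他的人"]
freqs = char_frequencies(corpus)
```

In this toy corpus 的 accounts for two of the six character occurrences, so its frequency is 1/3. Punctuation and Latin letters are skipped by the `is_han` filter.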
Chinese character frequency is fundamental to the quantitative linguistics of Chinese and is a useful reference for Chinese language teaching and information processing.
The first person to make a serious statistical study of the frequency of Chinese characters was Chen Heqin. In the 1920s, he and his assistants spent over two years manually counting and comparing the characters in a corpus of six categories of texts. The corpus contained 554,478 character occurrences in total, covering 4,261 different characters. They then compiled a book entitled Applied Lexis of Vernacular Chinese. The 10 most frequently used characters in their corpus are, in descending order of frequency,
的 (of), 不 (no, not), 一 (one, a(n)), 了 (perfective particle), 是 (to be), 我 (I/me), 上 (on, up), 他 (he/him), 有 (to have), 人 (person).
In 2001, the Chinese University of Hong Kong (CUHK) published a number of frequency lists on the Web, entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a Trans-regional Diachronic Survey". The frequency data came from a large corpus made up of sub-corpora representing the Chinese used in three regions (Hong Kong, mainland China and Taiwan) and in two time periods (the 1960s and the 1980s/1990s). Each sub-corpus consists of approximately 660,000 characters, making a total of 3,970,514 characters for the whole corpus. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists.
From the data of these frequency lists, some important and interesting features of Chinese can be discovered:
The top 10 characters in the 1980s/1990s frequency lists for the three regions are: Hong Kong: 的, 一, 是, 不, 人, 有, 在, 了, 我, 中; Taiwan: 的, 一, 是, 不, 人, 在, 有, 我, 了, 中; Mainland: 的, 一, 是, 了, 不, 在, 有, 人, 我, 他.
More information can be found in the English Users' Guide on the survey's home page.
Most of the frequency studies described above cover the general usage of Chinese characters. There are also frequency counts for particular domains, such as news reporting, literature and art, or information technology.
There are also frequency lists for linguistic divisions. Polyphonic characters may be counted separately for each pronunciation, for example, the frequencies for 的 (de), 的 (di1), 的 (di2) and 的 (di4). Polysemous characters may be counted separately for each meaning, for example, 里 (裡裏, inside) and 里 (里, 0.5 km). There are also frequencies for different parts of speech, for example, 花(n) and 花(v). These divisions can also be combined.
Chinese character frequency is essential to quantitative research on Chinese characters, and has been applied to language teaching, dictionary compilation, the compilation of character lists, Chinese character information processing, etc.
Actual usage of Chinese characters is concentrated on a relatively small set of frequently used characters. Drawing on frequency statistics from various sources, Zhou Youguang summarized the utility decline rate of Chinese characters. Its basic content is:
The coverage rate of the 1,000 most frequently used characters on the corpus is about 90%, which means the missing rate is about 10%. For each additional 1,400 next-most-frequent characters, the missing rate falls to 10% of its previous value. For example, the missing rate of the 1,000 + 1,400 = 2,400 most frequently used characters is approximately 10% × 10% = 1% of the corpus, which means the coverage rate is 99%. The missing rate of the 2,400 + 1,400 = 3,800 most frequently used characters is about 1% × 10% = 0.1%, and the coverage rate is 99.9%. The rule is supported by later experimental results as well, such as:
Most frequent characters | Cumulative occurrences | Coverage (%)
---|---|---
100 | 782,866 | 42.14
500 | 1,439,352 | 77.48
1,000 | 1,681,228 | 90.50
2,000 | 1,817,047 | 97.81
3,000 | 1,848,648 | 99.51
4,000 | 1,856,226 | 99.92
4,868 | 1,857,660 | 100
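The rule can be compared with the table numerically. The sketch below is only an illustration: the rule itself gives values at 1,000, 2,400, 3,800, ... characters, and the geometric interpolation between those points (as well as the function name `rule_missing_rate`) is an assumption made here so that the table's 2,000, 3,000 and 4,000 rows can be checked.

```python
def rule_missing_rate(n_chars: int) -> float:
    """Missing rate predicted by the rule for the n_chars most frequent characters.

    Assumption: between the stated points (1,000, 2,400, 3,800, ...) the rate
    is interpolated geometrically; the rule only states the points themselves.
    """
    if n_chars < 1000:
        raise ValueError("the rule is stated only for 1,000 characters or more")
    return 0.1 ** (1 + (n_chars - 1000) / 1400)


# (number of most frequent characters, cumulative occurrences) from the table
table = [
    (1000, 1_681_228),
    (2000, 1_817_047),
    (3000, 1_848_648),
    (4000, 1_856_226),
]
TOTAL = 1_857_660  # occurrences covered by all 4,868 characters

for n, occurrences in table:
    observed = 100 * occurrences / TOTAL           # coverage from the table
    predicted = 100 * (1 - rule_missing_rate(n))   # coverage from the rule
    print(f"{n:>5}: observed {observed:.2f}%, rule {predicted:.2f}%")
```

The observed column reproduces the percentages in the table, and the predicted column shows how closely the interpolated rule tracks them.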
The basic content of the decreasing rate of strokes in frequently used characters is:
The application rate of a character is inversely related to its number of strokes; that is, characters with high application rates have fewer strokes on average. This is supported by the data in the article on stroke numbers. According to its second and third tables, the average number of strokes of the 3,500 frequently used characters is 9.74, and the average number of strokes of the 7,000 commonly used characters (a superset of the 3,500 characters) is 10.75. In other words, frequently used characters generally have fewer strokes than less frequently used characters.
The reason is convenience of writing. If a character with many strokes is used frequently, people will try to simplify it. If there are multiple variant characters with the same function, then, other things being equal, the one with fewer strokes is more likely to be used.
When determining the importance of a character, in addition to frequency of use, it is often necessary to consider distribution rate. The formula for calculating distribution rate is
Di = ti / T
where Di is the distribution rate of character or word i, ti is the number of texts in which the character or word appears, and T is the total number of texts in the corpus.
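The distribution rate can be sketched in a few lines of Python. The function name `distribution_rates` and the sample texts are made up for illustration, and the texts are assumed to have been stripped of punctuation already.

```python
def distribution_rates(texts):
    """Return {character: Di}, where Di = ti / T."""
    total_texts = len(texts)  # T: total number of texts
    if total_texts == 0:
        return {}
    text_counts = {}  # ti: number of texts containing each character
    for text in texts:
        for ch in set(text):  # set(): each text counts at most once per character
            text_counts[ch] = text_counts.get(ch, 0) + 1
    return {ch: t / total_texts for ch, t in text_counts.items()}


# Three tiny texts, made up for this example.
texts = ["的的的的人", "我的", "他人"]
rates = distribution_rates(texts)
```

Here 的 occurs four times in the first text but still counts only once for that text, so its distribution rate is 2/3, the same as that of 人.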
Application rate is a combination of distribution rate and frequency. A newer calculation formula is:
Ui=(Fi*Di)/Σ(j=1 to n)(Fj*Dj)
where Ui is the application rate of character i, Fi is the frequency of character i, Di is the distribution rate of character i, and n represents the total number of characters. Because of the normalization by the sum, the cumulative application rate approaches 1 as more characters are included and equals 1 over all n characters.
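A minimal sketch of this formula in Python follows. The function name `application_rates` and the input values are hypothetical; the characters 甲, 乙 and 丙 are used only as placeholders and the numbers are not taken from any survey.

```python
def application_rates(freq, dist):
    """Return {character: Ui} with Ui = (Fi * Di) / sum over j of (Fj * Dj)."""
    weighted = {ch: freq[ch] * dist[ch] for ch in freq}
    norm = sum(weighted.values())
    if norm == 0:
        return {ch: 0.0 for ch in freq}
    return {ch: w / norm for ch, w in weighted.items()}


# Hypothetical values for three placeholder characters.
freq = {"甲": 0.5, "乙": 0.3, "丙": 0.2}  # Fi: frequency
dist = {"甲": 1.0, "乙": 0.5, "丙": 1.0}  # Di: distribution rate
rates = application_rates(freq, dist)
```

In this example 乙 is more frequent than 丙 but appears in only half of the texts, so its application rate ends up lower than that of 丙.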
Large-scale surveys by the Ministry of Education and the State Language Commission of the PRC over the years have shown that the use of Chinese characters and words follows a strongly regular distribution. The number of different characters used in modern Chinese is stable at about 12,000, and the number of different words has stabilized at around 2.5 million.[1]
The number of most frequently used characters needed for a coverage rate of 80%, 90%, and 99% is about 590, 940, and 2,400 respectively. The number of words needed for coverage rates of 80%, 90%, 95%, and 99% is about 4,900, 14,000, 32,000, and 241,000 respectively. Words whose frequency of use changes most from the previous year reflect the hot topics of social life and media attention in that year.
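Coverage figures of this kind are obtained by ranking characters (or words) by count and accumulating until the target share is reached. The following Python sketch shows the idea; the function name `chars_needed` and the counts are invented for illustration and do not reproduce the survey data.

```python
from collections import Counter


def chars_needed(counts, coverage):
    """Smallest number of top-ranked items whose cumulative share reaches coverage."""
    total = sum(counts.values())
    if total == 0 or not 0 < coverage <= 1:
        raise ValueError("need a non-empty count table and 0 < coverage <= 1")
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n  # occurrences covered by the top `rank` items
        if covered >= coverage * total:
            return rank
    return len(counts)


# Invented counts for four characters (10 occurrences in total).
counts = Counter({"的": 5, "一": 3, "是": 1, "不": 1})
top_for_half = chars_needed(counts, 0.5)
top_for_three_quarters = chars_needed(counts, 0.75)
```

With these counts the single most frequent character already covers half of the occurrences, while reaching 75% requires the top two.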