LIVAC Synchronous Corpus explained

LIVAC
Released:	July 1995
Developer:	Chilin (HK) Ltd.
Operating System:	Cross-platform
Language:	English, Traditional and Simplified Chinese
Genre:	Corpus
Latest Release Version:	V3.1
Latest Release Date:	Feb 2024

LIVAC is an uncommon language corpus that has been dynamically maintained since 1995. Unlike other existing corpora, LIVAC adopts a rigorous and regular "Windows" approach to processing and filtering massive volumes of media texts from representative Chinese speech communities such as Beijing, Hong Kong, Macau, Taipei, Singapore, Shanghai, Guangzhou and Shenzhen.^[1] The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Taiwan Strait news, and news on finance, sports and entertainment.^[2] By 2023, more than 3 billion characters of news media texts had been filtered, of which 700 million characters had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has also accumulated a large amount of accurate and meaningful statistical data on the Chinese language and its diverse speech communities in the Pan-Chinese context, and the results show considerable and important long-standing as well as evolving variations.^[3] ^[4]

The "Windows" approach is the most innovative feature of LIVAC and has enabled Pan-Chinese media texts to be quantitatively analyzed according to attributes such as location, time and subject domain. This has made possible various types of comparative studies, applications in information technology, and the development of related innovative applications.^[5] ^[6] Moreover, LIVAC allows longitudinal developments to be taken into account, facilitating Key Word in Context (KWIC) search and comprehensive study of target words, their underlying concepts and their linguistic structures over the past 25 years, based on the above-mentioned variables of location, time and subject. Results from the extensive and cumulative data analysis in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names and new words, as well as bi-weekly and annual rosters of media figures. Related applications have included the establishment of verb and adjective databases, the formulation of sentiment indices and related opinion mining to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed the Pan-Chinese Newsmaker Rosters),^[7] ^[8] ^[9] ^[10] ^[11] and the compilation of new word databases (LIVAC Annual Pan-Chinese New Word Rosters).^[12] ^[13] ^[14] ^[15] ^[16] On this basis, analysis of the emergence, diffusion and transformation of new words and the publication of dictionaries of neologisms have been made possible.^[17] ^[18]
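As a rough illustration of how a KWIC search restricted by location, time and subject domain might work, the following minimal Python sketch filters already-segmented samples by those three attributes and prints each keyword hit with its neighbouring words. The `Sample` record, its field names and the sample sentences are assumptions made for this example; they are not LIVAC's actual data model or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One segmented text sample with its attributes (hypothetical schema)."""
    location: str   # speech community, e.g. "Hong Kong"
    year: int       # year of publication
    domain: str     # subject domain, e.g. "finance"
    tokens: tuple   # words produced by an earlier segmentation step


def kwic(samples, keyword, width=2, location=None, years=None, domain=None):
    """Return (location, year, left context, keyword, right context) rows.

    Filters left as None are not applied; `years` is an inclusive (start, end) pair.
    """
    rows = []
    for s in samples:
        if location is not None and s.location != location:
            continue
        if years is not None and not (years[0] <= s.year <= years[1]):
            continue
        if domain is not None and s.domain != domain:
            continue
        for i, tok in enumerate(s.tokens):
            if tok == keyword:
                left = " ".join(s.tokens[max(0, i - width):i])
                right = " ".join(s.tokens[i + 1:i + 1 + width])
                rows.append((s.location, s.year, left, tok, right))
    return rows


# Invented toy samples, used only to exercise the function.
samples = [
    Sample("Hong Kong", 2003, "finance", ("股市", "今日", "大幅", "上升")),
    Sample("Beijing", 2003, "finance", ("股市", "今天", "大幅", "上漲")),
    Sample("Beijing", 2015, "sports", ("球隊", "大幅", "領先")),
]

for row in kwic(samples, "大幅", width=1, years=(2000, 2010)):
    print(row)
```

With the year filter set to 2000 to 2010, only the two 2003 samples are searched, so the same keyword can be compared across communities within one time window.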

Recent work has focused on the relative balance between disyllabic words and the growing number of trisyllabic words in the Chinese language,^[19] the comparative study of light verbs in three Chinese speech communities,^[20] and language use as a reflection of epochal change in China.^[21] LIVAC version 3.1 was launched in February 2024.
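The balance between disyllabic and trisyllabic words can be quantified in a simple way once a word list is available. The sketch below is only an illustration, not the method used in the cited studies: it assumes that every word consists solely of Chinese characters and that each character counts as one syllable, so that word length in characters stands in for syllable count. The word list is an invented toy example.

```python
from collections import Counter


def syllable_profile(word_types):
    """Return the share of word types at each length (in characters).

    Assumption: one Chinese character = one syllable, so a two-character
    word is treated as disyllabic and a three-character word as trisyllabic.
    """
    lengths = Counter(len(word) for word in word_types)
    total = sum(lengths.values())
    return {n: lengths[n] / total for n in sorted(lengths)}


# Invented toy word list; a real study would use the word types of one
# speech community in one time window.
words = ["經濟", "手機", "互聯網", "智能", "高鐵", "微信", "朋友圈", "二維碼"]
profile = syllable_profile(words)
print(profile)
```

Running the same function on the word lists of successive years would show whether the trisyllabic share is rising relative to the disyllabic share.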

Corpus data processing

Accessing media texts, manual input, etc.
Text unification, including conversion from simplified to traditional Chinese characters, with texts stored in Big5 and Unicode versions
Automatic word segmentation
Automatic alignment of parallel texts
Manual verification, part-of-speech tagging
Extraction of words and addition to regional sub-corpora
Combination of regional sub-corpora to update the LIVAC corpus and the master lexical database
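The steps above can be sketched as a toy pipeline in Python. This is a minimal illustration under stated assumptions, not LIVAC's implementation: the character table and lexicon are tiny hand-made stand-ins, and forward maximum matching is used for segmentation only because it is a well-known baseline (the source does not say which segmentation algorithm LIVAC uses). The manual verification, alignment and tagging steps are omitted.

```python
from collections import Counter

# Toy simplified-to-traditional table. Assumption: a real table has
# thousands of entries and must resolve one-to-many mappings, which
# this sketch ignores.
S2T = str.maketrans({"涨": "漲", "队": "隊", "领": "領"})

# Toy lexicon standing in for the master lexical database.
LEXICON = {"股市", "大幅", "上漲", "球隊", "領先"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def unify(text):
    """Text unification: map simplified characters to traditional ones."""
    return text.translate(S2T)


def segment(text):
    """Forward maximum matching: take the longest lexicon word at each position.

    A single character is emitted as a fallback when no lexicon word matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words


def build_corpus(raw_texts_by_region):
    """Count words per regional sub-corpus, then merge into a master count."""
    sub_corpora = {region: Counter() for region in raw_texts_by_region}
    for region, texts in raw_texts_by_region.items():
        for text in texts:
            sub_corpora[region].update(segment(unify(text)))
    master = Counter()
    for counts in sub_corpora.values():
        master.update(counts)
    return sub_corpora, master


# Invented sample texts, one list per speech community.
raw = {
    "Beijing": ["股市大幅上涨", "球队大幅领先"],
    "Hong Kong": ["股市大幅上漲"],
}
sub_corpora, master = build_corpus(raw)
print(master.most_common(3))
```

Keeping one counter per region and merging afterwards mirrors the last two steps of the list: words are first added to regional sub-corpora, which are then combined to update the overall corpus.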

Labeling for data curation

Categories used include general terms and proper names, such as: general names, surnames, semi-titles; geographical names, organizations, commercial entities, etc.; time, prepositions, locations, etc.; stack-words; loanwords; case-words; numerals, etc.
Construction of databases of proper names, place names, and specific terms, etc.
Generation of rosters: "new word rosters", "celebrity or media personality rosters", "place name rosters", as well as compound words and matched words
Part-of-speech tagging for sub-databases, covering common nouns, numerals, numeral classifiers, different types of verbs and adjectives, pronouns, adverbs, prepositions, conjunctions, mood particles, onomatopoeia, interjections, etc.

Applications

Compilation of Pan-Chinese dictionaries or local dictionaries
Information technology research, such as predictive Chinese text input for mobile phones, automatic speech to text conversion, opinion mining
Comparative studies on linguistic and cultural developments in the Pan-Chinese regions, especially during a critical period in the history of modern China
Language teaching and learning research
Customized service on linguistic research and lexical search for international corporations and government agencies

The above applications are supported by the following functions:

Word Segmentation Search

Phrase Search

Example Sentence Selection

Multi-word Comparison

Word Cloud
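A function such as Multi-word Comparison presumably has to put counts from sub-corpora of different sizes on a common scale. The following minimal Python sketch shows one conventional way to do so, by normalizing raw counts to occurrences per million tokens. The function names, the choice of normalization and all figures are assumptions for illustration; the counts are invented placeholders, not real LIVAC statistics.

```python
def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens


def compare_words(words, counts_by_region, totals_by_region):
    """Return {word: {region: frequency per million tokens}}."""
    return {
        word: {
            region: per_million(counts.get(word, 0), totals_by_region[region])
            for region, counts in counts_by_region.items()
        }
        for word in words
    }


# Invented placeholder counts, NOT real LIVAC figures.
counts_by_region = {
    "Hong Kong": {"巴士": 120, "公共汽車": 4},
    "Beijing": {"巴士": 15, "公共汽車": 90},
}
totals_by_region = {"Hong Kong": 2_000_000, "Beijing": 3_000_000}

table = compare_words(["巴士", "公共汽車"], counts_by_region, totals_by_region)
for word, row in table.items():
    print(word, row)
```

Normalizing before comparing matters because the sub-corpora need not be the same size; raw counts alone would overstate usage in the larger sub-corpus.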

References

Tsou, Benjamin; Lai, Tom; Chan, Samuel; and Wang, William S.-Y. (Eds). (1998). Quantitative and Computational Studies on the Chinese Language 《漢語計量與計算研究》. Language Information Sciences Research Centre, City University Press.
Tsou, B. K., Kwong, O.Y. (Eds). (2015). Linguistic Corpus and Corpus Linguistics in the Chinese Context (Journal of Chinese Linguistics Monograph Series Number 25), Hong Kong: Chinese University Press.
Tsou, Benjamin. (2004). "Chinese Language Processing at the Dawn of the 21st Century", in C R Huang and W Lenders (eds) Language and Linguistics Monograph Series B: Frontiers in Linguistics I, pp.189–207. Institute of Linguistics, Academia Sinica.
Tsou, B. K. (2017). Loanwords in Mandarin Through Other Chinese Dialects. In R. Sybesma, W. Behr, Y. Gu, Z. Handel, C.-T. Huang & J. Myers (Eds.), The Encyclopaedia of Chinese Language and Linguistics (Vol. 2, pp. 641-647). Leiden; Boston: BRILL
Tsou, Benjamin, and Kwong, Olivia. (2015). LIVAC as a Monitoring Corpus for Tracking Trends beyond Linguistics. In Tsou, Benjamin, and Kwong, Olivia., (eds.), Linguistic Corpus and Corpus Linguistics in the Chinese Context (Journal of Chinese Linguistics Monograph Series No.25). Hong Kong: The Chinese University Press, pp. 447-471.
Tsou, Benjamin. (2016). Skipantism Revisited: Along with Neologisms and Terminological Truncation. In Chin, Chi-on Andy and Kwok, Bit-chee and Tsou, Benjamin K., (eds.), Commemorative Essays for Professor Yuen-Ren Chao: Father of Modern Chinese Linguistics. Taiwan: Crane Publishing. pp. 343-357.
http://wikisites.cityu.edu.hk/sites/newscentre/en/Pages/201512281400.aspx CityU releases 2015 LIVAC Pan-Chinese Media Personality Roster
http://wikisites.cityu.edu.hk/sites/media/pr/Pages/2017010201.aspx CityU releases 2016 LIVAC Pan-Chinese Media Personality Roster
https://www.cityu.edu.hk/zh-hk/media/press-release/2020/01/07/chengdagongbu2019nianfanhuadequlivacxinwenrenwubang-chinese-version-only CityU releases 2019 LIVAC Pan-Chinese Media Personality Roster
Pan-Chinese top newsmakers of 2020. City University of Hong Kong. 13 January 2021. Accessed 18 January 2021.
A Big Database Approach to 2 Decades of LIVAC Pan-Chinese Newsmaker Rosters. Chilin.hk. 20 January 2023.
http://wikisites.cityu.edu.hk/sites/newscentre/en/Pages/201502121400.aspx CityU releases 2014 Pan-Chinese New Word Rosters
http://wikisites.cityu.edu.hk/sites/newscentre/en/pages/201602041130.aspx CityU releases 2015 LIVAC Pan-Chinese New Word Rosters
https://www.cityu.edu.hk/zh-hk/media/press-release/2020/01/09/buzz-words-2019-released-cityus-livac-pan-chinese-linguistic-database CityU releases 2019 LIVAC Pan-Chinese New Word Rosters
New Chinese Buzz words for 2020 released by LIVAC Pan-Chinese linguistic database. City University of Hong Kong. 18 January 2021.
New Chinese Buzz words for 2021 released by CityU. City University of Hong Kong. 20 January 2023.
鄒嘉彥、游汝杰（編）（2007），《21世紀華語新詞語詞典》（簡體字版），上海，復旦大學出版社。
鄒嘉彥、游汝杰（編）（2010），《全球華語新詞語詞典》，北京，商務印書館。
鄒嘉彥（2019）， "泛華語地區多音節詞的近20年發展：從LIVAC大數據庫探討 (Developments of polysyllabic words in Pan-Chinese in the recent decades: Investigation based on LIVAC Big Database)"，《漢語歷史詞彙語法國際學術研討會(International Conference of Historical Investigations into Chinese words and Grammar)》，北京大學。
Tsou, Benjamin K., and Ka-Fai Yip. "A corpus-based comparative study of light verbs in three Chinese speech communities." Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. 2020.
Tsou, B. K. (2022). Some Salient as well as Divergent and Convergent Linguistic Developments in Chinese - A Big Data and Trans-Millennial Approach. The 28th Annual Conference of the International Association of Chinese Linguistics [Keynote Speech], Hong Kong.

External links

- Chilin (HK)'s website