List of text corpora

Text corpora (singular: text corpus) are large, structured sets of texts that have been systematically collected. They are used by corpus linguists and in other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.[1]
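
As an illustration of the kind of statistical analysis mentioned above, the following minimal Python sketch counts word frequencies over a toy two-sentence corpus. The sample sentences, the function names (tokenize, word_counts, relative_frequency) and the simple regular-expression tokenizer are assumptions made for this example only; real corpus work usually relies on dedicated tokenizers and far larger collections.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(texts):
    """Count how often each token occurs across all texts in the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def relative_frequency(counts, word):
    """Return the share of all tokens in the corpus that are `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Hypothetical toy corpus, used only to demonstrate the functions above.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

counts = word_counts(toy_corpus)
print(counts.most_common(1))              # [('the', 4)]
print(relative_frequency(counts, "sat"))  # share of tokens that are "sat"
```

Relative rather than raw frequencies are the usual choice when comparing corpora of different sizes, since they do not grow with the amount of text collected.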

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern languages

Devanagari

East Asian languages

South Asian languages

African languages

Parallel corpora of diverse languages

Comparable corpora

L2 (English) corpora

See also

Notes and references

  1. Leech, Geoffrey (2007). "Teaching and language corpora: a convergence". In Wichmann, A.; et al. Teaching and Language Corpora. London: Longman. p. 9.
  2. Corpus Resource Database (CoRD). Department of English, University of Helsinki.
  3. Wahle, Jan Philip; Ruas, Terry; Mohammad, Saif; Gipp, Bela (2022). "D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research". Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association. pp. 2642–2651. arXiv:2204.13384.
  4. Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
  5. PhraseFinder. A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  6. https://www.ehu.eus/en/web/eins/goenkale-corpusa
  7. Molinolabs - corpus. molinolabs.com. 12 January 2014.
  8. CorALit – CorALit - Lietuvių mokslo kalbos tekstynas. coralit.lt. 12 January 2014.
  9. Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage. tnc.org.tr. 12 January 2014.
  10. Glazkova, A. (2020). "Topical Classification of Text Fragments Accounting for Their Nearest Context". Automation and Remote Control. 81 (12): 2262–2276. doi:10.1134/S0005117920120097. S2CID 231929892.
  11. Rubtsova, Yu. (2015). "Constructing a corpus for sentiment classification training". Software & Systems. 1. pp. 72–78. doi:10.15827/0236-235X.109.072-078.
  12. Under Update. search.dcl.bas.bg. 12 January 2014.
  13. Електронски корупус на македонски книжевни текстови.
  14. Portál | Český národní korpus.
  15. Erjavec, Tomaž; Barbu, Ana-Maria; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Krstev, Cvetana; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; Qasemizadeh, Behrang; Radziszewski, Adam; Simov, Kiril; Tufiş, Dan; Zdravkova, Katrina (2010-05-14). Available from CLARIN. http://nl.ijs.si/me/v4/.
  16. University of Tehran NLP Lab. ece.ut.ac.ir. 12 January 2014. Archived at https://web.archive.org/web/20140128101521/http://ece.ut.ac.ir/NLP/ on 28 January 2014.
  17. Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
  18. KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言. kotonoha.gr.jp. 12 January 2014.
  19. Download Corpora Hindi.
  20. D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
  21. Glossa (uio.no)
  22. The Gulf of Guinea Creole Corpora. May 2014. pp. 523–529.
  23. https://arxiv.org/pdf/2102.06991.pdf, https://wortschatz.uni-leipzig.de/en/download/Hausa
  24. IgTenTen – Igbo corpus from the web | Sketch Engine. 20 June 2022.
  25. Oromo text corpora | Sketch Engine. 15 January 2019.
  26. https://www.researchgate.net/publication/336274457_Digital_Yoruba_Corpus, https://www.sketchengine.eu/corpora-and-languages/yoruba-text-corpora/
  27. Download Corpora Zulu.
  28. Pan, Jun (2019). The Chinese/English Political Interpreting Corpus (CEPIC). Hong Kong Baptist University Library. January 3, 2022.
  29. Pan, Jun (2019-10-30). "The Chinese/English Political Interpreting Corpus (CEPIC): A New Electronic Resource for Translators and Interpreters". Proceedings of the Second Workshop Human-Informed Translation and Interpreting Technology Associated with RANLP 2019. Incoma Ltd., Shoumen, Bulgaria. pp. 82–88. doi:10.26615/issn.2683-0078.2019_010. S2CID 211257773.
  30. EUR-Lex Corpus. 2 June 2016. sketchengine.co.uk. 27 October 2016.
  31. OPUS - an open source parallel corpus. opus.lingfil.uu.se. 12 January 2014.
  32. Tatoeba - Number of sentences per language. tatoeba.org. 23 November 2020.
  33. Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)". International Journal of Asian Language Processing. 22 (4): 161–174. 12 January 2014. Archived at https://web.archive.org/web/20140116120131/http://www.colips.org/journal/volume22/22.4.2.NTU-MC%20Tan%20final.pdf on 16 January 2014.
  34. Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  35. H. Sanjurjo-González and M. Izquierdo. 2019. P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.
  36. Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  37. Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. Vol. 7499. pp. 3–15. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6. CiteSeerX 10.1.1.452.8074.
  38. Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTenTen: a new, vast corpus for Arabic. Proceedings of WACL.
  39. Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia - Social and Behavioral Sciences, 95, 12-19.
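  40. Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов [A survey of large Russian-language text corpora]. In Материалы научной конференции "Интернет и современное общество" [Proceedings of the scientific conference "Internet and Modern Society"] (pp. 74-77).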
  41. Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  42. Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434).
  43. CAWSE Corpus - The University of Nottingham Ningbo China - 宁波诺丁汉大学. nottingham.edu.cn. 2020-01-07.
  44. English as a Lingua Franca in Academic Settings. 2018-03-23. University of Helsinki. 2020-01-07.
  45. Mauranen, A. (2010). "English as an academic lingua franca: The ELFA project". English for Specific Purposes. 29 (3): 183–190. doi:10.1016/j.esp.2009.10.001.
  46. ICLE. UCLouvain. 2020-01-07.
  47. LINDSEI. UCLouvain (in French). 2020-01-07.
  48. Trinity Lancaster Corpus ESRC Centre for Corpus Approaches to Social Science (CASS). 2020-01-07.
  49. Gablasova, D. (2019). "The Trinity Lancaster Corpus: Development, Description and Application". International Journal of Learner Corpus Research. 5 (2): 126–158. doi:10.1075/ijlcr.19001.gab.
  50. Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].
  51. Project. univie.ac.at. 2020-01-07.