General Internet Corpus of Russian explained

General Internet Corpus of Russian
Commercial: No
Type: Educational/scientific project
Registration: Needed; given by request
Language: Russian
Authors: Vladimir Selegey, Vladimir Belikov, Serge Sharoff
Launch date: 2012
Current status: Beta-testing

General Internet Corpus of Russian (GICR) is a corpus of Russian internet texts that has been accessible on request through an online query interface since 2013. The corpus includes rich text materials from the blogosphere, social networks, major news sources and literary magazines.

Goals of the project

The project is educational and scientific in status, and independent researchers and research groups use GICR materials to address many tasks in computational linguistics. While other Russian corpus projects focus on fiction and edited texts, the General Internet Corpus gives linguists a timely opportunity to study the language as it is actually used, with all its slang and regional peculiarities.

Corpus gives the opportunity to carry out research in

At various times, student papers and independent research have been carried out on the project material by students, graduates and employees of MSU, MIPT, Russian State Humanitarian University, Novosibirsk State University, Higher School of Economics, Russian Academy of Sciences, SFU, CSU, SGMP, and IAAS of MSU.

Scientific project leaders:

The organizations involved in support of GICR:

Size and content of the corpus

As of summer 2016, the corpus contains 19.8 billion tokens, of which 49% come from VKontakte, 40% from LiveJournal, another 4% from Mail.ru Blogs and News, and 2% from Russian Magazine Hall.[3] The news segment draws on RIA Novosti, Regnum, Lenta.ru and Rosbalt. Texts carry metadata markup (date of creation, the author's sex, place and year of birth, Internet genre, etc.), and all texts have automatic morphological tagging and lemmatization.[4] Most of the texts were written in 2013–2014, although some segments, such as Russian Magazine Hall, include texts dating back to 1994.[5]

Corpus segment | Words, millions | Documents
Mail.Ru Blogs | 707 | 9882120
VKontakte | 9820 | 193770717
Live Journal | 8110 | 73229158
Russian Magazine Hall | 313 | 56547
News (ria, regnum, lentaru, rosbalt) | 851 | 2964897
All corpora | 19801 | 279903439
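
The segment shares quoted in the text can be checked against the word counts in the table. The following is a minimal sketch in Python, assuming the per-segment figures as listed; the variable names are illustrative and not part of any GICR tooling.

```python
# Word counts in millions, copied from the table of corpus segments.
words_millions = {
    "Mail.Ru Blogs": 707,
    "VKontakte": 9820,
    "Live Journal": 8110,
    "Russian Magazine Hall": 313,
    "News": 851,
}

# Sum of all segments; this should equal the "All corpora" row.
total = sum(words_millions.values())

# Share of each segment as a percentage of the whole corpus.
shares = {name: 100 * count / total for name, count in words_millions.items()}

# Print segments from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<22} {share:5.1f}%")
```

The computed total is 19801 million words, the same as the "All corpora" row, and the shares agree with the percentages quoted in the text to within about one percentage point.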

GICR is one of the few mega-corpus projects, that is, corpora whose available size reaches several billion words.

Corpus | Languages | Access | Site | Size | Facilities
COW: Free, Large Web Corpora in European Languages | English, French, German, Spanish, Swedish, Dutch | Free, after registration; trial access is possible without registration | https://web.archive.org/web/20160221212019/https://webcorpora.org/ | 30 billion words | KWIC format, morphological tagging, CQP search, markup and search by date, URL, country, city, etc.
Sketch Engine | English, French, German, Italian, Arabic, Russian, Spanish, Portuguese, Korean, Japanese, Chinese, plus more languages available at extra charge | Paid access; trial access is possible after registration | https://www.sketchengine.co.uk/ | 86 billion words | concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search
Aranea Corpora | English, Russian, Finnish, French, German, Hungarian, Spanish, Italian, Dutch, Polish, Slovak | Free, after registration; trial access is possible without registration | http://sketch.juls.savba.sk/aranea_about/ | 14 billion words | NoSketch Engine, concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search, comparable query results in different languages
GICR (General Internet Corpus of Russian) | Russian | Free, registration on request | http://www.webcorpora.ru/en/ | 20 billion words | concordances, thesaurus, KWIC, morphological tagging, CQP search, markup and search by date, country, city, internet segment, sex, year and place of birth of the author, "query mail" for users
GloWbE (Corpus of Global Web-Based English) | English, specification for 20 countries | No registration | http://corpus.byu.edu/glowbe/ | 1.9 billion words | KWIC, concordances, collocates, results comparable by dialects, CQP search, corpus can be downloaded

Access

The GICR interface is currently in beta, so corpus search is free of charge but is available to researchers only on request.[6]

See also

Further reading

  1. Belikov V., Kopylov N., Piperski A., Selegey V., Sharoff S. (2013) Big and diverse is beautiful: A large corpus of Russian to study linguistic variation. In Web as Corpus Workshop (WAC-8).
  2. Lagutin M. B., Katinskaya A. Y., Selegey V. P., Sharoff S., Sorokin A. A. (2015) Automatic Classification of Web Texts Using Functional Text Dimensions. In Dialogue, Russian International Conference on Computational Linguistics, Bekasovo.
  3. Katinskaya A., Sharoff S. (2015) Applying Multi-dimensional Analysis to a Russian Webcorpus: Searching for Evidence of Genres. In Proc. of the Workshop on Balto-Slavic Natural Language Processing associated with the International Conference RANLP, Hissar, Bulgaria.

External links

Official site of GICR

Notes and References

  1. http://www.dialog-21.ru/digests/dialog2015/materials/pdf/LagutinMBetal.pdf Automatic Classification of Web Texts Using Functional Text Dimensions
  2. Web site: Collective | GICR.
  3. http://www.webcorpora.ru/%D0%BE-%D0%BA%D0%BE%D1%80%D0%BF%D1%83%D1%81%D0%B5
  4. http://www.webcorpora.ru/%D0%BE-%D0%BA%D0%BE%D1%80%D0%BF%D1%83%D1%81%D0%B5
  5. Post in the blog: https://vk.com/wall-89094852_220
  6. Web site: Contacts | GICR (Контакты | ГИКРЯ).