Russian National Corpus Explained

The Russian National Corpus (Russian: Национальный корпус русского языка, "National Corpus of the Russian Language") is a corpus of the Russian language that has been partially accessible online through a query interface since April 29, 2004. It is being created by the Institute of the Russian Language of the Russian Academy of Sciences.

It currently contains more than 1 billion word forms^[1] that are automatically lemmatized and tagged for part of speech (POS) and grammemes; that is, every possible morphological analysis of each orthographic form is assigned to it. Lemmata, POS, grammatical features, and their combinations are searchable. In addition, a subcorpus of 6 million word forms has manually resolved homonymy.
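The idea of keeping every possible analysis for a word form can be illustrated with a minimal Python sketch. It is not the corpus's actual data model or query API: the `Analysis` and `Token` classes, the `matches` and `search` helpers, the tag labels, and the toy data are all hypothetical. A token matches a query when at least one of its analyses satisfies all the requested constraints at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible morphological reading of a word form (hypothetical model)."""
    lemma: str
    pos: str
    grammemes: frozenset


@dataclass
class Token:
    """A word form with all its possible analyses; homonymy is left unresolved."""
    form: str
    analyses: list


def matches(token, lemma=None, pos=None, grammemes=()):
    """True if a single analysis satisfies every constraint that was given."""
    wanted = frozenset(grammemes)
    return any(
        (lemma is None or a.lemma == lemma)
        and (pos is None or a.pos == pos)
        and wanted <= a.grammemes
        for a in token.analyses
    )


def search(tokens, **query):
    """Return the word forms whose analyses match the query."""
    return [t.form for t in tokens if matches(t, **query)]


# Toy data with illustrative tag labels (not the corpus's real tagset).
# "стали" is ambiguous: a past-tense form of the verb "стать"
# or a genitive form of the noun "сталь".
tokens = [
    Token("стали", [
        Analysis("стать", "V", frozenset({"past", "pl"})),
        Analysis("сталь", "S", frozenset({"f", "sg", "gen"})),
    ]),
    Token("дом", [
        Analysis("дом", "S", frozenset({"m", "sg", "nom"})),
        Analysis("дом", "S", frozenset({"m", "sg", "acc"})),
    ]),
]

noun_hits = search(tokens, pos="S")
verb_hits = search(tokens, lemma="стать")
mixed_hits = search(tokens, pos="V", grammemes=["gen"])
```

Because the ambiguous form keeps both readings, it is returned for a noun query as well as a verb query. The last query returns nothing, since no single analysis is both a verb and a genitive.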

The subcorpus with resolved morphological homonymy is also automatically accentuated. The whole corpus carries searchable lexical-semantic (LS) tagging,^[2] which covers morphosemantic POS subclasses (proper noun, reflexive pronoun, etc.), LS characteristics proper (thematic class, causativity, evaluation), and derivation (diminutive, adverb formed from adjective, etc.).

The RNC also includes the following subcorpora:

a treebank of syntactic dependencies (largely based on Igor Mel'čuk's Meaning-Text Theory);
English⇔Russian, German⇒Russian, Ukrainian⇔Russian and Belarusian⇔Russian parallel corpora;
a large (100+ million words) separate corpus of modern newspapers (2001–2011);
a corpus of Russian poetry, in which rhyming words and poetic prosody (including meter, stanzas, etc.) are additionally tagged;
a corpus of Russian dialects with specific dialect grammar tagging;
a multimedia corpus with searchable tagged fragments of Russian-language movies;
a corpus showing the history of Russian stress;
an educational subcorpus reflecting school standards.

All texts carry tags with metatextual information: the author, the author's birth date, the creation date, text size, and text genre (general fiction, detective story, newspaper article, etc.). All these categories can be browsed and searched separately. Users can define their own subcorpus and restrict searches for combinations of lemmata, POS/grammeme tags, and semantic tags to that subset.
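A user-defined subcorpus amounts to filtering texts by their metadata and then searching only inside the result. The Python sketch below illustrates this under simplifying assumptions: the `Text` class, the `build_subcorpus` and `count_in` helpers, and the sample texts are hypothetical, and plain word-form matching stands in for the lemma, grammeme, and semantic queries of the real interface.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """A text with a few metatextual fields (hypothetical model)."""
    author: str
    author_birth_year: int
    created: int   # year of creation
    genre: str
    words: list


def build_subcorpus(texts, genre=None, created_from=None, created_to=None):
    """Keep only the texts whose metadata meets every criterion that was given."""
    return [
        t for t in texts
        if (genre is None or t.genre == genre)
        and (created_from is None or t.created >= created_from)
        and (created_to is None or t.created <= created_to)
    ]


def count_in(texts, word):
    """Count occurrences of a word form within the given set of texts."""
    return sum(t.words.count(word) for t in texts)


# Invented sample texts, for illustration only.
texts = [
    Text("Author A", 1950, 2005, "newspaper article", ["дом", "стоит", "дом"]),
    Text("Author B", 1900, 1935, "general fiction", ["дом", "горит"]),
    Text("Author C", 1970, 2010, "general fiction", ["сад", "и", "дом"]),
]

fiction = build_subcorpus(texts, genre="general fiction")
recent = build_subcorpus(texts, created_from=2001, created_to=2011)

hits_everywhere = count_in(texts, "дом")
hits_in_fiction = count_in(fiction, "дом")
```

The same query gives different counts depending on the subset: the word form is counted across all three texts in the first case and only in the two fiction texts in the second.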

Notes and References

Национальный корпус русского языка [Russian National Corpus] (in Russian). Retrieved August 28, 2022. Archived at https://web.archive.org/web/20220305105245/https://ruscorpora.ru/new/ on March 5, 2022.
Apresjan, Ju.; Boguslavsky, I.; Iomdin, B.; Iomdin, L.; Sannikov, A.; Sizov, V. (2006). A Syntactically and Semantically Tagged Corpus of Russian: State of the Art and Prospects. Proceedings of LREC. Genova, Italy. pp. 1378–1381. CiteSeerX 10.1.1.111.8165.
