Speech corpus explained

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine).^[1] In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.^[2] ^[3]

A corpus is a single such database; corpora, the plural, refers to several of them.

There are two types of speech corpora:

Read Speech – which includes:
- Book excerpts
- Broadcast news
- Lists of words
- Sequences of numbers

Spontaneous Speech – which includes:
- Dialogs – between two or more people (includes meetings; one such corpus is the KEC);
- Narratives – a person telling a story (one such corpus is the Buckeye Corpus);
- Map-tasks – one person explains a route on a map to another;
- Appointment-tasks – two people try to find a common meeting time based on individual schedules.

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
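
The structure described above (audio files paired with transcriptions, grouped by speech type, with an optional non-native marker) can be illustrated with a minimal Python sketch. The class names, field names, file paths and sample transcripts here are all hypothetical and chosen only for illustration; real corpora each define their own layout and metadata.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SpeechStyle(Enum):
    """The two broad corpus types named in the article."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


@dataclass
class Utterance:
    audio_path: str            # hypothetical path to a recording
    transcript: str            # text transcription of that recording
    style: SpeechStyle
    non_native: bool = False   # True for speech with a foreign accent


@dataclass
class SpeechCorpus:
    name: str
    utterances: List[Utterance] = field(default_factory=list)

    def by_style(self, style: SpeechStyle) -> List[Utterance]:
        """Return all utterances of the given speech type."""
        return [u for u in self.utterances if u.style is style]

    def non_native_subset(self) -> List[Utterance]:
        """Return only the non-native (foreign-accented) utterances."""
        return [u for u in self.utterances if u.non_native]


# A toy corpus with invented entries (not taken from any real database).
corpus = SpeechCorpus(
    name="toy-corpus",
    utterances=[
        Utterance("audio/0001.wav", "one two three four", SpeechStyle.READ),
        Utterance("audio/0002.wav", "so then we turned left at the bridge",
                  SpeechStyle.SPONTANEOUS),
        Utterance("audio/0003.wav", "can we meet on tuesday",
                  SpeechStyle.SPONTANEOUS, non_native=True),
    ],
)
```

Calling `corpus.by_style(SpeechStyle.SPONTANEOUS)` selects the two spontaneous entries, and `corpus.non_native_subset()` selects the single entry flagged as non-native.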

References

Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.

^[1] Sarangi, Susanta / Sahidullah, Md / Saha, Goutam (2020): Optimization of data-driven filterbank for automatic speaker verification. Digital Signal Processing 104, 102795. doi:10.1016/j.dsp.2020.102795. arXiv:2007.10729. Bibcode:2020DSP...10402795S. S2CID 220665533.
^[2] Reece, Andrew / Cooney, Gus / Bull, Peter / Chung, Christine / Dawson, Bryn / Fitzpatrick, Casey / Glazer, Tamara / Knox, Dean / Liebscher, Alex / Marin, Sebastian (2022-03-01): Advancing an Interdisciplinary Science of Conversation: Insights from a Large Multimodal Corpus of Human Speech. arXiv:2203.00674 [cs.CL].
^[3] Santa Barbara Corpus of Spoken American English. Department of Linguistics, UC Santa Barbara. www.linguistics.ucsb.edu. 2023-04-26.