Cepstral mean and variance normalization

Cepstral mean and variance normalization (CMVN) is a computationally efficient normalization technique for robust speech recognition. Its performance is known to degrade for short utterances, for two reasons: there is too little data to estimate the normalization parameters reliably, and discriminative information is lost because every utterance is forced to have zero mean and unit variance.^[1]

CMVN reduces the distortion caused by noise contamination, making feature extraction more robust, by linearly transforming the cepstral coefficients so that they have the same segmental statistics.^[2] Cepstral normalization has been effective in CMU Sphinx for maintaining a high level of recognition accuracy over a wide variety of acoustical environments.^[3]
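The linear transform described above can be sketched in a few lines of NumPy. This is a minimal per-utterance version, assuming the features are stored as a (frames × coefficients) array. The function name `cmvn` and the small constant `eps` are illustrative choices, not part of any particular toolkit.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance CMVN over a (num_frames, num_coeffs) array.

    Each cepstral coefficient is shifted to zero mean and scaled to unit
    variance, using statistics estimated from this utterance only.
    `eps` (an assumption of this sketch) guards against division by zero.
    """
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)   # one mean per coefficient
    std = features.std(axis=0)     # one standard deviation per coefficient
    return (features - mean) / (std + eps)

# Synthetic "utterance": 200 frames of 13 coefficients with an offset and gain.
rng = np.random.default_rng(0)
utterance = 3.0 * rng.standard_normal((200, 13)) + 5.0
normalized = cmvn(utterance)
```

Because the mean and standard deviation are estimated from the utterance itself, a very short utterance yields noisy estimates, which is one way to see the degradation mentioned earlier.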

Cepstral Normalization Techniques

Several algorithms achieve cepstral normalization in different ways.

Fixed codeword-dependent cepstral normalization (FCDCN)

FCDCN was developed to provide compensation with greater recognition accuracy than SNR-dependent cepstral normalization (SDCN) at a lower computational cost than codeword-dependent cepstral normalization (CDCN). The FCDCN algorithm applies an additive correction that depends on the instantaneous signal-to-noise ratio (SNR) of the input (like SDCN) but that can also vary from codeword to codeword (like CDCN).
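A rough sketch of such a correction follows. It assumes the correction vectors have already been trained and are indexed by codeword and by a quantized SNR bin, and it assumes the codeword for each frame is the one closest to the corrected frame. The function name, the array layout, and the toy values are hypothetical.

```python
import numpy as np

def fcdcn_compensate(frames, snr_bins, codebook, corrections):
    """Apply x_hat = z + r[k, l] frame by frame (illustrative sketch).

    frames:      (T, D) noisy cepstral vectors z
    snr_bins:    (T,)   quantized instantaneous-SNR index l per frame (assumed given)
    codebook:    (K, D) VQ codewords c[k] of the reference environment
    corrections: (K, L, D) precomputed compensation vectors r[k, l]
    """
    frames = np.asarray(frames, dtype=np.float64)
    snr_bins = np.asarray(snr_bins, dtype=np.intp)
    # Correction candidates for every frame and codeword: shape (T, K, D).
    r = np.transpose(corrections[:, snr_bins, :], (1, 0, 2))
    candidates = frames[:, None, :] + r
    # Squared distance between each corrected candidate and its codeword.
    dist = ((candidates - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)   # assumed selection rule: nearest codeword
    return candidates[np.arange(len(frames)), labels], labels

# Toy setup: 2 codewords, 2 SNR bins, 2-dimensional cepstra.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
corrections = np.zeros((2, 2, 2))
corrections[:, 0, :] = 1.0   # low-SNR bin: add 1 to every coefficient
frames = np.array([[-1.0, -1.0], [10.0, 10.0]])
compensated, labels = fcdcn_compensate(frames, [0, 1], codebook, corrections)
```

In this toy case the low-SNR frame is shifted onto the first codeword, while the high-SNR frame is left unchanged.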

Multiple Fixed Codeword-dependent Cepstral Normalization (MFCDCN)

MFCDCN is a simple extension of the FCDCN algorithm that does not need environment-specific training. In MFCDCN, compensation vectors are pre-computed in parallel for a set of target environments, using the FCDCN algorithm.
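One way to picture this is to keep one table of compensation vectors per candidate environment, compensate the utterance with each table, and keep the result that fits the reference codebook best. The sketch below uses the mean residual distortion after compensation as the selection criterion. That criterion, the helper names, and the toy values are assumptions made for illustration.

```python
import numpy as np

def compensate_one_env(frames, snr_bins, codebook, corrections):
    """FCDCN-style compensation for one environment.

    Returns the compensated frames and the mean residual VQ distortion.
    """
    r = np.transpose(corrections[:, snr_bins, :], (1, 0, 2))   # (T, K, D)
    candidates = frames[:, None, :] + r
    dist = ((candidates - codebook[None, :, :]) ** 2).sum(axis=2)
    k = dist.argmin(axis=1)
    rows = np.arange(len(frames))
    return candidates[rows, k], dist[rows, k].mean()

def mfcdcn_compensate(frames, snr_bins, codebook, corrections_per_env):
    """Try every environment's table and keep the lowest-distortion result."""
    frames = np.asarray(frames, dtype=np.float64)
    snr_bins = np.asarray(snr_bins, dtype=np.intp)
    results = [compensate_one_env(frames, snr_bins, codebook, r)
               for r in corrections_per_env]
    best = int(np.argmin([distortion for _, distortion in results]))
    return results[best][0], best

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
env_a = np.full((2, 1, 2), 1.0)    # hypothetical environment A: add 1
env_b = np.full((2, 1, 2), -3.0)   # hypothetical environment B: subtract 3
frames = np.array([[-1.0, -1.0], [9.0, 9.0]])
compensated, env_index = mfcdcn_compensate(frames, [0, 0], codebook, [env_a, env_b])
```

Here the first table maps both frames exactly onto codewords, so it gives zero residual distortion and is selected.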

Incremental Multiple Fixed Codeword-dependent Cepstral Normalization (IMFCDCN)

While environment selection for the compensation vectors of MFCDCN is generally performed on an utterance-by-utterance basis, IMFCDCN improves on it by allowing the classification process to make use of cepstral vectors from previous utterances in a given session.
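The incremental idea can be illustrated with a small accumulator that pools evidence over a session instead of deciding from each utterance alone. The sketch assumes that a mean residual distortion per candidate environment is already available for every utterance. The class name and the numbers are hypothetical.

```python
import numpy as np

class SessionEnvironmentSelector:
    """Pick an environment from distortion pooled over a whole session."""

    def __init__(self, num_envs):
        self.total_distortion = np.zeros(num_envs)  # summed over all frames so far
        self.total_frames = 0

    def update(self, mean_distortions, num_frames):
        """Add one utterance's per-environment mean distortions, return the choice."""
        per_env = np.asarray(mean_distortions, dtype=np.float64)
        self.total_distortion += per_env * num_frames
        self.total_frames += num_frames
        return int(np.argmin(self.total_distortion / self.total_frames))

selector = SessionEnvironmentSelector(num_envs=2)
# A long first utterance clearly favours environment 0.
first_choice = selector.update([2.0, 5.0], num_frames=100)
# A short second utterance would favour environment 1 if judged alone,
# but the pooled session statistics still point to environment 0.
second_choice = selector.update([4.0, 3.0], num_frames=10)
```

Weighting by frame count means a short, ambiguous utterance cannot easily overturn a decision supported by much more earlier data.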

Cepstral Noise Subtraction

Automatic speech recognition (ASR) is the process of transcribing speech utterances, represented as acoustic waveforms, into written words. CMVN has been used in a range of applications because it has been shown to improve speech recognition results across different environments. It can reduce the mismatch between test and training data caused by channel distortions and colorations. CMVN has also been found to reduce differences in feature representation between speakers, and it can partly reduce the influence of background noise.^[4]

Notes and References

1. Prasad, N. V., & Umesh, S. (2013). "Improved cepstral mean and variance normalization using Bayesian framework". 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 156–161.
2. Viikki, O., & Laurila, K. (1998). "Cepstral domain segmental feature vector normalization for noise robust speech recognition". Speech Communication, 25(1–3), 133–147.
3. Liu, F., Stern, R., Huang, X., & Acero, A. (1993). "Efficient cepstral normalization for robust speech recognition". Proc. ARPA Workshop on Human Language Technology, Princeton, NJ.
4. Rehr, R., & Gerkmann, T. (2015). "Cepstral noise subtraction for robust automatic speech recognition". 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).