Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.
Given an input text or some sequence of linguistic unit
Y
X
where
\theta
Typically, the input text will first be passed to an acoustic feature generator, then the acoustic features are passed to the neural vocoder. For the acoustic feature generator, the Loss function is typically L1 or L2 loss. These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function will be designed to have more penalty on this range:
where
losshuman
\alpha
In September 2016, DeepMind proposed WaveNet, a deep generative model of raw audio waveforms, demonstrating that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms. Although WaveNet was initially considered to be computationally expensive and slow to be used in consumer products at the time, a year after its release, DeepMind unveiled a modified version of WaveNet known as "Parallel WaveNet," a production model 1,000 faster than the original.
In early 2017, Mila proposed char2wav, a model to produce raw waveform in an end-to-end method. In the same year, Google and Facebook proposed Tacotron and VoiceLoop, respectively, to generate acoustic features directly from the input text; months later, Google proposed Tacotron2, which combined the WaveNet vocoder with the revised Tacotron architecture to perform end-to-end speech synthesis. Tacotron2 can generate high-quality speech approaching the human voice.
Currently, self-supervised learning has gained much attention through better use of unlabelled data. Research has shown that, with the aid of self-supervised loss, the need for paired data decreases.[1] [2]
Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristic. In June 2018, Google proposed to use pre-trained speaker verification models as speaker encoders to extract speaker embeddings.[3] The speaker encoders then become part of the neural text-to-speech models, so that it can determine the style and characteristics of the output speech. This procedure has shown the community that it is possible to use only a single model to generate speech with multiple styles.
In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent performance on speech quality. Wavenet factorised the joint probability of a waveform
x=\{x1,...,xT\}
p\theta
T | |
(x)=\prod | |
t=1 |
p(xt|x1,...,xt-1)
where
\theta
xt