WaveNet Review

The WaveNet paper is kind of old. Yet it seems to come up in various contexts. Some thoughts on this.
deep learning
data science
NLP
Author

Oren Bochman

Published

Sunday, August 29, 2021

The WaveNet paper is a few years old, but it keeps coming up in various contexts - mostly in the area of sound synthesis. It was the first paper to generate waveforms directly from a neural network instead of modeling vocoder parameters, it used an innovative convolutional network technique, and to top it all off it is an autoregressive model, appearing some years before probabilistic neural networks came into common use. Here is some information on it and on some follow-up papers. (Perhaps I'll split them up later.)

WaveNet

WaveNet is a fully probabilistic and autoregressive CNN model for the TTS synthesis task. At that point CNNs were mostly used for image processing, while RNNs, which are much harder to train, were used for sequence-to-sequence modeling. The main WaveNet innovation is a CNN that can handle context over very long sequences (it needed to handle 16,000 samples per second over several seconds).
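For reference, the paper models the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ as a product of conditionals, each sample conditioned on all previous samples:

$$
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}).
$$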

WaveNet was based on the PixelCNN architecture (van den Oord, Kalchbrenner, and Kavukcuoglu, "Pixel Recurrent Neural Networks", 2016; van den Oord, Kalchbrenner, Vinyals, Espeholt, Graves, and Kavukcuoglu, "Conditional Image Generation with PixelCNN Decoders", 2016). To handle the long-range temporal dependencies needed for raw audio generation, the authors developed a new architecture based on dilated causal convolutions, which exhibit very large receptive fields.
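To make the receptive-field point concrete, here is a minimal PyTorch sketch of a stack of dilated causal convolutions; the channel count and dilation schedule are illustrative assumptions, and it omits WaveNet's gated activations, residual connections, and skip connections.

```python
# Minimal sketch: a stack of dilated causal 1-D convolutions (not the full WaveNet block).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 2):
        super().__init__()
        # Left-pad so each output depends only on current and past samples.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        return self.conv(F.pad(x, (self.pad, 0)))

dilations = [2 ** i for i in range(10)]                 # 1, 2, 4, ..., 512
stack = nn.Sequential(*[CausalConv1d(16, d) for d in dilations])

# Receptive field grows exponentially with depth: 1 + sum of dilations (kernel size 2).
receptive_field = 1 + sum(dilations)                    # 1024 samples, about 64 ms at 16 kHz
print(receptive_field)

x = torch.randn(1, 16, 16000)                           # one second of 16 kHz features
print(stack(x).shape)                                   # length preserved, strictly causal
```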

WaveNet is unique in its ability to synthesize sound directly, where previous models required additional steps. Later TTS systems utilise WaveNet as a component, conditioning the joint waveform model on linguistic context representations.

Typical TTS pipelines have two parts:

1. Text analysis: text sequence $\implies$ phoneme sequence + linguistic contexts
   - sentence segmentation
   - word segmentation
   - text normalization
   - POS tagging
   - grapheme-to-phoneme conversion
2. Speech synthesis: phoneme sequence + linguistic contexts $\implies$ synthesized speech waveform
   - prosody prediction
   - speech waveform generation
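Schematically, the pipeline looks like the sketch below; every function and name here is a hypothetical placeholder to show the data flow, not a real TTS API.

```python
# Hypothetical placeholders illustrating the two-stage pipeline; not a real TTS library.

def text_analysis(text: str) -> list[dict]:
    """Front end: text -> phoneme sequence plus linguistic contexts.
    Sentence/word segmentation, normalization, POS tagging, and grapheme-to-phoneme
    conversion would happen here; we return a toy result to stay self-contained."""
    return [{"phoneme": p, "context": {"pos": "UH", "stress": 0}} for p in ["h", "ə", "l", "oʊ"]]

def speech_synthesis(phonemes: list[dict], sr: int = 16000) -> list[float]:
    """Back end: prosody prediction + waveform generation (stubbed as silence)."""
    return [0.0] * sr

waveform = speech_synthesis(text_analysis("hello"))
print(len(waveform))
```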

The two main approaches for the speech synthesis part are:

  1. non-parametric, example-based (AKA concatenative) speech synthesis (Moulines & Charpentier, 1990; Sagisaka et al., 1992; Hunt & Black, 1996), which builds up the utterance from units of recorded speech.
  2. parametric, model-based (AKA statistical parametric) speech synthesis (Yoshimura, 2002; Zen et al., 2009), which uses a generative model to synthesize the speech. The statistical parametric approach first extracts a sequence of vocoder parameters (Dudley, 1939) $o = \{o_1, \ldots, o_N\}$ from the speech signal $x = \{x_1, \ldots, x_T\}$ and linguistic features $l$ from the text $W$, where $N$ and $T$ are the numbers of vocoder parameter vectors and speech samples, respectively.
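In the standard statistical parametric formulation, the generative model parameters $\lambda$ are estimated from the extracted vocoder parameters and linguistic features, and synthesis then picks the most likely vocoder parameters for new text:

$$
\hat{\lambda} = \arg\max_{\lambda} \, p(o \mid l, \lambda),
\qquad
\hat{o} = \arg\max_{o} \, p(o \mid l, \hat{\lambda}),
$$

after which a vocoder reconstructs the waveform from $\hat{o}$.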

Typically a vocoder parameter vector $o_n$ is extracted every 5 ms. It often includes:

- `cepstra` (Imai & Furuichi, 1988) or `line spectral pairs` (Itakura, 1975), which represent the vocal tract transfer function.
- `fundamental frequency` $F_0$ and `aperiodicity` (Kawahara et al., 2001), which represent characteristics of the vocal source excitation signal.
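As a rough illustration of the first bullet, here is a small numpy sketch of the real cepstrum of a single analysis window; production systems use mel-cepstra or WORLD/STRAIGHT features rather than this raw form, and the frame length here is an assumption.

```python
# Real cepstrum of one analysis window (illustrative, not a production vocoder feature).
import numpy as np

sr = 16000
frame = np.random.randn(400)                      # a 25 ms window; hopped every 5 ms in practice
windowed = frame * np.hanning(len(frame))

spectrum = np.fft.rfft(windowed)
log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # small floor avoids log(0)
cepstrum = np.fft.irfft(log_magnitude)            # real cepstrum

envelope_coeffs = cepstrum[:25]                   # low-quefrency part ~ vocal tract envelope
print(envelope_coeffs.shape)
```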

Then a set of generative models, such as an HMM (Yoshimura, 2002), a feed-forward neural network (Zen et al., 2013), or an RNN (Tuerk & Robinson, 1993; Karaali et al., 1997; Fan et al., 2014), is trained on the extracted vocoder parameters and linguistic features.
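As a sketch of the feed-forward variant, a frame-level acoustic model simply maps linguistic feature vectors to vocoder parameter vectors; the layer sizes and feature dimensions below are assumptions for illustration, not taken from the papers.

```python
# Frame-level feed-forward acoustic model: linguistic features in, vocoder parameters out.
import torch
import torch.nn as nn

linguistic_dim = 300   # assumed size of per-frame linguistic context features
vocoder_dim = 63       # assumed: e.g. 60 cepstra + log F0 + aperiodicity + voicing flag

acoustic_model = nn.Sequential(
    nn.Linear(linguistic_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, vocoder_dim),
)

frames = torch.randn(200, linguistic_dim)          # one second of 5 ms frames
vocoder_params = acoustic_model(frames)            # a vocoder would reconstruct speech from these
print(vocoder_params.shape)
```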

Resources

Citation

BibTeX citation:
@online{bochman2021,
  author = {Bochman, Oren},
  title = {WaveNet {Review}},
  date = {2021-08-29},
  url = {https://orenbochman.github.io/posts/2021/2021-09-08-wave-net-review/2021-09-08-wave-net-review.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2021. “WaveNet Review.” August 29, 2021. https://orenbochman.github.io/posts/2021/2021-09-08-wave-net-review/2021-09-08-wave-net-review.html.