WaveNet

paper review

The WaveNet paper is kind of old. Yet it seems to come up in various contexts. Some thoughts on this.
deep learning
data science
NLP
Author

Oren Bochman

Published

Sunday, August 29, 2021

The WaveNet paper is a few years old, but it keeps coming up in various contexts, mostly in the area of sound synthesis. It was the first paper to generate the waveform directly from a neural net instead of modelling vocoder parameters, it used an innovative convolutional-net technique, and to top it all off it is a fully autoregressive probabilistic model, from a few years before such probabilistic neural models were widely used. Here is some info on it and on some follow-up papers. (Perhaps I’ll split them up later.)

ABSTRACT

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition. – (Oord et al. 2016)

Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. “WaveNet: A Generative Model for Raw Audio.” arXiv preprint arXiv:1609.03499.

WaveNet

WaveNet is a fully probabilistic and autoregressive CNN model for the TTS synthesis task. At that point CNNs were mostly used for image processing, and RNNs, which are much harder to train, were used for sequence-to-sequence modelling. The main WaveNet innovation is a CNN that can handle contexts over very long sequences (it needed to handle 16,000 samples per second over several seconds).
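
At its core the model factorises the joint probability of a waveform $\mathbf{x} = \{x_1, \dots, x_T\}$ sample by sample, as in the paper:

$$
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),
$$

so each predicted sample is conditioned on all previous samples, which is exactly why a very large receptive field matters.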

WaveNet is based on the PixelCNN architecture (van den Oord, Kalchbrenner, and Kavukcuoglu, “Pixel Recurrent Neural Networks”, 2016; van den Oord, Kalchbrenner, Vinyals, Espeholt, Graves, and Kavukcuoglu, “Conditional Image Generation with PixelCNN Decoders”, 2016). To handle the long-range temporal dependencies needed for raw audio generation, the authors developed a new architecture based on dilated causal convolutions, which exhibits very large receptive fields.
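
To make the receptive-field argument concrete, here is a minimal sketch, in PyTorch, of a stack of dilated causal convolutions. This is my own illustration, not the authors’ code; the gated activations, residual/skip connections and μ-law output distribution of the real model are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConvStack(nn.Module):
    """Stack of 1-D causal convolutions with exponentially growing dilation.

    With kernel_size=2 and dilations 1, 2, 4, ..., 2^(L-1), the receptive
    field grows to 2^L samples while the depth stays linear in L.
    """
    def __init__(self, channels: int = 32, num_layers: int = 10, kernel_size: int = 2):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilations = [2 ** i for i in range(num_layers)]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d)
            for d in self.dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        for conv, d in zip(self.convs, self.dilations):
            # Left-pad so the convolution never sees future samples (causality).
            pad = (self.kernel_size - 1) * d
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

# Receptive field of this sketch: 2**10 = 1024 samples (~64 ms at 16 kHz).
net = DilatedCausalConvStack()
out = net(torch.randn(1, 32, 16000))  # one second of (already embedded) audio
print(out.shape)                      # torch.Size([1, 32, 16000])
```

With kernel size 2 and dilations doubling each layer, ten layers already see 1,024 past samples (about 64 ms at 16 kHz), whereas a non-dilated stack of the same depth would see only 11.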

WaveNet is unique in its ability to synthesize sound directly, where previous models required additional steps. Later TTS systems use WaveNet as a component, conditioning the autoregressive model on linguistic context representations.

Typical TTS pipelines have two parts (a toy sketch of this structure follows the list):

  1. Text analysis: text sequence → sentence segmentation → word segmentation → text normalization → POS tagging → grapheme-to-phoneme conversion → phoneme sequence + linguistic contexts.
  2. Speech synthesis: phoneme sequence + linguistic contexts → prosody prediction → speech waveform generation → synthesized speech waveform.
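
A purely illustrative sketch of that two-stage structure; every function here is a hypothetical stub (not an API from the paper or any library), kept only to show how text analysis feeds linguistic contexts into speech synthesis:

```python
from dataclasses import dataclass

# All names below are hypothetical placeholders illustrating the classical
# two-stage TTS pipeline; they are not from the paper or any library.

@dataclass
class LinguisticContext:
    phonemes: list   # phoneme sequence from grapheme-to-phoneme conversion
    pos_tags: list   # part-of-speech tags (stress, syllable position, etc. omitted)

def text_analysis(text: str) -> LinguisticContext:
    sentences = text.split(".")                        # sentence segmentation (toy)
    words = [w for s in sentences for w in s.split()]  # word segmentation (toy)
    words = [w.lower() for w in words]                 # text normalization (toy)
    pos_tags = ["X"] * len(words)                      # POS tagging (stub)
    phonemes = [p for w in words for p in w]           # grapheme-to-phoneme (stub)
    return LinguisticContext(phonemes=phonemes, pos_tags=pos_tags)

def speech_synthesis(ctx: LinguisticContext) -> list:
    durations = [5 for _ in ctx.phonemes]              # prosody prediction (stub)
    waveform = [0.0] * sum(durations)                  # waveform generation (stub)
    return waveform

waveform = speech_synthesis(text_analysis("WaveNet generates raw audio."))
print(len(waveform))
```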

The two main approaches for the speech synthesis part are:

  1. non-parametric, example-based, AKA concatenative speech synthesis (Moulines & Charpentier, 1990; Sagisaka et al., 1992; Hunt & Black, 1996), which builds up the utterance from units of recorded speech.
  2. parametric, model-based, AKA statistical parametric speech synthesis (Yoshimura, 2002; Zen et al., 2009), which uses a generative model to synthesize the speech. The statistical parametric approach first extracts a sequence of vocoder parameters (Dudley, 1939) $\mathbf{o} = \{o_1, \dots, o_N\}$ from speech signals $\mathbf{x} = \{x_1, \dots, x_T\}$ and linguistic features $l$ from the text $W$, where $N$ and $T$ correspond to the numbers of vocoder parameter vectors and speech signals.

Typically a vocoder parameter vector $o_n$ is extracted every 5 ms (a toy extraction sketch follows this list). It often includes:

- `cepstra` (Imai & Furuichi, 1988) or `line spectral pairs` (Itakura, 1975), which represent the vocal tract transfer function.
- `fundamental frequency` $F_0$ and `aperiodicity` (Kawahara et al., 2001), which represent characteristics of the vocal source excitation signals.
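
As a rough illustration of what “one feature vector every 5 ms” means, here is a self-contained NumPy sketch that computes a toy real cepstrum and a crude cepstral-peak $F_0$ per frame. This is my own simplification, not the estimators cited above:

```python
import numpy as np

def frame_features(x: np.ndarray, sr: int = 16000,
                   frame_ms: float = 40.0, hop_ms: float = 5.0):
    """Toy vocoder-style analysis: one feature vector every 5 ms.

    Returns per-frame real cepstra (vocal tract proxy) and a crude F0
    estimate from the cepstral peak (vocal source proxy).
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame_len)
    cepstra, f0 = [], []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * window
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-8)
        ceps = np.fft.irfft(log_mag)           # real cepstrum
        cepstra.append(ceps[:30])              # keep low-order coefficients
        # crude F0: cepstral peak in the 50-400 Hz quefrency range
        lo, hi = sr // 400, sr // 50
        peak = lo + np.argmax(ceps[lo:hi])
        f0.append(sr / peak)
    return np.array(cepstra), np.array(f0)

sr = 16000
t = np.arange(sr) / sr
sig = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))  # harmonic test tone
cepstra, f0 = frame_features(sig, sr)
print(cepstra.shape, f0[:3])  # ~200 frames/sec; F0 should land near 120 Hz
```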

Then a generative model, such as an HMM (Yoshimura, 2002), a feed-forward neural network (Zen et al., 2013), or an RNN (Tuerk & Robinson, 1993; Karaali et al., 1997; Fan et al., 2014), is trained on the extracted vocoder parameters and linguistic features.
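
For the model-based approach, the simplest acoustic model is a feed-forward network regressing frame-level vocoder parameters from linguistic feature vectors. The sketch below is my own minimal PyTorch illustration with made-up feature dimensions, not the architecture of any of the cited papers:

```python
import torch
import torch.nn as nn

# Made-up dimensions for illustration: a few hundred linguistic features per
# frame in, a few dozen vocoder features (cepstra, F0, aperiodicity) out.
LINGUISTIC_DIM, VOCODER_DIM = 300, 32

acoustic_model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, VOCODER_DIM),
)

optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy training step on random frames; real data would pair aligned
# linguistic features l_n with extracted vocoder parameters o_n per 5 ms frame.
l = torch.randn(64, LINGUISTIC_DIM)   # batch of linguistic feature vectors
o = torch.randn(64, VOCODER_DIM)      # corresponding vocoder parameter vectors
loss = loss_fn(acoustic_model(l), o)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```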


Citation

BibTeX citation:
@online{bochman2021,
  author = {Bochman, Oren},
  title = {WaveNet},
  date = {2021-08-29},
  url = {https://orenbochman.github.io/posts/2021/2021-09-08-wave-net-review/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2021. “WaveNet.” August 29, 2021. https://orenbochman.github.io/posts/2021/2021-09-08-wave-net-review/.