Deep Neural Networks - Notes for lecture 4d

Neuro-probabilistic language models

Categories: deep learning, neural networks, notes, coursera, NLP
Author: Oren Bochman

Published: Monday, August 14, 2017

Lecture 4d: Neuro-probabilistic language models

This is the first of several applications of neural networks that we'll be studying in some detail in this course.

Synonyms: word embedding; word feature vector; word encoding.

All of these describe the learned collection of numbers that is used to represent a word. “embedding” emphasizes that it’s a location in a high-dimensional space: it’s where the words are embedded in that space. When we check to see which words are close to each other, we’re thinking about that embedding.

“feature vector” emphasizes that it’s a vector instead of a scalar, and that it’s componential, i.e. composed of multiple feature values.

“encoding” is very generic and doesn’t emphasize anything specific. The lecture first reviews the standard trigram model and then shows how a feature-based neural model improves on it.
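Before turning to the trigram model, here is a minimal sketch of what an embedding table looks like in practice (my own illustration, not from the lecture; the vocabulary, dimensionality, and random vectors are placeholder assumptions): each word maps to a row of a matrix, and “closeness” between words is just distance between rows.

```python
import numpy as np

# A toy embedding table: one feature vector per word. The vectors here are
# random placeholders; a trained model would learn them.
vocab = ["cat", "dog", "garden", "yard", "friday", "monday"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))   # 8-dimensional feature vectors

def nearest(word, k=2):
    """Return the k words whose embeddings are closest by cosine similarity."""
    v = embeddings[word_to_id[word]]
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v))
    ranked = [vocab[i] for i in np.argsort(-sims) if vocab[i] != word]
    return ranked[:k]

print(nearest("cat"))   # with trained embeddings we'd expect "dog" to rank high
```

With randomly initialized vectors the neighbours are meaningless; after training, nearby rows correspond to words that behave similarly.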

A basic problem in speech recognition

  • We cannot identify phonemes perfectly in noisy speech
    • The acoustic input is often ambiguous: there are several different words that fit the acoustic signal equally well.
  • People use their understanding of the meaning of the utterance to hear the right words.
    • We do this unconsciously: the acoustics of “recognize speech” and “wreck a nice beach” are nearly identical, yet we effortlessly hear the intended phrase.
    • We are very good at it.
  • This means speech recognizers have to know which words are likely to come next and which are not.
    • Fortunately, words can be predicted quite well without full understanding.

The standard “trigram” method

  • Take a huge amount of text and count the frequencies of all triples of words.
  • Use these frequencies to make bets on the relative probabilities of words given the previous two words:

$$\frac{p(w_3 = c \mid w_2 = b, w_1 = a)}{p(w_3 = d \mid w_2 = b, w_1 = a)} = \frac{\text{count}(abc)}{\text{count}(abd)}$$

  • Until very recently, this was the state of the art.
  • We cannot use a much bigger context because there are too many possibilities to store and the counts would mostly be zero.
  • We have to “back off” to bigrams when the count for a trigram is too small (see the sketch after this list).
    • The probability is not zero just because the count is zero!
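The ratio above can be read straight off the count tables. Below is a minimal sketch of the idea, assuming a tiny toy corpus and a crude fallback to bigram counts; it is an illustration, not the lecture's code, and real systems use proper smoothing (e.g. Katz back-off) so that unseen events do not get probability zero.

```python
from collections import Counter

# Estimate p(w3 | w1, w2) from trigram counts, backing off to bigram counts
# when the trigram context was never seen.
corpus = "the cat got squashed in the garden on friday".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob_next(w1, w2, w3):
    if bigrams[(w1, w2)] > 0 and trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # back off: ignore w1 and use the bigram estimate instead
    if unigrams[w2] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return 0.0

print(prob_next("the", "cat", "got"))   # 1.0 in this tiny corpus
print(prob_next("a", "cat", "got"))     # context unseen, falls back to the bigram estimate
```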

Information that the trigram model fails to use

  • Suppose we have seen the sentence “the cat got squashed in the garden on friday”
  • This should help us predict words in the sentence “the dog got flattened in the yard on monday”
  • A trigram model does not understand the similarities between
    • cat/dog squashed/flattened garden/yard friday/monday
  • To overcome this limitation, we need to use the semantic and syntactic features of previous words to predict the features of the next word.
    • Using a feature representation also allows a context that contains many more previous words (e.g. 10). A minimal sketch of such a feature-based model follows this list.
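To make the feature-based alternative concrete, here is a minimal sketch in the spirit of Bengio-style neuro-probabilistic language models (my own simplification; the layer sizes, variable names, and random weights are illustrative assumptions, not the lecture's specification): each context word is looked up in a table of feature vectors, the vectors are concatenated, passed through a hidden layer, and a softmax over the vocabulary gives the probability of the next word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 1000, 50, 2, 128           # vocab size, embedding dim, context size, hidden units

C  = rng.normal(0, 0.1, size=(V, d))    # word feature vectors (the embedding table)
W1 = rng.normal(0, 0.1, size=(n * d, h))
b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, size=(h, V))
b2 = np.zeros(V)

def next_word_probs(context_ids):
    """Forward pass: context word ids -> distribution over the next word."""
    x = C[context_ids].reshape(-1)      # look up and concatenate feature vectors
    hidden = np.tanh(x @ W1 + b1)       # hidden layer
    logits = hidden @ W2 + b2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = next_word_probs([12, 7])            # ids of the previous two words
print(p.shape, p.sum())                 # (1000,) and the probabilities sum to 1
```

In training, the feature vectors in C and all the weights are learned by backpropagating the prediction error, which is what lets words like “cat” and “dog” end up with similar feature vectors.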
