Classification & Vector Spaces - Probability and Bayes’ Rule

Course 1 of NLP Specialization

Concepts, code snippets, and slide commentary for this week's lesson, from my course notes on the deeplearning.ai Natural Language Processing specialization.
Tags: NLP, Coursera, notes, deeplearning.ai, course notes, Conditional Probability, Bayes rule, Naïve Bayes, Laplace smoothing, Log-likelihood, classification, sentiment analysis task, bibliography
Author

Oren Bochman

Published

Friday, October 23, 2020

import pandas as pd
import string

raw_tweets = [
  "I am happy because I am learning NLP",
  "I am sad, I am not learning NLP",
  "I am happy, not sad",
  "I am sad, not happy",
]

def clean(tweet: str) -> str:
    # strip punctuation and lowercase
    return tweet.translate(str.maketrans('', '', string.punctuation)).lower()

tweets = [clean(tweet) for tweet in raw_tweets]
labels = ['+', '-', '+', '-']
df = pd.DataFrame({'tweets': tweets, 'labels': labels})
df
                                 tweets labels
0  i am happy because i am learning nlp      +
1        i am sad i am not learning nlp      -
2                    i am happy not sad      +
3                    i am sad not happy      -
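Before counting, a quick recap of the quantities the next cell computes, in standard Naïve Bayes notation (here $N_{\text{pos}}$ and $N_{\text{neg}}$ are the total word counts in each class and $V$ is the vocabulary size):

```latex
P(w \mid \text{pos}) = \frac{\operatorname{freq}(w, \text{pos}) + 1}{N_{\text{pos}} + V}
\qquad
P(w \mid \text{neg}) = \frac{\operatorname{freq}(w, \text{neg}) + 1}{N_{\text{neg}} + V}
\qquad
\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}
```

The $+1$ in the numerator and the $+V$ in the denominator are the Laplace smoothing terms: they keep any word's conditional probability from being exactly zero, so the log-ratio $\lambda(w)$ stays finite.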
import numpy as np
from collections import Counter

# word frequencies per class
p_freq, n_freq = Counter(), Counter()
for tweet in df[df.labels == '+']['tweets']:
    p_freq.update(tweet.split())
for tweet in df[df.labels == '-']['tweets']:
    n_freq.update(tweet.split())
print(p_freq)
print(n_freq)

vocab = sorted(set(p_freq) | set(n_freq))
vocab_df = pd.DataFrame({
    'vocab': vocab,
    'pos_freq': [p_freq[word] for word in vocab],
    'neg_freq': [n_freq[word] for word in vocab],
})
# raw conditional probabilities P(word | class)
vocab_df['p_pos'] = vocab_df.pos_freq / vocab_df.pos_freq.sum()
vocab_df['p_neg'] = vocab_df.neg_freq / vocab_df.neg_freq.sum()
# Laplace smoothing: add 1 to each count and the vocabulary size V to the denominator
V = vocab_df.shape[0]
vocab_df['p_pos_sm'] = (vocab_df.pos_freq + 1) / (vocab_df.pos_freq.sum() + V)
vocab_df['p_neg_sm'] = (vocab_df.neg_freq + 1) / (vocab_df.neg_freq.sum() + V)
vocab_df['ratio'] = vocab_df.p_pos_sm / vocab_df.p_neg_sm
vocab_df['lambda'] = np.log(vocab_df.ratio)
pd.set_option('display.float_format', '{:.2f}'.format)
print(vocab_df.shape)
vocab_df
Counter({'i': 3, 'am': 3, 'happy': 2, 'because': 1, 'learning': 1, 'nlp': 1, 'not': 1, 'sad': 1})
Counter({'i': 3, 'am': 3, 'sad': 2, 'not': 2, 'learning': 1, 'nlp': 1, 'happy': 1})
(8, 9)
      vocab  pos_freq  neg_freq  p_pos  p_neg  p_pos_sm  p_neg_sm  ratio  lambda
0        am         3         3   0.23   0.23      0.19      0.19   1.00    0.00
1   because         1         0   0.08   0.00      0.10      0.05   2.00    0.69
2     happy         2         1   0.15   0.08      0.14      0.10   1.50    0.41
3         i         3         3   0.23   0.23      0.19      0.19   1.00    0.00
4  learning         1         1   0.08   0.08      0.10      0.10   1.00    0.00
5       nlp         1         1   0.08   0.08      0.10      0.10   1.00    0.00
6       not         1         2   0.08   0.15      0.10      0.14   0.67   -0.41
7       sad         1         2   0.08   0.15      0.10      0.14   0.67   -0.41
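The lambda scores are what the classifier actually uses: to label a new tweet, sum λ(w) over its words and check the sign. The snippet below is a minimal self-contained sketch of that inference step (the helper names `lam` and `predict` are my own, not from the course; the log-prior term the course adds is zero here because the two classes are balanced).

```python
import string
from collections import Counter
from math import log

raw = [("I am happy because I am learning NLP", '+'),
       ("I am sad, I am not learning NLP", '-'),
       ("I am happy, not sad", '+'),
       ("I am sad, not happy", '-')]

def clean(tweet: str) -> str:
    # same cleaning as above: strip punctuation, lowercase
    return tweet.translate(str.maketrans('', '', string.punctuation)).lower()

# word frequencies per class
pos, neg = Counter(), Counter()
for tweet, label in raw:
    (pos if label == '+' else neg).update(clean(tweet).split())

vocab = set(pos) | set(neg)
V = len(vocab)

def lam(word: str) -> float:
    # log-ratio of Laplace-smoothed conditional probabilities
    p = (pos[word] + 1) / (sum(pos.values()) + V)
    n = (neg[word] + 1) / (sum(neg.values()) + V)
    return log(p / n)

def predict(tweet: str, log_prior: float = 0.0) -> str:
    # sum lambda over known words; out-of-vocabulary words contribute nothing
    score = log_prior + sum(lam(w) for w in clean(tweet).split() if w in vocab)
    return '+' if score > 0 else '-'

print(predict("I am learning because NLP makes me happy"))  # '+'
```

Note that the unseen words "makes" and "me" are simply skipped; with such a tiny corpus, the verdict rests entirely on "because" and "happy".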
from IPython.display import Markdown
from tabulate import tabulate
table = [["Sun",696000,1989100000],
         ["Earth",6371,5973.6],
         ["Moon",1737,73.5],
         ["Mars",3390,641.85]]
Markdown(tabulate(
  table, 
  headers=["Planet","R (km)", "mass (x 10^29 kg)"]
))
Table 1: Planets
Planet R (km) mass (x 10^29 kg)
Sun 696000 1.9891e+09
Earth 6371 5973.6
Moon 1737 73.5
Mars 3390 641.85

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:
@online{bochman2020,
  author = {Bochman, Oren},
  title = {Classification \& {Vector} {Spaces} - {Probability} and
    {Bayes’} {Rule}},
  date = {2020-10-23},
  url = {https://orenbochman.github.io/notes/deeplearning.ai-nlp-c1/l2-naive-bayes/code.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2020. “Classification & Vector Spaces - Probability and Bayes’ Rule.” October 23, 2020. https://orenbochman.github.io/notes/deeplearning.ai-nlp-c1/l2-naive-bayes/code.html.