Classification & Vector Spaces - Probability and Bayes’ Rule

Course 1 of NLP Specialization

Concepts, code snippets, and slide commentary for this week’s lesson, from the course notes for the deeplearning.ai Natural Language Processing Specialization.

NLP

Coursera

notes

deeplearning.ai

course notes

Conditional Probability

Bayes rule

Naïve Bayes

Laplace smoothing

Log-likelihood

classification

sentiment analysis task

bibliography

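import string

import pandas as pd

# Toy corpus: four short tweets, two positive and two negative.
raw_tweets = [
    "I am happy because I am learning NLP",
    "I am sad, I am not learning NLP",
    "I am happy, not sad",
    "I am sad, not happy",
]

def clean(tweet: str) -> str:
    """Strip punctuation and lowercase the tweet."""
    return tweet.translate(str.maketrans('', '', string.punctuation)).lower()

# Clean every tweet and pair it with its sentiment label ('+' or '-').
tweets = [clean(tweet) for tweet in raw_tweets]
labels = ['+', '-', '+', '-']
df = pd.DataFrame({'tweets': tweets, 'labels': labels})
df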

	tweets	labels
0	i am happy because i am learning nlp	+
1	i am sad i am not learning nlp	-
2	i am happy not sad	+
3	i am sad not happy	-

import numpy as np
from collections import Counter

# Count word occurrences separately for the positive and the negative tweets.
p_freq, n_freq = Counter(), Counter()
for tweet in df[df.labels == '+']['tweets']:
    p_freq.update(tweet.split())
for tweet in df[df.labels == '-']['tweets']:
    n_freq.update(tweet.split())
print(p_freq)
print(n_freq)

# Sort the vocabulary so the row order is reproducible.
vocab = sorted(set(p_freq) | set(n_freq))
V = len(vocab)  # vocabulary size, the smoothing term in the denominator
vocab_df = pd.DataFrame({
    'vocab': vocab,
    'pos_freq': [p_freq[word] for word in vocab],
    'neg_freq': [n_freq[word] for word in vocab],
})
# Unsmoothed conditional probabilities P(word | class).
vocab_df['p_pos'] = vocab_df.pos_freq / vocab_df.pos_freq.sum()
vocab_df['p_neg'] = vocab_df.neg_freq / vocab_df.neg_freq.sum()
# Laplace (add-one) smoothing: (freq + 1) / (N_class + V).
vocab_df['p_pos_sm'] = (vocab_df.pos_freq + 1) / (vocab_df.pos_freq.sum() + V)
vocab_df['p_neg_sm'] = (vocab_df.neg_freq + 1) / (vocab_df.neg_freq.sum() + V)
vocab_df['ratio'] = vocab_df.p_pos_sm / vocab_df.p_neg_sm
# lambda is the log of the smoothed likelihood ratio.
vocab_df['lambda'] = np.log(vocab_df['ratio'])
pd.set_option('display.float_format', '{:.2f}'.format)
print(vocab_df.shape)
vocab_df

Counter({'i': 3, 'am': 3, 'happy': 2, 'because': 1, 'learning': 1, 'nlp': 1, 'not': 1, 'sad': 1})
Counter({'i': 3, 'am': 3, 'sad': 2, 'not': 2, 'learning': 1, 'nlp': 1, 'happy': 1})
(8, 9)

	vocab	pos_freq	neg_freq	p_pos	p_neg	p_pos_sm	p_neg_sm	ratio	lambda
0	am	3	3	0.23	0.23	0.19	0.19	1.00	0.00
1	because	1	0	0.08	0.00	0.10	0.05	2.00	0.69
2	happy	2	1	0.15	0.08	0.14	0.10	1.50	0.41
3	i	3	3	0.23	0.23	0.19	0.19	1.00	0.00
4	learning	1	1	0.08	0.08	0.10	0.10	1.00	0.00
5	nlp	1	1	0.08	0.08	0.10	0.10	1.00	0.00
6	not	1	2	0.08	0.15	0.10	0.14	0.67	-0.41
7	sad	1	2	0.08	0.15	0.10	0.14	0.67	-0.41
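The per-word lambda values are what a Naïve Bayes classifier sums to score a tweet. The following is a minimal sketch of that last step, not code from the course: it rebuilds the counts from the four cleaned tweets, assumes add-one smoothing with the vocabulary size V in the denominator, and scores a tweet as the log prior plus the sum of the lambdas of its words. The helper names `score` and `predict` are hypothetical, unseen words are assumed to contribute 0, and the input is assumed to be free of punctuation.

```python
import math
from collections import Counter

# The four cleaned tweets from the notes, paired with their labels.
tweets = [
    ("i am happy because i am learning nlp", "+"),
    ("i am sad i am not learning nlp", "-"),
    ("i am happy not sad", "+"),
    ("i am sad not happy", "-"),
]

# Per-class word counts.
pos_counts, neg_counts = Counter(), Counter()
for text, label in tweets:
    (pos_counts if label == "+" else neg_counts).update(text.split())

vocab = sorted(set(pos_counts) | set(neg_counts))
V = len(vocab)                      # vocabulary size
n_pos = sum(pos_counts.values())    # total words in positive tweets
n_neg = sum(neg_counts.values())    # total words in negative tweets

# lambda(w) = log( P(w|+) / P(w|-) ), both with add-one smoothing.
lambdas = {
    w: math.log(((pos_counts[w] + 1) / (n_pos + V))
                / ((neg_counts[w] + 1) / (n_neg + V)))
    for w in vocab
}

# Log prior: log of (#positive tweets / #negative tweets); 0 for this balanced corpus.
n_pos_docs = sum(1 for _, label in tweets if label == "+")
n_neg_docs = len(tweets) - n_pos_docs
log_prior = math.log(n_pos_docs / n_neg_docs)


def score(tweet: str) -> float:
    """Hypothetical helper: log prior plus the sum of lambdas of known words."""
    return log_prior + sum(lambdas.get(w, 0.0) for w in tweet.lower().split())


def predict(tweet: str) -> str:
    """Hypothetical helper: '+' if the score is positive, otherwise '-'."""
    return "+" if score(tweet) > 0 else "-"


print(predict("i am happy"))  # → +
print(predict("i am sad"))    # → -
```

A score of exactly 0 is assigned to the negative class here; that tie-breaking rule is an arbitrary choice of this sketch.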

from IPython.display import Markdown
from tabulate import tabulate
table = [["Sun",696000,1989100000],
         ["Earth",6371,5973.6],
         ["Moon",1737,73.5],
         ["Mars",3390,641.85]]
Markdown(tabulate(
  table, 
  headers=["Planet","R (km)", "mass (x 10^21 kg)"]
))

Table 1: Planets

Planet	R (km)	mass (x 10^21 kg)
Sun	696000	1.9891e+09
Earth	6371	5973.6
Moon	1737	73.5
Mars	3390	641.85

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:

@online{bochman2020,
  author = {Bochman, Oren},
  title = {Classification \& {Vector} {Spaces} - {Probability} and
    {Bayes’} {Rule}},
  date = {2020-10-23},
  url = {https://orenbochman.github.io/notes/deeplearning.ai-nlp-c1/l2-naive-bayes/code.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2020. “Classification & Vector Spaces - Probability and Bayes’ Rule.” October 23, 2020. https://orenbochman.github.io/notes/deeplearning.ai-nlp-c1/l2-naive-bayes/code.html.