```python
# imports
import re                         # regular expression library; for tokenization of words
from collections import Counter   # dict subclass for counting hashable objects
import matplotlib.pyplot as plt   # for data visualization
```
Estimated Time: 10 minutes

# Vocabulary Creation

Create a tiny vocabulary from a tiny corpus. It's time to start small!

## Imports and Data
```python
# the tiny corpus of text!
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK' # 🌈
print(text)
print('string length : ', len(text))
```

```
red pink pink blue blue yellow ORANGE BLUE BLUE PINK
string length :  52
```
## Preprocessing
```python
# convert all letters to lower case
text_lowercase = text.lower()
print(text_lowercase)
print('string length : ', len(text_lowercase))
```

```
red pink pink blue blue yellow orange blue blue pink
string length :  52
```
```python
# some regex to tokenize the string into words, returned as a list
words = re.findall(r'\w+', text_lowercase)
print(words)
print('count : ', len(words))
```

```
['red', 'pink', 'pink', 'blue', 'blue', 'yellow', 'orange', 'blue', 'blue', 'pink']
count :  10
```
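The `\w+` pattern matches runs of letters, digits, and underscores, so punctuation is silently dropped and hyphenated words split into two tokens. A small sketch of that behavior (the sample sentence is made up for illustration):

```python
import re

# hypothetical sample with punctuation, mixed case, and a hyphenated word
sample = "Red, PINK... blue-green; yellow!"
tokens = re.findall(r'\w+', sample.lower())
print(tokens)  # → ['red', 'pink', 'blue', 'green', 'yellow']
```

Note that `blue-green` becomes two tokens, because `-` is not a word character; a different pattern would be needed if hyphenated words should stay whole.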
## Create Vocabulary

Option 1: A set of distinct words from the text.
```python
# create vocab
vocab = set(words)
print(vocab)
print('count : ', len(vocab))
```

```
{'blue', 'red', 'yellow', 'orange', 'pink'}
count :  5
```
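One thing a set vocabulary makes cheap is membership testing, which is how out-of-vocabulary (OOV) words are typically detected. A minimal sketch, with a made-up query list (`green` is a hypothetical OOV token):

```python
words = ['red', 'pink', 'pink', 'blue', 'blue', 'yellow',
         'orange', 'blue', 'blue', 'pink']
vocab = set(words)

# check each query token against the vocabulary
for token in ['blue', 'green']:
    print(token, ':', 'in vocab' if token in vocab else 'OOV')
```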
## Add Information with Word Counts

Option 2: Two alternatives for including the word count as well.
```python
# create vocab including word count
counts_a = dict()
for w in words:
    counts_a[w] = counts_a.get(w, 0) + 1
print(counts_a)
print('count : ', len(counts_a))
```

```
{'red': 1, 'pink': 3, 'blue': 4, 'yellow': 1, 'orange': 1}
count :  5
```
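A common alternative to the `dict.get` pattern (not used in the lab itself) is `collections.defaultdict`, which supplies the missing-key default automatically:

```python
from collections import defaultdict

words = ['red', 'pink', 'pink', 'blue', 'blue', 'yellow',
         'orange', 'blue', 'blue', 'pink']

counts = defaultdict(int)  # missing keys default to 0
for w in words:
    counts[w] += 1
print(dict(counts))  # → {'red': 1, 'pink': 3, 'blue': 4, 'yellow': 1, 'orange': 1}
```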
```python
# create vocab including word count, using collections.Counter
counts_b = Counter(words)
print(counts_b)
print('count : ', len(counts_b))
```

```
Counter({'blue': 4, 'pink': 3, 'red': 1, 'yellow': 1, 'orange': 1})
count :  5
```
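`Counter` also exposes `most_common`, which returns `(word, count)` pairs sorted by descending count — handy when a vocabulary needs to be truncated to its top-k words:

```python
from collections import Counter

words = ['red', 'pink', 'pink', 'blue', 'blue', 'yellow',
         'orange', 'blue', 'blue', 'pink']
counts_b = Counter(words)

# the two most frequent words
print(counts_b.most_common(2))  # → [('blue', 4), ('pink', 3)]
```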
## Ungraded Exercise

Note that `counts_b`, above, returned by `collections.Counter`, is sorted by word count.

Can we modify the tiny corpus of text so that a new color appears between `pink` and `red` in `counts_b`? Do we need to run all the cells again, or just specific ones?
```python
print('counts_b : ', counts_b)
print('count : ', len(counts_b))
```

```
counts_b :  Counter({'blue': 4, 'pink': 3, 'red': 1, 'yellow': 1, 'orange': 1})
count :  5
```
Expected Outcome:

```
counts_b :  Counter({'blue': 4, 'pink': 3, 'your_new_color_here': 2, 'red': 1, 'yellow': 1, 'orange': 1})
count :  6
```
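One possible answer (the new color `green` here is my own choice; any color works): append it to the corpus twice, so its count of 2 lands between `pink` (3) and `red` (1). Only the cells from the corpus definition onward need rerunning, since the later cells all derive from `text`:

```python
import re
from collections import Counter

# append a hypothetical new color twice to the tiny corpus
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK' + ' green GREEN'
words = re.findall(r'\w+', text.lower())
counts_b = Counter(words)
print(counts_b)
# → Counter({'blue': 4, 'pink': 3, 'green': 2, 'red': 1, 'yellow': 1, 'orange': 1})
```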
## Summary

This is a tiny example, but the methodology scales very well. In the assignment we will create a large vocabulary of thousands of words, from a corpus of tens of thousands of words, yet the mechanics are exactly the same. The only extra things to pay attention to are run time, memory management, and the vocab data structure. So the choice between the approaches used in the `counts_a` and `counts_b` code blocks above will be important.
## Citation

```bibtex
@online{bochman2020,
  author = {Bochman, Oren},
  title = {Building the Vocabulary},
  date = {2020-10-16},
  url = {https://orenbochman.github.io/notes-nlp/notes/c2w1/lab01.html},
  langid = {en}
}
```