# data
= 'dearz' # 🦌 word
Estimated Time: 20 minutes
Create a list of candidate strings by applying an edit operation
Imports and Data
Splits
Find all the ways we can split a word into 2 parts !
# splits with a loop
= []
splits_a for i in range(len(word)+1):
splits_a.append([word[:i],word[i:]])
for i in splits_a:
print(i)
['', 'dearz']
['d', 'earz']
['de', 'arz']
['dea', 'rz']
['dear', 'z']
['dearz', '']
# same splits, done using a list comprehension
= [(word[:i], word[i:]) for i in range(len(word) + 1)]
splits_b
for i in splits_b:
print(i)
('', 'dearz')
('d', 'earz')
('de', 'arz')
('dea', 'rz')
('dear', 'z')
('dearz', '')
Delete Edit
Delete a letter from each string in the splits
list.
What this does is effectivly delete each possible letter from the original word being edited.
# deletes with a loop
= splits_a
splits = []
deletes
print('word : ', word)
for L,R in splits:
if R:
print(L + R[1:], ' <-- delete ', R[0])
word : dearz
earz <-- delete d
darz <-- delete e
derz <-- delete a
deaz <-- delete r
dear <-- delete z
It’s worth taking a closer look at how this is excecuting a ‘delete’.
Taking the first item from the splits
list :
# breaking it down
print('word : ', word)
= splits[0]
one_split print('first item from the splits list : ', one_split)
= one_split[0]
L = one_split[1]
R print('L : ', L)
print('R : ', R)
print('*** now implicit delete by excluding the leading letter ***')
print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0])
word : dearz
first item from the splits list : ['', 'dearz']
L :
R : dearz
*** now implicit delete by excluding the leading letter ***
L + R[1:] : earz <-- delete d
So the end result transforms ‘dearz’ to ‘earz’ by deleting the first character.
And we use a loop (code block above) or a list comprehension (code block below) to do this for the entire splits
list.
# deletes with a list comprehension
= splits_a
splits = [L + R[1:] for L, R in splits if R]
deletes
print(deletes)
print('*** which is the same as ***')
for i in deletes:
print(i)
['earz', 'darz', 'derz', 'deaz', 'dear']
*** which is the same as ***
earz
darz
derz
deaz
dear
Ungraded Exercise
We now have a list of candidate strings created after performing a delete edit.
Next step will be to filter this list for candidate words found in a vocabulary.
Given the example vocab below, can we think of a way to create a list of candidate words ?
Remember, we already have a list of candidate strings, some of which are certainly not actual words we might find in your vocabulary !
So from the above list earz, darz, derz, deaz, dear.
You’re really only interested in dear.
= ['dean','deer','dear','fries','and','coke']
vocab = list(deletes)
edits
print('vocab : ', vocab)
print('edits : ', edits)
=[]
candidates
### START CODE HERE ###
= set(vocab).intersection(edits) # hint: 'set.intersection'
candidates ### END CODE HERE ###
print('candidate words : ', candidates)
vocab : ['dean', 'deer', 'dear', 'fries', 'and', 'coke']
edits : ['earz', 'darz', 'derz', 'deaz', 'dear']
candidate words : {'dear'}
Expected Outcome:
vocab : [‘dean’, ‘deer’, ‘dear’, ‘fries’, ‘and’, ‘coke’]
edits : [‘earz’, ‘darz’, ‘derz’, ‘deaz’, ‘dear’]
candidate words : {‘dear’}
Summary
You’ve unpacked an integral part of the assignment by breaking down splits and edits, specifically looking at deletes here.
Implementation of the other edit types (insert, replace, switch) follows a similar methodology and should now feel somewhat familiar when we see them.
This bit of the code isn’t as intuitive as other sections, so well done!
We should now feel confident facing some of the more technical parts of the assignment at the end of the week.
Citation
@online{bochman2020,
author = {Bochman, Oren},
title = {Candidates from {String} {Edits}},
date = {2020-10-17},
url = {https://orenbochman.github.io/notes-nlp/notes/c2w1/lab02.html},
langid = {en}
}