Candidates from String Edits – NLP Course Notes & Research

course banner

Estimated Time: 20 minutes

Create a list of candidate strings by applying an edit operation

Imports and Data

# data
word = 'dearz' # 🦌

Splits

Find all the ways we can split a word into 2 parts !

# splits with a loop
splits_a = []
for i in range(len(word)+1):
    splits_a.append([word[:i],word[i:]])

for i in splits_a:
    print(i)

['', 'dearz']
['d', 'earz']
['de', 'arz']
['dea', 'rz']
['dear', 'z']
['dearz', '']

# same splits, done using a list comprehension
splits_b = [(word[:i], word[i:]) for i in range(len(word) + 1)]

for i in splits_b:
    print(i)

('', 'dearz')
('d', 'earz')
('de', 'arz')
('dea', 'rz')
('dear', 'z')
('dearz', '')

Delete Edit

Delete a letter from each string in the splits list.

What this does is effectivly delete each possible letter from the original word being edited.

# deletes with a loop
splits = splits_a
deletes = []

print('word : ', word)
for L,R in splits:
    if R:
        print(L + R[1:], ' <-- delete ', R[0])

word :  dearz
earz  <-- delete  d
darz  <-- delete  e
derz  <-- delete  a
deaz  <-- delete  r
dear  <-- delete  z

It’s worth taking a closer look at how this is excecuting a ‘delete’.

Taking the first item from the splits list :

# breaking it down
print('word : ', word)
one_split = splits[0]
print('first item from the splits list : ', one_split)
L = one_split[0]
R = one_split[1]
print('L : ', L)
print('R : ', R)
print('*** now implicit delete by excluding the leading letter ***')
print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0])

word :  dearz
first item from the splits list :  ['', 'dearz']
L :  
R :  dearz
*** now implicit delete by excluding the leading letter ***
L + R[1:] :  earz  <-- delete  d

So the end result transforms ‘dearz’ to ‘earz’ by deleting the first character.

And we use a loop (code block above) or a list comprehension (code block below) to do this for the entire splits list.

# deletes with a list comprehension
splits = splits_a
deletes = [L + R[1:] for L, R in splits if R]

print(deletes)
print('*** which is the same as ***')
for i in deletes:
    print(i)

['earz', 'darz', 'derz', 'deaz', 'dear']
*** which is the same as ***
earz
darz
derz
deaz
dear

Ungraded Exercise

We now have a list of candidate strings created after performing a delete edit.
Next step will be to filter this list for candidate words found in a vocabulary.
Given the example vocab below, can we think of a way to create a list of candidate words ?
Remember, we already have a list of candidate strings, some of which are certainly not actual words we might find in your vocabulary !

So from the above list earz, darz, derz, deaz, dear.
You’re really only interested in dear.

vocab = ['dean','deer','dear','fries','and','coke']
edits = list(deletes)

print('vocab : ', vocab)
print('edits : ', edits)

candidates=[]

### START CODE HERE ###
candidates = set(vocab).intersection(edits)  # hint: 'set.intersection'
### END CODE HERE ###

print('candidate words : ', candidates)

vocab :  ['dean', 'deer', 'dear', 'fries', 'and', 'coke']
edits :  ['earz', 'darz', 'derz', 'deaz', 'dear']
candidate words :  {'dear'}

Expected Outcome:

vocab : [‘dean’, ‘deer’, ‘dear’, ‘fries’, ‘and’, ‘coke’]

edits : [‘earz’, ‘darz’, ‘derz’, ‘deaz’, ‘dear’]

candidate words : {‘dear’}

Summary

You’ve unpacked an integral part of the assignment by breaking down splits and edits, specifically looking at deletes here.

Implementation of the other edit types (insert, replace, switch) follows a similar methodology and should now feel somewhat familiar when we see them.

This bit of the code isn’t as intuitive as other sections, so well done!

We should now feel confident facing some of the more technical parts of the assignment at the end of the week.

Citation

BibTeX citation:

@online{bochman2020,
  author = {Bochman, Oren},
  title = {Candidates from {String} {Edits}},
  date = {2020-10-17},
  url = {https://orenbochman.github.io/notes-nlp/notes/c2w1/lab02.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2020. “Candidates from String Edits.” October 17, 2020. https://orenbochman.github.io/notes-nlp/notes/c2w1/lab02.html.