SentencePiece and Byte Pair Encoding – NLP Course Notes & Research

course banner

Introduction to Tokenization

In order to process text in neural network models, it is first required to encode text as numbers with ids (such as the embedding vectors we’ve been using in the previous assignments), since the tensor operations act on numbers. Finally, if the output of the network are words, it is required to decode the predicted tokens ids back to text.

To encode text, the first decision that has to be made is to what level of granularity are we going to consider the text? Because ultimately, from these tokens, features are going to be created about them. Many different experiments have been carried out using words, morphological units, phonemic units, characters. For example,

Tokens are tricky. (raw text)
Tokens are tricky . (words)
Token s _ are _ trick _ y . (morphemes)
t oʊ k ə n z _ ɑː _ ˈt r ɪ k i. (phonemes, for STT)
T o k e n s _ a r e _ t r i c k y . (character)

But how to identify these units, such as words, are largely determined by the language they come from. For example, in many European languages a space is used to separate words, while in some Asian languages there are no spaces between words. Compare English and Mandarin.

Tokens are tricky. (original sentence)
令牌很棘手 (Mandarin)
Lìng pái hěn jí shǒu (pinyin)
令牌很棘手 (Mandarin with spaces)

So, the ability to tokenize, i.e. split text into meaningful fundamental units is not always straight-forward.

Also, there are practical issues of how large our vocabulary of words, vocab_size, should be, considering memory limitations vs. coverage. A compromise between the finest-grained models employing characters which can be memory and more computationally efficient subword units such as n-grams or larger units need to be made.

In SentencePiece unicode characters are grouped together using either a unigram language model (used in this week’s assignment) or BPE, byte-pair encoding. We will discuss BPE, since BERT and many of its variant uses a modified version of BPE and its pseudocode is easy to implement and understand… hopefully!

SentencePiece Preprocessing

NFKC Normalization

Unsurprisingly, even using unicode to initially tokenize text can be ambiguous, e.g.,

eaccent = '\u00E9'
e_accent = '\u0065\u0301'
print(f'{eaccent} = {e_accent} : {eaccent == e_accent}')

é = é : False

SentencePiece uses the Unicode standard Normalization form, NFKC, so this isn’t an issue. Looking at our example from above again with normalization:

from unicodedata import normalize

norm_eaccent = normalize('NFKC', '\u00E9')
norm_e_accent = normalize('NFKC', '\u0065\u0301')
print(f'{norm_eaccent} = {norm_e_accent} : {norm_eaccent == norm_e_accent}')

é = é : True

Normalization has actually changed the unicode code point (unicode unique id) for one of these two characters.

def get_hex_encoding(s):
    return ' '.join(hex(ord(c)) for c in s)

def print_string_and_encoding(s):
    print(f'{s} : {get_hex_encoding(s)}')

for s in [eaccent, e_accent, norm_eaccent, norm_e_accent]:
    print_string_and_encoding(s)

é : 0xe9
é : 0x65 0x301
é : 0xe9
é : 0xe9

This normalization has other side effects which may be considered useful such as converting curly quotes “ to ” their ASCII equivalent. (Although we now lose directionality of the quote…)

Lossless Tokenization^*

SentencePiece also ensures that when you tokenize your data and detokenize your data the original position of white space is preserved. (However, tabs and newlines are converted to spaces, please try this experiment yourself later below.)

To ensure this lossless tokenization it replaces white space with _ (U+2581). So that a simple join of the replace underscores with spaces can restore the white space, even if there are consecutives symbols. But remember first to normalize and then replace spaces with _ (U+2581). As the following example shows.

s = 'Tokenization is hard.'
s_ = s.replace(' ', '\u2581')
s_n = normalize('NFKC', 'Tokenization is hard.')

print(get_hex_encoding(s))
print(get_hex_encoding(s_))
print(get_hex_encoding(s_n))

0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x2581 0x69 0x73 0x2581 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e

So the special unicode underscore was replaced by the ASCII unicode. Reversing the order, we see that the special unicode underscore was retained.

s = 'Tokenization is hard.'
sn = normalize('NFKC', 'Tokenization is hard.')
sn_ = s.replace(' ', '\u2581')

print(get_hex_encoding(s))
print(get_hex_encoding(sn))
print(get_hex_encoding(sn_))

0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x2581 0x69 0x73 0x2581 0x68 0x61 0x72 0x64 0x2e

BPE Algorithm

Now that we have discussed the preprocessing that SentencePiece performs we will go get our data, preprocess, and apply the BPE algorithm. We will show how this reproduces the tokenization produced by training SentencePiece on our example dataset (from this week’s assignment).

Preparing our Data

First, we get our Squad data and process as above.

import ast

def convert_json_examples_to_text(filepath):
    example_jsons = list(map(ast.literal_eval, open(filepath))) # Read in the json from the example file
    texts = [example_json['text'].decode('utf-8') for example_json in example_jsons] # Decode the byte sequences
    text = '\n\n'.join(texts)       # Separate different articles by two newlines
    text = normalize('NFKC', text)  # Normalize the text

    with open('example.txt', 'w') as fw:
        fw.write(text)
    
    return text

text = convert_json_examples_to_text('data.txt')
print(text[:900])

Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.
The cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.

Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
I've got a 500gb internal drive and a 240gb SSD.
When trying to restore using di

In the algorithm the vocab variable is actually a frequency dictionary of the words. Further, those words have been prepended with an underscore to indicate that they are the beginning of a word. Finally, the characters have been delimited by spaces so that the BPE algorithm can group the most common characters together in the dictionary in a greedy fashion. We will see how that is exactly done shortly.

from collections import Counter

vocab = Counter(['\u2581' + word for word in text.split()])
vocab = {' '.join([l for l in word]): freq for word, freq in vocab.items()}

def show_vocab(vocab, end='\n', limit=20):
    shown = 0
    for word, freq in vocab.items():
        print(f'{word}: {freq}', end=end)
        shown +=1
        if shown > limit:
            break

show_vocab(vocab)

▁ B e g i n n e r s: 1
▁ B B Q: 3
▁ C l a s s: 2
▁ T a k i n g: 1
▁ P l a c e: 1
▁ i n: 15
▁ M i s s o u l a !: 1
▁ D o: 1
▁ y o u: 13
▁ w a n t: 1
▁ t o: 33
▁ g e t: 2
▁ b e t t e r: 2
▁ a t: 1
▁ m a k i n g: 2
▁ d e l i c i o u s: 1
▁ B B Q ?: 1
▁ Y o u: 1
▁ w i l l: 6
▁ h a v e: 4
▁ t h e: 31

We check the size of the vocabulary (frequency dictionary) because this is the one hyperparameter that BPE depends on crucially on how far it breaks up a word into SentencePieces. It turns out that for our trained model on our small dataset that 60% of 455 merges of the most frequent characters need to be done to reproduce the upperlimit of a 32K vocab_size over the entire corpus of examples.

print(f'Total number of unique words: {len(vocab)}')
print(f'Number of merges required to reproduce SentencePiece training on the whole corpus: {int(0.60*len(vocab))}')

Total number of unique words: 455
Number of merges required to reproduce SentencePiece training on the whole corpus: 273

BPE Algorithm

Directly from the BPE paper we have the following algorithm.

To understand what’s going on first take a look at the third function get_sentence_piece_vocab. It takes in the current vocab word-frequency dictionary and the fraction of the total vocab_size to merge characters in the words of the dictionary, num_merges times. Then for each merge operation it get_stats on how many of each pair of character sequences there are. It gets the most frequent pair of symbols as the best pair. Then it merges those pair of symbols (removes the space between them) in each word in the vocab that contains this best (= pair). Consquently, merge_vocab creates a new vocab, v_out. This process is repeated num_merges times and the result is the set of SentencePieces (keys of the final sp_vocab).

Please feel free to skip the below if the above description was enough.

In a little more detail then, we can see in get_stats we initially create a list of bigram frequencies (two character sequence) from our vocabulary. Later, this may include (trigrams, quadgrams, etc.). Note that the key of the pairs frequency dictionary is actually a 2-tuple, which is just shorthand notation for a pair.

In merge_vocab we take in an individual pair (of character sequences, note this is the most frequency best pair) and the current vocab as v_in. We create a new vocab, v_out, from the old by joining together the characters in the pair (removing the space), if they are present in the a word of the dictionary. Warning: the expression (?<!\S) means that either whitespace character follows before the bigram or there is nothing before (beginning of word) the bigram, similarly for (?!\S) for preceding whitespace or end of word.

import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_sentence_piece_vocab(vocab, frac_merges=0.60):
    sp_vocab = vocab.copy()
    num_merges = int(len(sp_vocab)*frac_merges)
    
    for i in range(num_merges):
        pairs = get_stats(sp_vocab)
        best = max(pairs, key=pairs.get)
        sp_vocab = merge_vocab(best, sp_vocab)

    return sp_vocab

sp_vocab = get_sentence_piece_vocab(vocab)
show_vocab(sp_vocab)

▁B e g in n ers: 1
▁BBQ: 3
▁Cl ass: 2
▁T ak ing: 1
▁P la ce: 1
▁in: 15
▁M is s ou la !: 1
▁D o: 1
▁you: 13
▁w an t: 1
▁to: 33
▁g et: 2
▁be t ter: 2
▁a t: 1
▁mak ing: 2
▁d e l ic i ou s: 1
▁BBQ ?: 1
▁ Y ou: 1
▁will: 6
▁have: 4
▁the: 31

Train SentencePiece BPE Tokenizer on Example Data

Explore SentencePiece Model

First let us explore the SentencePiece model provided with this week’s assignment. Remember you can always use Python’s built in help command to see the documentation for any object or method.

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file='sentencepiece.model')

help(sp)

Help on SentencePieceProcessor in module sentencepiece object:

class SentencePieceProcessor(builtins.object)
 |  SentencePieceProcessor(model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, emit_unk_piece=False, enable_sampling=False, nbest_size=-1, alpha=0.1, num_threads=-1)
 |  
 |  Methods defined here:
 |  
 |  CalculateEntropy(self, input, alpha, num_threads=None)
 |      Calculate sentence entropy
 |  
 |  Decode(self, input, out_type=<class 'str'>, num_threads=None)
 |      Decode processed id or token sequences.
 |      
 |      Args:
 |        out_type: output type. str, bytes or 'serialized_proto' or 'immutable_proto' (Default = str)
 |        num_threads: the number of threads used in the batch processing (Default = -1).
 |  
 |  DecodeIds(self, input, out_type=<class 'str'>, **kwargs)
 |  
 |  DecodeIdsAsImmutableProto(self, input, out_type='immutable_proto', **kwargs)
 |  
 |  DecodeIdsAsSerializedProto(self, input, out_type='serialized_proto', **kwargs)
 |  
 |  DecodePieces(self, input, out_type=<class 'str'>, **kwargs)
 |  
 |  DecodePiecesAsImmutableProto(self, input, out_type='immutable_proto', **kwargs)
 |  
 |  DecodePiecesAsSerializedProto(self, input, out_type='serialized_proto', **kwargs)
 |  
 |  Detokenize = Decode(self, input, out_type=<class 'str'>, num_threads=None)
 |  
 |  Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, enable_sampling=None, nbest_size=None, alpha=None, num_threads=None)
 |      Encode text input to segmented ids or tokens.
 |      
 |      Args:
 |      input: input string. accepsts list of string.
 |      out_type: output type. int or str.
 |      add_bos: Add <s> to the result (Default = false)
 |      add_eos: Add </s> to the result (Default = false) <s>/</s> is added after
 |               reversing (if enabled).
 |      reverse: Reverses the tokenized sequence (Default = false)
 |      emit_unk_piece: Emits the unk literal string (Default = false)
 |      nbest_size: sampling parameters for unigram. Invalid in BPE-Dropout.
 |                  nbest_size = {0,1}: No sampling is performed.
 |                  nbest_size > 1: samples from the nbest_size results.
 |                  nbest_size < 0: assuming that nbest_size is infinite and samples
 |                  from the all hypothesis (lattice) using
 |                  forward-filtering-and-backward-sampling algorithm.
 |      alpha: Soothing parameter for unigram sampling, and merge probability for
 |             BPE-dropout (probablity 'p' in BPE-dropout paper).
 |      num_threads: the number of threads used in the batch processing (Default = -1).
 |  
 |  EncodeAsIds(self, input, **kwargs)
 |  
 |  EncodeAsImmutableProto(self, input, **kwargs)
 |  
 |  EncodeAsPieces(self, input, **kwargs)
 |  
 |  EncodeAsSerializedProto(self, input, **kwargs)
 |  
 |  GetPieceSize(self)
 |  
 |  GetScore = _batched_func(self, arg)
 |  
 |  IdToPiece = _batched_func(self, arg)
 |  
 |  Init(self, model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, emit_unk_piece=False, enable_sampling=False, nbest_size=-1, alpha=0.1, num_threads=-1)
 |      Initialzie sentencepieceProcessor.
 |      
 |      Args:
 |        model_file: The sentencepiece model file path.
 |        model_proto: The sentencepiece model serialized proto.
 |        out_type: output type. int or str.
 |        add_bos: Add <s> to the result (Default = false)
 |        add_eos: Add </s> to the result (Default = false) <s>/</s> is added after
 |          reversing (if enabled).
 |        reverse: Reverses the tokenized sequence (Default = false)
 |        emit_unk_piece: Emits the unk literal string (Default = false)
 |        nbest_size: sampling parameters for unigram. Invalid in BPE-Dropout.
 |                    nbest_size = {0,1}: No sampling is performed.
 |                    nbest_size > 1: samples from the nbest_size results.
 |                    nbest_size < 0: assuming that nbest_size is infinite and samples
 |                      from the all hypothesis (lattice) using
 |                      forward-filtering-and-backward-sampling algorithm.
 |        alpha: Soothing parameter for unigram sampling, and dropout probability of
 |               merge operations for BPE-dropout.
 |        num_threads: number of threads in batch processing (Default = -1, auto-detected)
 |  
 |  IsByte = _batched_func(self, arg)
 |  
 |  IsControl = _batched_func(self, arg)
 |  
 |  IsUnknown = _batched_func(self, arg)
 |  
 |  IsUnused = _batched_func(self, arg)
 |  
 |  Load(self, model_file=None, model_proto=None)
 |      Overwride SentencePieceProcessor.Load to support both model_file and model_proto.
 |      
 |      Args:
 |        model_file: The sentencepiece model file path.
 |        model_proto: The sentencepiece model serialized proto. Either `model_file`
 |          or `model_proto` must be set.
 |  
 |  LoadFromFile(self, arg)
 |  
 |  LoadFromSerializedProto(self, serialized)
 |  
 |  LoadVocabulary(self, filename, threshold)
 |  
 |  NBestEncode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, nbest_size=None)
 |      NBestEncode text input to segmented ids or tokens.
 |      
 |      Args:
 |      input: input string. accepsts list of string.
 |      out_type: output type. int or str.
 |      add_bos: Add <s> to the result (Default = false)
 |      add_eos: Add </s> to the result (Default = false) <s>/</s> is added after reversing (if enabled).
 |      reverse: Reverses the tokenized sequence (Default = false)
 |      emit_unk_piece: Emits the unk literal string (Default = false)
 |      nbest_size: nbest size
 |  
 |  NBestEncodeAsIds(self, input, nbest_size=None, **kwargs)
 |  
 |  NBestEncodeAsImmutableProto(self, input, nbest_size=None, **kwargs)
 |  
 |  NBestEncodeAsPieces(self, input, nbest_size=None, **kwargs)
 |  
 |  NBestEncodeAsSerializedProto(self, input, nbest_size=None, **kwargs)
 |  
 |  Normalize(self, input, with_offsets=None)
 |  
 |  OverrideNormalizerSpec(self, **kwargs)
 |  
 |  PieceToId = _batched_func(self, arg)
 |  
 |  ResetVocabulary(self)
 |  
 |  SampleEncodeAndScore(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, num_samples=None, alpha=None, wor=None, include_best=None)
 |      SampleEncodeAndScore text input to segmented ids or tokens.
 |      
 |      Args:
 |      input: input string. accepsts list of string.
 |      out_type: output type. int or str or 'serialized_proto' or 'immutable_proto'
 |      add_bos: Add <s> to the result (Default = false)
 |      add_eos: Add </s> to the result (Default = false) <s>/</s> is added after reversing (if enabled).
 |      reverse: Reverses the tokenized sequence (Default = false)
 |      emit_unk_piece: Emits the unk literal string (Default = false)
 |      num_samples: How many samples to return (Default = 1)
 |      alpha: inverse temperature for sampling
 |      wor: whether to sample without replacement (Default = false)
 |      include_best: whether to include the best tokenization, requires wor=True (Default = false)
 |  
 |  SampleEncodeAndScoreAsIds(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAndScoreAsImmutableProto(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAndScoreAsPieces(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAndScoreAsSerializedProto(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAsIds(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAsImmutableProto(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAsPieces(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  SampleEncodeAsSerializedProto(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  SetDecodeExtraOptions(self, extra_option)
 |  
 |  SetEncodeExtraOptions(self, extra_option)
 |  
 |  SetVocabulary(self, valid_vocab)
 |  
 |  Tokenize = Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, enable_sampling=None, nbest_size=None, alpha=None, num_threads=None)
 |  
 |  __getitem__(self, piece)
 |  
 |  __getstate__(self)
 |  
 |  __init__ = Init(self, model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, emit_unk_piece=False, enable_sampling=False, nbest_size=-1, alpha=0.1, num_threads=-1)
 |  
 |  __len__(self)
 |  
 |  __repr__ = _swig_repr(self)
 |  
 |  __setstate__(self, serialized_model_proto)
 |  
 |  bos_id(self)
 |  
 |  calculate_entropy = CalculateEntropy(self, input, alpha, num_threads=None)
 |  
 |  decode = Decode(self, input, out_type=<class 'str'>, num_threads=None)
 |  
 |  decode_ids = DecodeIds(self, input, out_type=<class 'str'>, **kwargs)
 |  
 |  decode_ids_as_immutable_proto = DecodeIdsAsImmutableProto(self, input, out_type='immutable_proto', **kwargs)
 |  
 |  decode_ids_as_serialized_proto = DecodeIdsAsSerializedProto(self, input, out_type='serialized_proto', **kwargs)
 |  
 |  decode_pieces = DecodePieces(self, input, out_type=<class 'str'>, **kwargs)
 |  
 |  decode_pieces_as_immutable_proto = DecodePiecesAsImmutableProto(self, input, out_type='immutable_proto', **kwargs)
 |  
 |  decode_pieces_as_serialized_proto = DecodePiecesAsSerializedProto(self, input, out_type='serialized_proto', **kwargs)
 |  
 |  detokenize = Decode(self, input, out_type=<class 'str'>, num_threads=None)
 |  
 |  encode = Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, enable_sampling=None, nbest_size=None, alpha=None, num_threads=None)
 |  
 |  encode_as_ids = EncodeAsIds(self, input, **kwargs)
 |  
 |  encode_as_immutable_proto = EncodeAsImmutableProto(self, input, **kwargs)
 |  
 |  encode_as_pieces = EncodeAsPieces(self, input, **kwargs)
 |  
 |  encode_as_serialized_proto = EncodeAsSerializedProto(self, input, **kwargs)
 |  
 |  eos_id(self)
 |  
 |  get_piece_size = GetPieceSize(self)
 |  
 |  get_score = _batched_func(self, arg)
 |  
 |  id_to_piece = _batched_func(self, arg)
 |  
 |  init = Init(self, model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, emit_unk_piece=False, enable_sampling=False, nbest_size=-1, alpha=0.1, num_threads=-1)
 |  
 |  is_byte = _batched_func(self, arg)
 |  
 |  is_control = _batched_func(self, arg)
 |  
 |  is_unknown = _batched_func(self, arg)
 |  
 |  is_unused = _batched_func(self, arg)
 |  
 |  load = Load(self, model_file=None, model_proto=None)
 |  
 |  load_from_file = LoadFromFile(self, arg)
 |  
 |  load_from_serialized_proto = LoadFromSerializedProto(self, serialized)
 |  
 |  load_vocabulary = LoadVocabulary(self, filename, threshold)
 |  
 |  nbest_encode = NBestEncode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, nbest_size=None)
 |  
 |  nbest_encode_as_ids = NBestEncodeAsIds(self, input, nbest_size=None, **kwargs)
 |  
 |  nbest_encode_as_immutable_proto = NBestEncodeAsImmutableProto(self, input, nbest_size=None, **kwargs)
 |  
 |  nbest_encode_as_pieces = NBestEncodeAsPieces(self, input, nbest_size=None, **kwargs)
 |  
 |  nbest_encode_as_serialized_proto = NBestEncodeAsSerializedProto(self, input, nbest_size=None, **kwargs)
 |  
 |  normalize = Normalize(self, input, with_offsets=None)
 |  
 |  override_normalizer_spec = OverrideNormalizerSpec(self, **kwargs)
 |  
 |  pad_id(self)
 |  
 |  piece_size(self)
 |  
 |  piece_to_id = _batched_func(self, arg)
 |  
 |  reset_vocabulary = ResetVocabulary(self)
 |  
 |  sample_encode_and_score = SampleEncodeAndScore(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, num_samples=None, alpha=None, wor=None, include_best=None)
 |  
 |  sample_encode_and_score_as_ids = SampleEncodeAndScoreAsIds(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  sample_encode_and_score_as_immutable_proto = SampleEncodeAndScoreAsImmutableProto(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  sample_encode_and_score_as_pieces = SampleEncodeAndScoreAsPieces(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  sample_encode_and_score_as_serialized_proto = SampleEncodeAndScoreAsSerializedProto(self, input, num_samples=None, alpha=None, **kwargs)
 |  
 |  sample_encode_as_ids = SampleEncodeAsIds(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  sample_encode_as_immutable_proto = SampleEncodeAsImmutableProto(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  sample_encode_as_pieces = SampleEncodeAsPieces(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  sample_encode_as_serialized_proto = SampleEncodeAsSerializedProto(self, input, nbest_size=None, alpha=None, **kwargs)
 |  
 |  serialized_model_proto(self)
 |  
 |  set_decode_extra_options = SetDecodeExtraOptions(self, extra_option)
 |  
 |  set_encode_extra_options = SetEncodeExtraOptions(self, extra_option)
 |  
 |  set_vocabulary = SetVocabulary(self, valid_vocab)
 |  
 |  tokenize = Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, emit_unk_piece=None, enable_sampling=None, nbest_size=None, alpha=None, num_threads=None)
 |  
 |  unk_id(self)
 |  
 |  vocab_size(self)
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  __swig_destroy__ = delete_SentencePieceProcessor(...)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  thisown
 |      The membership flag

Let’s work with the first sentence of our example text.

s0 = 'Beginners BBQ Class Taking Place in Missoula!'

# encode: text => id
print(sp.encode_as_pieces(s0))
print(sp.encode_as_ids(s0))

# decode: id => text
print(sp.decode_pieces(sp.encode_as_pieces(s0)))
print(sp.decode_ids([12847, 277]))

['▁Beginn', 'ers', '▁BBQ', '▁Class', '▁', 'Taking', '▁Place', '▁in', '▁Miss', 'oul', 'a', '!']
[12847, 277, 15068, 4501, 3, 12297, 3399, 16, 5964, 7115, 9, 55]
Beginners BBQ Class Taking Place in Missoula!
Beginners

Notice how SentencePiece breaks the words into seemingly odd parts, but we’ve seen something similar from our work with BPE. But how close were we to this model trained on the whole corpus of examles with a vocab_size of 32,000 instead of 455? Here you can also test what happens to white space, like ‘’.

But first let us note that SentencePiece encodes the SentencePieces, the tokens, and has reserved some of the ids as can be seen in this week’s assignment.

uid = 15068
spiece = "\u2581BBQ"
unknown = "__MUST_BE_UNKNOWN__"

# id <=> piece conversion
print(f'SentencePiece for ID {uid}: {sp.id_to_piece(uid)}')
print(f'ID for Sentence Piece {spiece}: {sp.piece_to_id(spiece)}')

# returns 0 for unknown tokens (we can change the id for UNK)
print(f'ID for unknown text {unknown}: {sp.piece_to_id(unknown)}')

SentencePiece for ID 15068: ▁BBQ
ID for Sentence Piece ▁BBQ: 15068
ID for unknown text __MUST_BE_UNKNOWN__: 2

print(f'Beginning of sentence id: {sp.bos_id()}')
print(f'Pad id: {sp.pad_id()}')
print(f'End of sentence id: {sp.eos_id()}')
print(f'Unknown id: {sp.unk_id()}')
print(f'Vocab size: {sp.vocab_size()}')

Beginning of sentence id: -1
Pad id: 0
End of sentence id: 1
Unknown id: 2
Vocab size: 32000

We can also check what are the ids for the first part and last part of the vocabulary.

print('\nId\tSentP\tControl?')
print('------------------------')
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for uid in range(10):
    print(uid, sp.id_to_piece(uid), sp.is_control(uid), sep='\t')
    
# for uid in range(sp.vocab_size()-10,sp.vocab_size()):
#     print(uid, sp.id_to_piece(uid), sp.is_control(uid), sep='\t')


Id  SentP   Control?
------------------------
0   <pad>   True
1   </s>    True
2   <unk>   False
3   ▁   False
4   X   False
5   .   False
6   ,   False
7   s   False
8   ▁the    False
9   a   False

Train SentencePiece BPE model with our example.txt

Finally, let’s train our own BPE model directly from the SentencePiece library and compare it to the results of our implemention of the algorithm from the BPE paper itself.

spm.SentencePieceTrainer.train('--input=example.txt --model_prefix=example_bpe --vocab_size=450 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('example_bpe.model')

print('*** BPE ***')
print(sp_bpe.encode_as_pieces(s0))

sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=example.txt --model_prefix=example_bpe --vocab_size=450 --model_type=bpe
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: example.txt
  input_format: 
  model_prefix: example_bpe
  model_type: BPE
  vocab_size: 450
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(353) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(185) LOG(INFO) Loading corpus: example.txt
trainer_interface.cc(409) LOG(INFO) Loaded all 26 sentences
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(430) LOG(INFO) Normalizing sentences...
trainer_interface.cc(539) LOG(INFO) all chars count=4533
trainer_interface.cc(550) LOG(INFO) Done: 99.9559% characters are covered.
trainer_interface.cc(560) LOG(INFO) Alphabet size=73
trainer_interface.cc(561) LOG(INFO) Final character coverage=0.999559
trainer_interface.cc(592) LOG(INFO) Done! preprocessed 26 sentences.
trainer_interface.cc(598) LOG(INFO) Tokenizing input sentences with whitespace: 26
trainer_interface.cc(609) LOG(INFO) Done! 455
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=99 min_freq=1
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=31 size=20 all=732 active=658 piece=▁w
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=16 size=40 all=937 active=863 piece=ch
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=11 size=60 all=1014 active=940 piece=▁u
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=8 size=80 all=1110 active=1036 piece=me
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=6 size=100 all=1166 active=1092 piece=la
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=6 min_freq=0
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=5 size=120 all=1217 active=1042 piece=SD
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=5 size=140 all=1272 active=1097 piece=▁bu
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=5 size=160 all=1288 active=1113 piece=▁site
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=4 size=180 all=1315 active=1140 piece=ter
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=4 size=200 all=1330 active=1155 piece=asure
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=4 min_freq=0
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=3 size=220 all=1339 active=1008 piece=ge
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=3 size=240 all=1371 active=1040 piece=▁sh
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=3 size=260 all=1384 active=1053 piece=▁cost
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=280 all=1391 active=1060 piece=de
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=300 all=1405 active=1074 piece=000
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=2 min_freq=0
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=320 all=1427 active=1021 piece=▁GB
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=340 all=1438 active=1032 piece=last
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=360 all=1441 active=1035 piece=▁let
trainer_interface.cc(687) LOG(INFO) Saving model: example_bpe.model
trainer_interface.cc(699) LOG(INFO) Saving vocabs: example_bpe.vocab

True

*** BPE ***
['▁B', 'e', 'ginn', 'ers', '▁BBQ', '▁Cl', 'ass', '▁T', 'ak', 'ing', '▁P', 'la', 'ce', '▁in', '▁M', 'is', 's', 'ou', 'la', '!']

show_vocab(sp_vocab, end = ', ')

▁B e g in n ers: 1, ▁BBQ: 3, ▁Cl ass: 2, ▁T ak ing: 1, ▁P la ce: 1, ▁in: 15, ▁M is s ou la !: 1, ▁D o: 1, ▁you: 13, ▁w an t: 1, ▁to: 33, ▁g et: 2, ▁be t ter: 2, ▁a t: 1, ▁mak ing: 2, ▁d e l ic i ou s: 1, ▁BBQ ?: 1, ▁ Y ou: 1, ▁will: 6, ▁have: 4, ▁the: 31,

Our implementation of BPE’s code from the paper matches up pretty well with the library itself! Difference are probably accounted for by the vocab_size. There is also another technical difference in that in the SentencePiece implementation of BPE a priority queue is used to more efficiently keep track of the best pairs. Actually, there is a priority queue in the Python standard library called heapq if you would like to give that a try below!

from heapq import heappush, heappop

def heapsort(iterable):
    h = []
    for value in iterable:
        heappush(h, value)
    return [heappop(h) for i in range(len(h))]

a = [1,4,3,1,3,2,1,4,2]
heapsort(a)

[1, 1, 1, 2, 2, 3, 3, 4, 4]

For a more extensive example consider looking at the SentencePiece repo. The last section of this code is repurposed from that tutorial. Thanks for your participation!

Citation

BibTeX citation:

@online{bochman2021,
  author = {Bochman, Oren},
  title = {SentencePiece and {Byte} {Pair} {Encoding}},
  date = {2021-04-11},
  url = {https://orenbochman.github.io/notes-nlp/notes/c4w3/lab01.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2021. “SentencePiece and Byte Pair Encoding.” April 11, 2021. https://orenbochman.github.io/notes-nlp/notes/c4w3/lab01.html.

Introduction to Tokenization

SentencePiece Preprocessing

NFKC Normalization

Lossless Tokenization*

BPE Algorithm

Preparing our Data

BPE Algorithm

Train SentencePiece BPE Tokenizer on Example Data

Explore SentencePiece Model

Train SentencePiece BPE model with our example.txt

Citation

Lossless Tokenization^*