Building a Tokenizer from Scratch: A Simple Guide

Charan H U · November 12, 2024

Natural Language Processing (NLP) has seen tremendous growth in recent years, powering applications like chatbots, language translation, sentiment analysis, and more. At the heart of NLP lies tokenization — the process of breaking down text into smaller units called tokens. Tokens can be words, subwords, or even characters, depending on the language and the specific needs of your application.

In this blog post, we’ll walk through building a tokenizer from scratch using the Hugging Face tokenizers library. We'll keep things simple and user-friendly, ensuring that even if you're new to NLP or programming, you'll be able to follow along.

Why Build Your Own Tokenizer?

While there are many pre-trained tokenizers available, building your own tokenizer allows you to:

  • Customize it for a specific language or domain (e.g., medical texts, legal documents).
  • Optimize performance for specific tasks.
  • Understand the inner workings of tokenization, which can help in debugging and improving NLP models.

Prerequisites

Before we start, make sure you have the following:

  • Basic Python knowledge.
  • An installed Python environment (Python 3.7 or higher).
  • Familiarity with the command line.
  • Internet connection (for downloading datasets).

Setup

1. Install Required Libraries

We’ll use the following libraries:

  • tokenizers: A fast and flexible library for tokenization.
  • datasets: To load and manage datasets.

Install them using pip:

pip install tokenizers datasets

2. Import Libraries

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.normalizers import NFKC
from datasets import load_dataset

Step-by-Step Guide to Building a Tokenizer

We’ll build a Byte-Pair Encoding (BPE) tokenizer, which is effective for languages with rich morphology and large vocabularies.

Step 1: Choose a Dataset

For this example, we’ll use the WikiText-2 dataset, which contains a large amount of English text suitable for training a tokenizer.

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
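Many rows in WikiText-2 are blank lines or section headings, so a quick peek at the first non-empty entry is a useful sanity check that the data loaded correctly:

# Grab the first non-empty row just to see what the raw text looks like.
sample = next(text for text in dataset["text"] if text.strip())
print(sample[:200])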

Step 2: Prepare the Training Corpus

We need to create a generator that yields batches of text for the tokenizer to process.

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

Step 3: Initialize the Tokenizer

We start by creating a Tokenizer instance with a BPE model.

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

  • unk_token: Specifies the token to use for unknown words.

Step 4: Set the Normalizer

Normalization standardizes text, handling issues like accented characters and case sensitivity.

tokenizer.normalizer = normalizers.Sequence([
    NFKC(),  # Normalization Form Compatibility Composition
    # You can add more normalizers if needed
])
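Before training, you can sanity-check the normalizer on its own with normalize_str. NFKC folds compatibility characters, for example the single ligature character "ﬁ" becomes the two letters "fi":

# Compatibility characters are folded; accented letters like "ï" stay composed.
print(tokenizer.normalizer.normalize_str("ﬁle naïve"))  # -> "file naïve"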

Step 5: Set the Pre-tokenizer

Pre-tokenization splits the text into initial word-level pieces. The Whitespace pre-tokenizer used here splits on whitespace and also separates punctuation from words.

tokenizer.pre_tokenizer = Whitespace()
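You can inspect pre-tokenization on its own with pre_tokenize_str, which returns (piece, offsets) pairs. Note how punctuation is split off as well:

# Returns a list of (piece, (start, end)) pairs.
print(tokenizer.pre_tokenizer.pre_tokenize_str("Hello, world!"))
# e.g. [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]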

Step 6: Train the Tokenizer

We define a trainer that specifies how we want to train our tokenizer.

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = BpeTrainer(vocab_size=30000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

  • vocab_size: The target size (an upper bound) of the tokenizer’s vocabulary.
  • special_tokens: Tokens that have special meanings in models (e.g., padding or unknown tokens).
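Training takes a little while depending on your machine. Once it finishes, a quick check confirms the vocabulary was built and the special tokens were assigned ids:

print("Vocab size:", tokenizer.get_vocab_size())    # at most 30000
print("[PAD] id:", tokenizer.token_to_id("[PAD]"))  # an integer, not None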

Step 7: Test the Tokenizer

Let’s see how our tokenizer works on a sample text.

encoding = tokenizer.encode("Hello, how are you doing today?")
print("Tokens:", encoding.tokens)

Output:

Tokens: ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
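Alongside the token strings, the encoding also exposes the integer ids a model would consume, plus the character offsets of each token (the exact id values depend on your training run):

print("IDs:", encoding.ids)          # integer ids from the trained vocabulary
print("Offsets:", encoding.offsets)  # (start, end) character spans in the input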

Step 8: Save the Tokenizer

We can save the tokenizer to a file for later use.

tokenizer.save("my_tokenizer.json")

Step 9: Load and Use the Tokenizer

Later, we can load the tokenizer from the file:

from tokenizers import Tokenizer
loaded_tokenizer = Tokenizer.from_file("my_tokenizer.json")
encoding = loaded_tokenizer.encode("This is a test.")
print("Tokens:", encoding.tokens)

Output:

Tokens: ['This', 'is', 'a', 'test', '.']

Customizing the Tokenizer

Adding Post-processing

You might want to add special tokens at the beginning or end of sentences, which is common in models like BERT.

from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
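With the post-processor attached, single-sentence encodings are wrapped automatically. The inner tokens depend on your trained vocabulary, but the output should now start with [CLS] and end with [SEP]:

encoding = tokenizer.encode("Hello, how are you doing today?")
print("Tokens:", encoding.tokens)  # ['[CLS]', ..., '[SEP]']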

Using the Tokenizer with Hugging Face Transformers

If you plan to use the tokenizer with Hugging Face’s transformers library, you can wrap it with PreTrainedTokenizerFast.

from transformers import PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

Now you can use it like any other tokenizer in the transformers library.
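One caveat: the wrapper does not infer which strings play the special-token roles, so it is safer to declare them explicitly when wrapping. Something like this:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Behaves like any other fast tokenizer in transformers; batch encoding
# with padding works because pad_token is now known.
batch = wrapped_tokenizer(["This is a test.", "Another sentence."], padding=True)
print(batch["input_ids"])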

Building a Tokenizer for Another Language

Suppose you want to build a tokenizer for a language other than English, like Kannada, a language spoken in India.

Step 1: Load a Dataset in Kannada

dataset = load_dataset("oscar", "unshuffled_deduplicated_kn", split="train")

Step 2: Prepare the Training Corpus

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

Step 3 to Step 8: Repeat the Previous Steps

Follow the same steps as before to initialize, train, and save your tokenizer.
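For reference, here is a condensed sketch of those steps applied to the Kannada corpus, assuming the same imports as before (the filename kannada_tokenizer.json is just an example):

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([NFKC()])
tokenizer.pre_tokenizer = Whitespace()

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = BpeTrainer(vocab_size=30000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

tokenizer.save("kannada_tokenizer.json")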

Testing the Kannada Tokenizer

encoding = tokenizer.encode("ನಮಸ್ಕಾರ, ನೀವು ಹೇಗಿದ್ದೀರಿ?")
print("Tokens:", encoding.tokens)

Output:

Tokens: ['ನಮ', 'ಸ್ಕಾ', 'ರ', ',', 'ನೀ', 'ವು', 'ಹೇ', 'ಗಿದ್ದೀರಿ', '?']

Conclusion

Building a tokenizer from scratch might seem daunting at first, but with the right tools and a step-by-step approach, it’s quite manageable. By creating your own tokenizer, you gain:

  • Control over the tokenization process, tailoring it to your specific needs.
  • Insights into how tokenization affects model performance.
  • The ability to work with underrepresented languages or specialized domains.

Remember, tokenization is a crucial step in NLP pipelines. A well-designed tokenizer can significantly impact the effectiveness of your models.

FAQs

1. Why use Byte-Pair Encoding (BPE)?

BPE helps in handling out-of-vocabulary words by breaking them into subwords, which is especially useful for languages with rich morphology.

2. Can I use this tokenizer with any language?

Yes, the Hugging Face tokenizers library supports multiple languages. Ensure you have a suitable dataset in the target language.

3. How do I choose the right vocabulary size?

It depends on your specific application and dataset size. Larger vocabularies capture more unique tokens but require more memory.

4. What are special tokens, and why are they important?

Special tokens like [PAD], [UNK], [CLS], etc., are used by models to handle padding, unknown words, sentence classification tasks, and more.

Acknowledgments

  • Hugging Face for their excellent libraries and documentation.
  • The NLP Community for continuous learning and support.
