Step-by-step NER Model for Bahasa Indonesia with PyTorch and Torchtext

Yosef Ardhito
3 min read · Aug 11, 2020
Attention heatmap for a sample news sentence; a higher value means more attention.

About a month ago, I felt the need to refresh my memory on PyTorch. At the same time, I wanted to learn more about recent developments in Named Entity Recognition (NER), mainly because the topic is related to my thesis. Strangely, I could not find any comprehensive tutorial. Some tutorials are just an introduction to how PyTorch works in general, which was too simple. Others are walls of code, trying to implement every technique that has ever seen the light of day. The few that fell in between mixed PyTorch with Keras rather than using Torchtext (I demand purity!). In the absence of a suitable reference, I started a step-by-step implementation of my own.

To make the learning more concrete, I picked NER for Bahasa Indonesia as the use case, focusing on news articles. If you want to try the final, best-performing model in action right away, I have deployed it to Heroku. Since the model works on Bahasa Indonesia, the website is also written in that language.

An example of the model in action for a sample news sentence: http://nerindo-simple.herokuapp.com

I split the implementation into eight standalone Google Colab notebooks. The starting point is a bi-directional LSTM (BiLSTM) model, which has proven successful for NER in many languages over the last 3–4 years; a minimal sketch of it is shown below. On top of this bare-minimum model, I added embeddings and a Conditional Random Field (CRF). The former is a powerful idea in NLP for representing categorical input as numerical vectors, while the latter is an older concept for imposing sequential constraints on the predicted tags. I also adopted an attention layer and the transformer, which is built on attention; a rough illustration of such an attention layer follows the notebook list. Finally, the last two parts cover the experiments to determine which configuration works best.
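
To give a flavour of the starting point, here is a minimal sketch of such a BiLSTM tagger in plain PyTorch. The class name, dimensions, and tag count are illustrative only and not the exact code from the notebooks; the later parts add character embeddings and attention, and replace the plain per-token classification head with a CRF.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bare-minimum tagger: word embedding -> BiLSTM -> linear layer to tag scores."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags, pad_idx=0):
        super().__init__()
        # Word embeddings; in the notebooks these can be initialized from
        # pre-trained vectors loaded through Torchtext.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # The BiLSTM reads each sentence left-to-right and right-to-left.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        # Project the concatenated forward/backward states to per-token tag scores.
        self.fc = nn.Linear(hidden_dim * 2, num_tags)

    def forward(self, tokens):                      # tokens: [batch, seq_len]
        embedded = self.embedding(tokens)           # [batch, seq_len, emb_dim]
        outputs, _ = self.lstm(embedded)            # [batch, seq_len, 2 * hidden_dim]
        return self.fc(outputs)                     # [batch, seq_len, num_tags]

# Made-up sizes, just to show the shapes involved.
model = BiLSTMTagger(vocab_size=10_000, embedding_dim=100, hidden_dim=64, num_tags=9)
batch = torch.randint(0, 10_000, (2, 12))          # two sentences of twelve tokens
tag_scores = model(batch)                          # [2, 12, 9]
```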

For more detail, visit the notebooks:

[1/8] Bare Minimal Bi-directional LSTM

[2/8] Word Embedding

[3/8] Character Embedding

[4/8] Conditional Random Field

[5/8] Attention Layer

[6/8] Transformer

[7/8] Experiment Settings

[8/8] Optimization
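
The attention heatmap in the header image is built from per-token weights like the ones returned below. This is only a rough sketch of scaled dot-product self-attention applied on top of the BiLSTM outputs, with made-up names and dimensions, and not the exact layer used in notebook [5/8].

```python
import torch
import torch.nn as nn

class TokenAttention(nn.Module):
    """Scaled dot-product self-attention over token representations.

    Returns the re-weighted representations together with the attention
    matrix, which is what a token-by-token heatmap visualizes.
    """

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, hidden):                              # [batch, seq_len, dim]
        q, k, v = self.query(hidden), self.key(hidden), self.value(hidden)
        scores = torch.bmm(q, k.transpose(1, 2)) / self.scale   # [batch, seq, seq]
        weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
        return torch.bmm(weights, v), weights

# Illustrative usage on random "BiLSTM outputs" (the dimensions are made up).
attention = TokenAttention(dim=128)
fake_lstm_out = torch.randn(1, 10, 128)                     # one sentence, ten tokens
context, weights = attention(fake_lstm_out)                 # weights: [1, 10, 10]
```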

Final Remark

Comparison of the five configurations. The BiLSTM models still perform better than the transformer-based ones.

As the primary goal of this journey is learning, the model is by no means perfect. There is no guarantee that it is state-of-the-art or even close to it. In terms of methods, I kept everything to the core. Hopefully, the step-by-step notebooks help you understand each component more clearly, as well as why we should include it.

The complete implementation and dataset can be found in this repository.
