Token Classification (Named Entity Recognition) Model

The TokenClassification model supports named entity recognition (NER) and other token-level classification tasks, as long as the data follows the format specified below.

We use the NER task throughout this documentation. Named entity recognition (NER), also referred to as entity chunking, identification, or extraction, is the task of detecting and classifying key information (entities) in text. In other words, a NER model takes a piece of text as input and, for each word in the text, identifies the category the word belongs to. For example, in the sentence: Mary lives in Santa Clara and works at NVIDIA, the model should detect that Mary is a person, Santa Clara is a location, and NVIDIA is a company.

Quick Start

from nemo.collections.nlp.models import TokenClassificationModel

# to get the list of pre-trained models
TokenClassificationModel.list_available_models()

# Download and load the pre-trained BERT-based model
model = TokenClassificationModel.from_pretrained("ner_en_bert")

# try the model on a few examples
model.add_predictions(['we bought four shirts from the nvidia gear store in santa clara.', 'NVIDIA is a company.'])

Note

We recommend you try this model in a Jupyter notebook (it can run on Google Colab): NeMo/tutorials/nlp/Token_Classification_Named_Entity_Recognition.ipynb.

Connect to an instance with a GPU (Runtime -> Change runtime type -> select “GPU” for hardware accelerator)

An example script showing how to train the model can be found here: NeMo/examples/nlp/token_classification/token_classification_train.py.

An example script showing how to run evaluation and inference can be found here: NeMo/examples/nlp/token_classification/token_classification_evaluate.py.

The default configuration file for the model can be found at: NeMo/examples/nlp/token_classification/conf/token_classification_config.yaml.

Data Input for Token Classification Model

For pre-training or fine-tuning of the model, the data should be split into 2 files:

  • text.txt

  • labels.txt

Each line of the text.txt file contains a text sequence, where words are separated with spaces, i.e.: [WORD] [SPACE] [WORD] [SPACE] [WORD]. The labels.txt file contains the corresponding labels for each word in text.txt; the labels are separated with spaces, i.e.: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL]. Example of a text.txt file:

Jennifer is from New York City . She likes … …

Corresponding labels.txt file:

B-PER O O B-LOC I-LOC I-LOC O O O … …
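To illustrate this layout, the following minimal Python sketch writes a toy text.txt/labels.txt pair in the expected format. The sentences and tags below are hypothetical examples for illustration only, not part of the NeMo tooling:

# toy sentences and their word-level tags; each word must have exactly one label
sentences = [
    ["Jennifer", "is", "from", "New", "York", "City", "."],
    ["She", "works", "at", "NVIDIA", "."],
]
labels = [
    ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"],
    ["O", "O", "O", "B-ORG", "O"],
]

with open("text.txt", "w") as text_f, open("labels.txt", "w") as labels_f:
    for words, tags in zip(sentences, labels):
        assert len(words) == len(tags)  # one label per word
        text_f.write(" ".join(words) + "\n")
        labels_f.write(" ".join(tags) + "\n")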

Dataset Conversion

To convert data in IOB format (short for inside, outside, beginning) to the format required for training, use examples/nlp/token_classification/data/import_from_iob_format.py.

# For conversion from IOB format, for example, for CoNLL-2003 dataset:
python import_from_iob_format.py --data_file=<PATH/TO/THE/FILE/IN/IOB/FORMAT>

Convert Dataset Required Arguments

  • --data_file: path to the file to convert from IOB to NeMo format

After running the above command, the directory where --data_file is stored should contain text_*.txt and labels_*.txt files. The default names for the training and evaluation files in conf/token_classification_config.yaml are the following:

.
|--data_dir
  |-- labels_dev.txt
  |-- labels_train.txt
  |-- text_dev.txt
  |-- text_train.txt
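Before training, it can be useful to confirm that every line in a text file has the same number of tokens as the matching line in its labels file. A minimal sketch of such a check, assuming the default file names shown above:

import os

data_dir = "data_dir"  # path you will pass as model.dataset.data_dir
for split in ("train", "dev"):
    text_path = os.path.join(data_dir, f"text_{split}.txt")
    labels_path = os.path.join(data_dir, f"labels_{split}.txt")
    with open(text_path) as text_f, open(labels_path) as labels_f:
        for i, (text_line, labels_line) in enumerate(zip(text_f, labels_f), start=1):
            n_words = len(text_line.split())
            n_labels = len(labels_line.split())
            if n_words != n_labels:
                print(f"{split} line {i}: {n_words} words but {n_labels} labels")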

Training Token Classification Model

In the Token Classification model, a token-level classifier is trained jointly on top of a pre-trained language model, such as BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NLP-NER1]. Unless the user provides a pre-trained checkpoint for the language model, the language model is initialized with a pre-trained model from HuggingFace Transformers.

An example configuration file for training the model can be found at: NeMo/examples/nlp/token_classification/conf/token_classification_config.yaml.

The specification can be roughly grouped into three categories:

  • Parameters that describe the training process: trainer

  • Parameters that describe the datasets: model.dataset, model.train_ds, model.validation_ds

  • Parameters that describe the model: model

More details about the parameters in the spec file can be found below:

  • model.dataset.data_dir (string): Path to the data converted to the format specified above

  • model.head.num_fc_layers (integer): Number of fully connected layers

  • model.head.fc_dropout (float): Dropout to apply to the input hidden states

  • model.head.activation (string): Activation to use between fully connected layers

  • model.head.use_transformer_init (bool): Whether to initialize the weights of the classifier head with the same approach used in Transformer

  • training_ds.text_file (string): Name of the text training file located at data_dir

  • training_ds.labels_file (string): Name of the labels training file located at data_dir

  • training_ds.num_samples (integer): Number of samples to use from the training dataset, -1 means use all

  • validation_ds.text_file (string): Name of the text file for evaluation, located at data_dir

  • validation_ds.labels_file (string): Name of the labels dev file located at data_dir

  • validation_ds.num_samples (integer): Number of samples to use from the dev set, -1 means use all

See also Model NLP.

Example of the command for training the model:

python token_classification_train.py \
       model.dataset.data_dir=<PATH_TO_DATA_DIR>  \
       trainer.max_epochs=<NUM_EPOCHS> \
       trainer.gpus=[<CHANGE_TO_GPU(s)_YOU_WANT_TO_USE>]
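For reference, the training command above is roughly equivalent to the following Python sketch. The actual token_classification_train.py script wires these steps together with Hydra, so this is only a simplified approximation; the data path, GPU setting, and epoch count below are placeholders you would adapt:

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.nlp.models import TokenClassificationModel

# load the default config and point it at the pre-processed data
cfg = OmegaConf.load("conf/token_classification_config.yaml")
cfg.model.dataset.data_dir = "<PATH_TO_DATA_DIR>"

# the trainer arguments mirror the command-line overrides (trainer.gpus, trainer.max_epochs)
trainer = pl.Trainer(gpus=1, max_epochs=5)

# the model builds its classifier head on top of the configured pre-trained encoder
model = TokenClassificationModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)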

Required Arguments for Training

  • model.dataset.data_dir: Path to the directory with pre-processed data.

Note

While the arguments are defined in the spec file, if you wish to override these parameter definitions and experiment with them, you can do so on the command line by simply defining the parameter. For example, the sample spec file mentioned above has validation_ds.batch_size set to 64. However, if you see that GPU utilization can be optimized further by using a larger batch size, you can override it by adding validation_ds.batch_size=128 on the command line. You can repeat this with any of the parameters defined in the sample spec file.

Inference

An example script showing how to run inference on a few examples can be found at examples/nlp/token_classification/token_classification_evaluate.py.

To run inference with the pre-trained model on a few examples, run:

python token_classification_evaluate.py \
       pretrained_model=<PRETRAINED_MODEL>

Required Arguments for inference

  • pretrained_model: pretrained TokenClassification model from list_available_models() or path to a .nemo file, for example: ner_en_bert or your_model.nemo
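Alternatively, inference on a few examples can be run directly from Python, as in the Quick Start above. In this sketch, your_model.nemo is a placeholder for a checkpoint you trained yourself:

from nemo.collections.nlp.models import TokenClassificationModel

# load either a pre-trained model from the cloud or a local .nemo checkpoint
model = TokenClassificationModel.from_pretrained("ner_en_bert")
# model = TokenClassificationModel.restore_from("your_model.nemo")

# run inference on a few examples and print the text with predicted entity tags
model.add_predictions(['we bought four shirts from the nvidia gear store in santa clara.', 'NVIDIA is a company.'])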

Model Evaluation

An example script showing how to evaluate the pre-trained model can be found at examples/nlp/token_classification/token_classification_evaluate.py.

To run evaluation of the pre-trained model, run:

python token_classification_evaluate.py \
       model.dataset.data_dir=<PATH/TO/DATA/DIR>  \
       pretrained_model=ner_en_bert \
       model.test_ds.text_file=<text_*.txt> \
       model.test_ds.labels_file=<labels_*.txt> \
       model.dataset.max_seq_length=512

Required Arguments

  • pretrained_model: pretrained TokenClassification model from list_available_models() or path to a .nemo file, for example: ner_en_bert or your_model.nemo

  • model.dataset.data_dir: Path to the directory that contains model.test_ds.text_file and model.test_ds.labels_file.

During evaluation of the test_ds, the script generates a classification report that includes the following metrics:

  • Precision

  • Recall

  • F1

More details about these metrics can be found here.
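As an illustration only (this is not part of the NeMo evaluation script), per-label precision, recall, and F1 for token-level predictions can be computed with scikit-learn from flat lists of true and predicted labels:

from sklearn.metrics import classification_report

# toy word-level labels for illustration; real labels come from labels_*.txt and the model's predictions
y_true = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
y_pred = ["B-PER", "O", "O", "B-LOC", "O", "I-LOC", "O"]

# prints precision, recall, and F1 for each label, plus averaged scores
print(classification_report(y_true, y_pred))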

References

NLP-NER1

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.