Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Token Classification Model with Named Entity Recognition (NER)#

The token classification model supports NER and other token-level classification tasks, as long as the data follows the format specified below.

We use the NER task throughout this section. NER, also referred to as entity chunking, identification, or extraction, is the task of detecting and classifying key information (entities) in text. In other words, an NER model takes a piece of text as input and determines the category of each word in it. For example, in the sentence “Mary lives in Santa Clara and works at NVIDIA,” the model should detect that “Mary” is a person, “Santa Clara” is a location, and “NVIDIA” is a company.
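As a rough illustration of what "a category for each word" looks like, the sketch below groups word-level labels into entity spans for the example sentence. It assumes IOB-style labels (B- for the first word of an entity, I- for continuation words, O for everything else), the same scheme used in the data format later on this page. The label names PER, LOC, and ORG and the helper extract_entities are illustrative and not part of NeMo.

```python
def extract_entities(words, labels):
    """Group word-level IOB labels into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O" or a stray I- label: close any open entity and move on.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


words = "Mary lives in Santa Clara and works at NVIDIA".split()
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O", "O", "B-ORG"]
print(extract_entities(words, labels))
# [('Mary', 'PER'), ('Santa Clara', 'LOC'), ('NVIDIA', 'ORG')]
```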

Quick Start#

  1. To run token-level classification, use the following Python script:

from nemo.collections.nlp.models import TokenClassificationModel

# to get the list of pre-trained models
TokenClassificationModel.list_available_models()

# Download and load the pre-trained BERT-based model
model = TokenClassificationModel.from_pretrained("ner_en_bert")

# try the model on a few examples
model.add_predictions(['we bought four shirts from the nvidia gear store in santa clara.', 'NVIDIA is a company.'])

  2. Try this model in a Jupyter notebook, which you can run on Google Colab. You can find this script in the NeMo tutorial.

  3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select GPU for the hardware accelerator).

You can find example scripts and configuration files for the token classification model at the following locations:

Provide Data Input for the Token Classification Model#

To pre-train or fine-tune the model, split the data into the following two files:

  • text.txt

  • labels.txt

Each line of the text.txt file contains one text sequence in which words are separated by spaces: [WORD] [SPACE] [WORD] [SPACE] [WORD]. The labels.txt file contains the corresponding label for each word in text.txt, with labels also separated by spaces: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL]. Line N of labels.txt must therefore contain exactly as many labels as line N of text.txt contains words. The following is an example of a text.txt file:

Jennifer is from New York City . She likes … …

The following is an example of the corresponding labels.txt file:

B-PER O O B-LOC I-LOC I-LOC O O O … …
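Because the two files must line up word for word, it can help to check them before training. The following is a minimal sketch of such a check. The helper check_alignment and the demo sentences are hypothetical and not part of NeMo. It only assumes the space-separated format described above.

```python
import tempfile
from pathlib import Path


def check_alignment(text_path, labels_path):
    """Return (line_number, n_words, n_labels) for every line that does not align."""
    text_lines = Path(text_path).read_text(encoding="utf-8").splitlines()
    label_lines = Path(labels_path).read_text(encoding="utf-8").splitlines()
    if len(text_lines) != len(label_lines):
        raise ValueError(
            f"{len(text_lines)} text lines but {len(label_lines)} label lines"
        )
    mismatches = []
    for number, (text, labels) in enumerate(zip(text_lines, label_lines), start=1):
        n_words, n_labels = len(text.split()), len(labels.split())
        if n_words != n_labels:
            mismatches.append((number, n_words, n_labels))
    return mismatches


# Demo on a throwaway pair of files; the second line is deliberately one label short.
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "text.txt"
    labels_file = Path(tmp) / "labels.txt"
    text_file.write_text(
        "Jennifer is from New York City .\nShe likes music\n", encoding="utf-8"
    )
    labels_file.write_text(
        "B-PER O O B-LOC I-LOC I-LOC O\nO O\n", encoding="utf-8"
    )
    problems = check_alignment(text_file, labels_file)

print(problems)  # [(2, 3, 2)]
```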

Convert the Dataset#

To convert data from the IOB tagging format (short for inside, outside, beginning; see https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) into the format required for training, use the NeMo import script:

# For conversion from IOB format, for example, for CoNLL-2003 dataset:
python import_from_iob_format.py --data_file=<PATH/TO/THE/FILE/IN/IOB/FORMAT>

Required Arguments for Dataset Conversion#

  • --data_file: path to the file to convert from IOB to NeMo format

After you run the above command, the directory that contains the --data_file should include the text_*.txt and labels_*.txt files. The default names of the training and evaluation files in conf/token_classification_config.yaml are the following:

.
|--data_dir
  |-- labels_dev.txt
  |-- labels_train.txt
  |-- text_dev.txt
  |-- text_train.txt
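To make the conversion concrete, here is a minimal sketch of the transformation. It is not the NeMo import script. It assumes a CoNLL-style input with one token per line, the token in the first column, the tag in the last column, blank lines between sentences, and optional -DOCSTART- marker lines. The function name iob_to_text_and_labels is hypothetical.

```python
def iob_to_text_and_labels(iob_lines):
    """Turn token-per-line IOB data into parallel text and label lines."""
    text_lines, label_lines = [], []
    words, labels = [], []

    def flush():
        # Emit the sentence collected so far, if any.
        if words:
            text_lines.append(" ".join(words))
            label_lines.append(" ".join(labels))
            words.clear()
            labels.clear()

    for line in iob_lines:
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            flush()  # blank line or document marker ends the current sentence
            continue
        words.append(fields[0])    # token is the first column
        labels.append(fields[-1])  # tag is the last column
    flush()
    return text_lines, label_lines


sample = [
    "Jennifer B-PER", "is O", "from O",
    "New B-LOC", "York I-LOC", "City I-LOC", ". O",
    "",
    "She O", "likes O",
]
text_lines, label_lines = iob_to_text_and_labels(sample)
print(text_lines)   # ['Jennifer is from New York City .', 'She likes']
print(label_lines)  # ['B-PER O O B-LOC I-LOC I-LOC O', 'O O']
```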

Train the Token Classification Model#

The token classification model jointly trains a classifier on top of a pre-trained language model, such as BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NLP-NER21]). Unless you provide a pre-trained checkpoint for the language model, it is initialized with a pre-trained model from Hugging Face Transformers.
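Conceptually, the classifier maps the hidden vector that the language model produces for each token to a distribution over labels. The pure-Python sketch below shows only that forward pass, assuming two fully connected layers with a ReLU in between and random weights. It leaves out dropout and training and is not NeMo's implementation. All names and sizes are illustrative.

```python
import math
import random


def linear(vector, weights, bias):
    """Apply one fully connected layer to a single vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def make_layer(rng, n_in, n_out):
    """Create a randomly initialized (weights, bias) pair."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify_tokens(hidden_states, layers):
    """Map each token's hidden vector to a probability distribution over labels."""
    outputs = []
    for vector in hidden_states:
        for index, (weights, bias) in enumerate(layers):
            vector = linear(vector, weights, bias)
            if index < len(layers) - 1:
                vector = [max(0.0, value) for value in vector]  # ReLU between layers
        outputs.append(softmax(vector))
    return outputs


rng = random.Random(0)
hidden_size, num_labels, num_tokens = 8, 3, 4
# Stand-in for the language model output: one hidden vector per token.
hidden_states = [[rng.gauss(0, 1) for _ in range(hidden_size)] for _ in range(num_tokens)]
layers = [make_layer(rng, hidden_size, hidden_size), make_layer(rng, hidden_size, num_labels)]
probabilities = classify_tokens(hidden_states, layers)
print(len(probabilities), len(probabilities[0]))  # 4 3
```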

An example of a model configuration file for training can be found in the NeMo configuration file.

The specification can be roughly grouped into three categories:

  • Parameters that describe the training process: trainer

  • Parameters that describe the datasets: model.dataset, model.train_ds, model.validation_ds

  • Parameters that describe the model: model

You can find more details about the spec file parameters in the table below.

Parameter | Data Type | Description
model.dataset.data_dir | string | Path to the data converted to the format specified above.
model.head.num_fc_layers | integer | Number of fully connected layers.
model.head.fc_dropout | float | Dropout to apply to the input hidden states.
model.head.activation | string | Activation to use between fully connected layers.
model.head.use_transformer_init | bool | Whether to initialize the weights of the classifier head with the same approach used in Transformer.
model.train_ds.text_file | string | Name of the text training file located at data_dir.
model.train_ds.labels_file | string | Name of the labels training file located at data_dir.
model.train_ds.num_samples | integer | Number of samples to use from the training dataset. Use -1 to use all samples.
model.validation_ds.text_file | string | Name of the text file for evaluation, located at data_dir.
model.validation_ds.labels_file | string | Name of the labels file for evaluation, located at data_dir.
model.validation_ds.num_samples | integer | Number of samples to use from the dev set. Use -1 to use all samples.

For more information, see Model NLP.

Here is an example command for training the model:

python token_classification_train.py \
       model.dataset.data_dir=<PATH_TO_DATA_DIR>  \
       trainer.max_epochs=<NUM_EPOCHS> \
       trainer.devices=[<CHANGE_TO_GPU(s)_YOU_WANT_TO_USE>] \
       trainer.accelerator='gpu'

Required Arguments for Training#

The following argument is required for training:

  • model.dataset.data_dir: path to the directory with pre-processed data.

Note

The arguments are defined in the spec file, but you can override any of them from the command line to experiment. For example, the sample spec file mentioned above sets model.validation_ds.batch_size to 64. If a larger batch size would make better use of the GPU, override it by adding model.validation_ds.batch_size=128 to the command line. You can do the same with any parameter defined in the sample spec file.
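The sketch below illustrates how a dotted command-line override maps onto the nested spec file. It is a simplified stand-in for what the configuration framework does, not its actual implementation. The helper apply_overrides, the nested defaults dictionary, and the /data/ner path are assumptions made for the example. The model.validation_ds nesting follows the parameter groups listed earlier in this section.

```python
import ast
from copy import deepcopy


def apply_overrides(config, overrides):
    """Return a copy of a nested dict with 'a.b.c=value' overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})  # walk down, creating groups as needed
        try:
            value = ast.literal_eval(raw_value)  # numbers, booleans, lists
        except (ValueError, SyntaxError):
            value = raw_value  # keep plain strings such as paths as-is
        node[leaf] = value
    return result


# Illustrative defaults; only batch_size=64 comes from the note above.
defaults = {"model": {"dataset": {"data_dir": None}, "validation_ds": {"batch_size": 64}}}
updated = apply_overrides(
    defaults,
    ["model.validation_ds.batch_size=128", "model.dataset.data_dir=/data/ner"],
)
print(updated["model"]["validation_ds"]["batch_size"])  # 128
print(updated["model"]["dataset"]["data_dir"])          # /data/ner
```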

Inference#

An example script on how to run inference can be found at NeMo evaluation script.

To run inference with the pre-trained model, run:

python token_classification_evaluate.py \
       pretrained_model=<PRETRAINED_MODEL>

Required Arguments for Inference#

The following argument is required for inference:

  • pretrained_model: pretrained Token Classification model from list_available_models() or path to a .nemo file. For example, ner_en_bert or your_model.nemo

Evaluate the Token Classification Model#

An example script on how to evaluate the pre-trained model can be found at NeMo evaluation script.

To start the evaluation of the pre-trained model, run:

python token_classification_evaluate.py \
       model.dataset.data_dir=<PATH/TO/DATA/DIR>  \
       pretrained_model=ner_en_bert \
       model.test_ds.text_file=<text_*.txt> \
       model.test_ds.labels_file=<labels_*.txt> \
       model.dataset.max_seq_length=512

Required Arguments for Evaluation#

The following arguments are required for evaluation:

  • pretrained_model: pretrained Token Classification model from list_available_models() or path to a .nemo file. For example, ner_en_bert or your_model.nemo

  • model.dataset.data_dir: path to the directory that contains model.test_ds.text_file and model.test_ds.labels_file

During evaluation of the test_ds, the script generates a classification report that includes the following metrics:

  • Precision

  • Recall

  • F1

For more information, see Wikipedia.
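As a reminder of how these metrics are defined, the following sketch computes them per label from two aligned label sequences. It is a token-level illustration only. The report generated by the evaluation script may aggregate or format results differently. The function name per_label_metrics and the sample predictions are hypothetical.

```python
from collections import Counter


def per_label_metrics(true_labels, predicted_labels):
    """Compute token-level precision, recall, and F1 for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    report = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


truth = "B-PER O O B-LOC I-LOC I-LOC O".split()
preds = "B-PER O O B-LOC I-LOC O O".split()
report = per_label_metrics(truth, preds)
for label, scores in report.items():
    print(label, scores)
```

In this example the model misses one of the two I-LOC tokens, so I-LOC has a precision of 1.0 and a recall of 0.5, while the extra O prediction lowers the precision of O to 0.75.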

References#

[NLP-NER21]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.