.. _punctuation_and_capitalization:

Punctuation and Capitalization Model
====================================

Automatic Speech Recognition (ASR) systems typically generate text without punctuation and capitalization. There are two issues with non-punctuated ASR output:

- it can be difficult to read and understand
- models for some downstream tasks, such as named entity recognition, machine translation, or text-to-speech, are usually trained on punctuated datasets, and using raw ASR output as the input to these models can degrade their performance

Quick Start Guide
-----------------

.. code-block:: python

    from nemo.collections.nlp.models import PunctuationCapitalizationModel

    # to get the list of pre-trained models
    PunctuationCapitalizationModel.list_available_models()

    # Download and load the pre-trained BERT-based model
    model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")

    # try the model on a few examples
    model.add_punctuation_capitalization(['how are you', 'great how about you'])

Model Description
-----------------

For each word in the input text, the Punctuation and Capitalization model:

- predicts a punctuation mark that should follow the word (if any). By default, the model supports commas, periods, and question marks.
- predicts whether the word should be capitalized or not

In the Punctuation and Capitalization model, we jointly train two token-level classifiers on top of a pre-trained language model, such as `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-punct-devlin2018bert`.

.. note::

    We recommend you try this model in a Jupyter notebook (run it on `Google's Colab <https://colab.research.google.com>`_): ``NeMo/tutorials/nlp/Punctuation_and_Capitalization.ipynb``.

    Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator).

An example script on how to train and evaluate the model can be found at ``NeMo/examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py``.

The default configuration file for the model can be found at ``NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml``.

The script for inference can be found at ``NeMo/examples/nlp/token_classification/punctuate_capitalize_infer.py``.

.. _raw_data_format_punct:

Raw Data Format
---------------

The Punctuation and Capitalization model can work with any text dataset, although it is recommended to balance the data, especially for the punctuation task. Before pre-processing the data into the format expected by the model, the data should be split into ``train.txt`` and ``dev.txt`` (and optionally ``test.txt``). Each line in ``train.txt``/``dev.txt``/``test.txt`` should represent one or more full and/or truncated sentences.

Example of the ``train.txt``/``dev.txt`` file:

.. code::

    When is the next flight to New York?
    The next flight is ...
    ...

The ``source_data_dir`` structure should look similar to the following:

.. code::

    .
    |--source_data_dir
        |-- dev.txt
        |-- train.txt

.. _nemo-data-format-label:

NeMo Data Format
----------------

The Punctuation and Capitalization model expects the data in the following format.

The training and evaluation data is divided into 2 files:

- ``text.txt``
- ``labels.txt``

Each line of the ``text.txt`` file contains a text sequence, where words are separated with spaces:

``[WORD] [SPACE] [WORD] [SPACE] [WORD]``

For example::

    when is the next flight to new york
    the next flight is ...
    ...
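The mapping from raw, punctuated text to this format (and to the label format described next) can be illustrated with a short sketch. The ``to_nemo_format`` helper below is hypothetical and only mirrors the idea of the conversion script described later in this document; it is not the script's actual implementation:

.. code-block:: python

    # Hypothetical helper illustrating the NeMo data format: it lowercases words,
    # moves trailing punctuation into the first label symbol, and records
    # capitalization in the second label symbol (see the label description below).
    def to_nemo_format(sentence: str, allowed_punct=(",", ".", "?")):
        words, labels = [], []
        for token in sentence.split():
            punct = token[-1] if token[-1] in allowed_punct else "O"
            word = token.rstrip("".join(allowed_punct))
            capit = "U" if word[:1].isupper() else "O"
            words.append(word.lower())
            labels.append(punct + capit)
        return " ".join(words), " ".join(labels)

    text_line, label_line = to_nemo_format("When is the next flight to New York?")
    print(text_line)   # when is the next flight to new york
    print(label_line)  # OU OO OO OO OO OO OU ?U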
The ``labels.txt`` file contains the corresponding labels for each word in ``text.txt``; the labels are separated with spaces. Each label in the ``labels.txt`` file consists of 2 symbols:

- the first symbol of the label indicates what punctuation mark should follow the word (where ``O`` means no punctuation is needed)
- the second symbol determines whether the word needs to be capitalized or not (where ``U`` indicates that the word should be upper cased, and ``O`` means no capitalization is needed)

By default, the following punctuation marks are considered: commas, periods, and question marks; the remaining punctuation marks were removed from the data. This can be changed by introducing new labels in the ``labels.txt`` files.

Each line of the ``labels.txt`` file should follow the format:

``[LABEL] [SPACE] [LABEL] [SPACE] [LABEL]``

For example, labels for the above ``text.txt`` file should be::

    OU OO OO OO OO OO OU ?U
    OU OO OO OO ...
    ...

The complete list of all possible labels used in this tutorial is:

- ``OO``
- ``.O``
- ``?O``
- ``OU``
- ``.U``
- ``?U``

Converting Raw Data to NeMo Format
----------------------------------

To pre-process the raw text data stored under :code:`source_data_dir` (see the :ref:`raw_data_format_punct` section), run the following command:

.. code::

    python examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py \
           -s <PATH/TO/THE/SOURCE/FILE> \
           -o <PATH/TO/THE/OUTPUT/DIRECTORY>

Required Arguments for Dataset Conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- :code:`-s` or :code:`--source_file`: path to the raw file
- :code:`-o` or :code:`--output_dir`: path to the directory to store the converted files

After the conversion, the :code:`output_dir` should contain :code:`labels_*.txt` and :code:`text_*.txt` files. The default names for the training and evaluation files in :code:`conf/punctuation_capitalization_config.yaml` are the following:

.. code::

    .
    |--output_dir
        |-- labels_dev.txt
        |-- labels_train.txt
        |-- text_dev.txt
        |-- text_train.txt

Tarred dataset
--------------

Tokenization and encoding of data is quite costly for the punctuation and capitalization task. If your dataset contains a lot of samples (~4M), you may want to use a tarred dataset. A tarred dataset is a collection of ``.tar`` files which contain batches ready for passing into a model. A tarred dataset is not loaded into memory entirely, but in small pieces which do not overflow memory. Tarred datasets rely on `webdataset <https://github.com/webdataset/webdataset>`_.

To create a tarred dataset you will need the data in NeMo format:

.. code::

    python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \
        --text <PATH/TO/LOWERCASED/TEXT/WITHOUT/PUNCTUATION> \
        --labels <PATH/TO/LABELS/IN/NEMO/FORMAT> \
        --output_dir <PATH/TO/DIRECTORY/WITH/OUTPUT/TARRED/DATASET> \
        --num_batches_per_tarfile 100

All tar files contain a similar number of batches, so up to :code:`--num_batches_per_tarfile - 1` batches will be discarded during tarred dataset creation.

Besides the ``.tar`` files with batches, the ``examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py`` script will create a metadata JSON file and 2 ``.csv`` files with punctuation and capitalization label vocabularies.

To use a tarred dataset, you will need to pass the path to the metadata file of your dataset in the config parameter :code:`model.train_ds.tar_metadata_file` and set the config parameter :code:`model.train_ds.use_tarred_dataset=true`.
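For example, enabling a tarred dataset could look roughly like the following. This is a minimal sketch that assumes the default config file shipped with the example script; the metadata file path is a placeholder, and how it is resolved relative to the data directory should be checked against the data config description:

.. code-block:: python

    from omegaconf import OmegaConf

    # Sketch only: switch the training data loader of the default config to a
    # tarred dataset. ``use_tarred_dataset`` and ``tar_metadata_file`` are the
    # parameters named above; the placeholder value must be replaced.
    cfg = OmegaConf.load(
        "examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml"
    )
    cfg.model.train_ds.use_tarred_dataset = True
    cfg.model.train_ds.tar_metadata_file = "<PATH/TO/METADATA/JSON/FILE>"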
Training Punctuation and Capitalization Model
---------------------------------------------

The language model is initialized with a pre-trained model from `HuggingFace Transformers <https://huggingface.co/transformers>`__, unless the user provides a pre-trained checkpoint for the language model. To train a model from scratch, you will need to provide a HuggingFace configuration in one of the parameters ``model.language_model.config_file`` or ``model.language_model.config``. An example of a model configuration file for training the model can be found at ``NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml``.

A configuration file is a ``.yaml`` file which contains all parameters for model creation, training, testing, and validation. The structure of the configuration file for training and testing is described in the :ref:`Run config<run-config-label>` section. Some parameters are required in a punctuation-and-capitalization ``.yaml`` config; the default values of required parameters are ``???``. If you omit any of the other parameters, they will be initialized according to the default values from the following tables.

.. _run-config-label:

Run config
^^^^^^^^^^

An example of a config file is the default ``punctuation_capitalization_config.yaml`` mentioned above.

.. list-table:: Run config. The main config passed to the script ``punctuation_capitalization_train_evaluate.py``
   :widths: 5 5 10 25
   :header-rows: 1

   * - **Parameter**
     - **Data type**
     - **Default value**
     - **Description**
   * - **pretrained_model**
     - string
     - ``null``
     - Can be an NVIDIA NGC cloud model or a path to a ``.nemo`` checkpoint. You can get the list of possible cloud options by calling the method :py:meth:`~nemo.collections.nlp.models.PunctuationCapitalizationModel.list_available_models`.
   * - **name**
     - string
     - ``'Punctuation_and_Capitalization'``
     - A name of the model. Used for naming output directories and ``.nemo`` checkpoints.
   * - **do_training**
     - bool
     - ``true``
     - Whether to perform training of the model.
   * - **do_testing**
     - bool
     - ``false``
     - Whether to perform testing of the model after training.
   * - **model**
     - :ref:`model config<model-config-label>`
     - :ref:`model config<model-config-label>`
     - A configuration for the :class:`~nemo.collections.nlp.models.PunctuationCapitalizationModel`.
   * - **trainer**
     - trainer config
     -
     - Parameters of ``pytorch_lightning.Trainer``.
   * - **exp_manager**
     - exp manager config
     -
     - A configuration with various NeMo training options such as output directories, resuming from checkpoint, tensorboard and W&B logging, and so on. For possible options see the :ref:`exp-manager-label` description and the class :class:`~nemo.utils.exp_manager.exp_manager`.
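To see how the run config is consumed, the sketch below mirrors, in simplified form, what ``punctuation_capitalization_train_evaluate.py`` does with it. It assumes the standard NeMo/PyTorch Lightning pattern; the paths are placeholders, and the required ``???`` values (for example the data locations in ``ds_item``) must be filled in before the model can be instantiated:

.. code-block:: python

    import pytorch_lightning as pl
    from omegaconf import OmegaConf

    from nemo.collections.nlp.models import PunctuationCapitalizationModel
    from nemo.utils.exp_manager import exp_manager

    # Simplified illustration of a training run driven by the run config;
    # see the example script for the authoritative version.
    cfg = OmegaConf.load(
        "examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml"
    )
    cfg.model.train_ds.ds_item = "<PATH/TO/DIRECTORY/WITH/TRAIN/DATA>"
    cfg.model.validation_ds.ds_item = "<PATH/TO/DIRECTORY/WITH/DEV/DATA>"

    trainer = pl.Trainer(**cfg.trainer)                 # ``trainer`` section of the run config
    exp_manager(trainer, cfg.get("exp_manager", None))  # ``exp_manager`` section
    model = PunctuationCapitalizationModel(cfg.model, trainer=trainer)
    if cfg.do_training:
        trainer.fit(model)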
.. _model-config-label:

Model config
^^^^^^^^^^^^

.. list-table:: Location of model config in parent config
   :widths: 5 5
   :header-rows: 1

   * - **Parent config**
     - **Key in parent config**
   * - :ref:`Run config<run-config-label>`
     - ``model``

A configuration of the :class:`~nemo.collections.nlp.models.token_classification.punctuation_capitalization_model.PunctuationCapitalizationModel` model.

.. list-table:: Model config
   :widths: 5 5 10 25
   :header-rows: 1

   * - **Parameter**
     - **Data type**
     - **Default value**
     - **Description**
   * - **class_labels**
     - :ref:`class labels config<class-labels-config-label>`
     - :ref:`class labels config<class-labels-config-label>`
     - Cannot be omitted in the ``.yaml`` config. The ``class_labels`` parameter contains a dictionary with the names of the label id files used in ``.nemo`` checkpoints. These file names can also be used for passing label vocabularies to the model. If you wish to use ``class_labels`` for passing vocabularies, please provide the path to the vocabulary files in the ``model.common_dataset_parameters.label_vocab_dir`` parameter.
   * - **common_dataset_parameters**
     - :ref:`common dataset parameters config`
     - :ref:`common dataset parameters config`
     - Label ids and loss mask information.
   * - **train_ds**
     - :ref:`data config` with string in ``ds_item``
     - ``null``
     - A configuration for creating the training dataset and data loader. Cannot be omitted in the ``.yaml`` config if training is performed.
   * - **validation_ds**
     - :ref:`data config` with string OR list of strings in ``ds_item``
     - ``null``
     - A configuration for creating validation datasets and data loaders.
   * - **test_ds**
     - :ref:`data config` with string OR list of strings in ``ds_item``
     - ``null``
     - A configuration for creating test datasets and data loaders. Cannot be omitted in the ``.yaml`` config if testing is performed.
   * - **punct_head**
     - :ref:`head config`
     - :ref:`head config`
     - A configuration for creating the punctuation MLP head that is applied to the language model outputs.
   * - **capit_head**
     - :ref:`head config`
     - :ref:`head config`
     - A configuration for creating the capitalization MLP head that is applied to the language model outputs.
   * - **tokenizer**
     - :ref:`tokenizer config`
     - :ref:`tokenizer config`
     - A configuration for creating the source text tokenizer.
   * - **language_model**
     - :ref:`language model config`
     - :ref:`language model config`
     - A configuration of a BERT-like language model which serves as the model body.
   * - **optim**
     - optimization config
     - ``null``
     - A configuration of the optimizer, learning rate scheduler, and L2 regularization. Cannot be omitted in the ``.yaml`` config if training is performed. For more information, see the Optimization section of the documentation and the NeMo primer tutorial.

.. _class-labels-config-label:

Class labels config
^^^^^^^^^^^^^^^^^^^

.. list-table:: Location of class labels config in parent configs
   :widths: 5 5
   :header-rows: 1

   * - **Parent config**
     - **Key in parent config**
   * - :ref:`Run config<run-config-label>`
     - ``model.class_labels``
   * - :ref:`Model config<model-config-label>`
     - ``class_labels``

.. list-table:: Class labels config
   :widths: 5 5 5 35
   :header-rows: 1

   * - **Parameter**
     - **Data type**
     - **Default value**
     - **Description**
   * - **punct_labels_file**
     - string
     - ``???``
     - A name of the punctuation labels file. This parameter cannot be omitted in the ``.yaml`` config. This name is used as the name of the label ids file in the ``.nemo`` checkpoint. It can also be used for passing the label vocabulary to the model. If ``punct_labels_file`` is used as a vocabulary file, then you should provide the parameter ``label_vocab_dir`` in :ref:`common dataset parameters` (``model.common_dataset_parameters.label_vocab_dir`` in the :ref:`run config<run-config-label>`). Each line of the ``punct_labels_file`` contains one label. The values are sorted, ``<line number>==<label id>``.