BERT Embedding Models

Sentence-BERT (SBERT) is a modification of the Bidirectional Encoder Representations from Transformers (BERT) model that is specifically trained to generate semantically meaningful sentence embeddings. The model architecture and pre-training process are detailed in the Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks paper. SBERT retains a BERT-based architecture, but it is trained with a Siamese and triplet network structure to derive fixed-size sentence embeddings that capture semantic information. SBERT is commonly used to generate high-quality sentence embeddings for downstream natural language processing tasks such as semantic textual similarity, clustering, and information retrieval.
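
As a conceptual illustration (outside the NeMo pipeline itself), the fixed-size embeddings produced by such a model are typically compared with cosine similarity. The vectors below are stand-ins for real model outputs:

import numpy as np

# Stand-in embeddings; in practice these come from the trained SBERT model.
emb_query = np.array([0.12, -0.48, 0.33, 0.80])
emb_doc = np.array([0.10, -0.52, 0.30, 0.77])

# Cosine similarity: higher values indicate closer semantic meaning.
cosine = np.dot(emb_query, emb_doc) / (np.linalg.norm(emb_query) * np.linalg.norm(emb_doc))
print(f"cosine similarity: {cosine:.3f}")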

The fine-tuning data for the SBERT model should consist of data instances, each comprising a query, a positive document, and a list of negative documents. Negative mining is not yet supported in NeMo, so this preprocessing must be performed offline before training. The dataset must be a JSON file with the following structure:

[ { "query": "Query", "pos_doc": "Positive", "neg_doc": ["Negative_1", "Negative_2", ..., "Negative_n"] }, { // Next data instance }, ..., { // Subsequent data instance } ]

This format ensures that the fine-tuning data is appropriately structured for training the SBERT model.
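
Because negative mining happens outside NeMo, the dataset file is typically assembled by a separate offline preprocessing script. The following is a minimal sketch of how such a file could be written with Python's standard json module; the example instances and the train.json filename are placeholders, not part of the NeMo pipeline.

import json

# Placeholder instances; in practice, the negatives are mined offline with
# your own retrieval pipeline before fine-tuning.
examples = [
    {
        "query": "What is the capital of France?",
        "pos_doc": "Paris is the capital and most populous city of France.",
        "neg_doc": [
            "Berlin is the capital of Germany.",
            "The Eiffel Tower was completed in 1889.",
        ],
    },
]

# Write all instances as a single JSON array, matching the structure above.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)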

To fine-tune the SBERT model, you must initialize it with a BERT model checkpoint. You have two options for obtaining this checkpoint:

  1. If you already have a .nemo checkpoint for SBERT, you can use it directly.

  2. If you have a Hugging Face BERT checkpoint, you’ll need to convert it to the NeMo Megatron Core (mcore) format. Follow the steps below:

python NeMo/scripts/nlp_language_modeling/convert_bert_hf_to_nemo.py \
    --input_name_or_path "intfloat/e5-large-unsupervised" \
    --output_path /path/to/output/nemo/file.nemo \
    --mcore True \
    --precision 32

You must set the configuration to be used for the fine-tuning pipeline in conf/config.yaml.

  1. Set the fine_tuning configuration to specify which configuration file to use for training. You must include fine_tuning in stages to run the training pipeline.

  2. Set the fine_tuning configuration to bert_embedding/sft for BERT Embedding models.

  3. Update the configuration to adjust the hyperparameters of the training runs.
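
Taken together, the relevant parts of conf/config.yaml might look like the sketch below. This is illustrative only; the actual file contains additional defaults and keys that are left unchanged:

defaults:
  - fine_tuning: bert_embedding/sft   # select the BERT Embedding fine-tuning config

stages:
  - fine_tuning                       # run only the fine-tuning pipeline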

Configure the Slurm Cluster

  1. Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

  2. Set the job-specific training configurations in the run section of conf/fine_tuning/bert_embedding/sft.yaml:

run:
  name: bertembedding
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"

  3. To run only the fine-tuning pipeline, set the stages section of conf/config.yaml to:

stages:
  - fine_tuning

  4. Enter the following command:

python3 main.py

Configure the Base Command Platform

  1. Select the cluster-related configuration by following the information in the NVIDIA Base Command Platform documentation.

  2. Launch the job, overriding any training configuration values that need to be updated. Enter the following command:

python3 main.py
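
The launcher's main.py accepts Hydra-style overrides, so configuration values can typically be changed directly on the command line in key=value form. The overrides below are an illustrative sketch; substitute the keys and paths that apply to your setup:

python3 main.py \
    fine_tuning=bert_embedding/sft \
    stages=[fine_tuning] \
    base_results_dir=/path/to/results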
