Reinforcement Learning from Human Feedback

NeMo-RLHF is a library for fine-tuning LLMs, using Reinforcement Learning from Human Feedback (RLHF) in a fully distributed manner.

NeMo-RLHF supports only GPT models and implements the Proximal Policy Optimization (PPO) algorithm. Support for other models and RL algorithms is planned for future releases.

NeMo-RLHF currently is not integrated into NeMo-Megatron-Launcher, so you must launch RLHF jobs directly from the NeMo-RLHF repository in /opt/nemo-rlhf.

For the sections below, NVIDIA provides configurations for trying NeMo-RLHF with the newly released 2B GPT model with a 4096-token sequence length. For the dataset, we recommend starting with the Anthropic HH-RLHF dataset.

NOTE: For best RLHF training performance, it is recommended to start with a supervised fine-tuned (SFT) model rather than the base model. See the preceding sections for instructions on performing SFT on the base model.

You can use NeMo-RLHF to train your own reward model. You train the reward model using a pairwise comparison loss, so you need a dataset with response pairs, where one response in the pair is ranked better than the other. A good reward model is crucial to the success of PPO training in the next stage.
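
As a concrete illustration of the pairwise comparison loss, here is a minimal sketch of the standard Bradley-Terry-style formulation used in most RLHF reward models. NeMo-RLHF's internal implementation may differ, and the names pairwise_comparison_loss, rewards_chosen, and rewards_rejected are illustrative only, not part of the NeMo-RLHF API.

# Sketch of a standard pairwise comparison loss for reward modeling.
# Assumes the reward model has already produced one scalar per response.
import torch
import torch.nn.functional as F

def pairwise_comparison_loss(rewards_chosen: torch.Tensor,
                             rewards_rejected: torch.Tensor) -> torch.Tensor:
    # Push the reward of the preferred response above that of the rejected one.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Toy usage with a batch of two comparison pairs.
chosen = torch.tensor([1.3, 0.2])
rejected = torch.tensor([0.1, 0.5])
loss = pairwise_comparison_loss(chosen, rejected)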

Data Preprocessing

With your own or publicly available data, start by processing the data into JSONL format. Then use the script preprocess_data_for_megatron.py to convert the JSONL file into NeMo format. Format your pairwise comparison dataset with the following structure:


{"text": prompt1 || good_response_1} {"text": prompt1 || bad_response_1} {"text": prompt2 || good_response_2} {"text": prompt2 || bad_response_2} ...

where || denotes string concatenation and prompt1 and prompt2 are different prompts. Note that for the same prompt, prompt || good_response must come before prompt || bad_response in the dataset.

An example JSONL file can look like the following:


{"text": User: When did Virgin Australia start operating?\nAssistant: 31 August 2000} {"text": User: When did Virgin Australia start operating?\nAssistant: I refuse to answer this question.} {"text": User: What is 6*10?\nAssistant: 60} {"text": User: What is 6*10?\nAssistant: 90} ...

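If your comparison data is stored as (prompt, preferred response, rejected response) triples, a minimal Python sketch such as the following can produce the required JSONL layout. The pairs variable is hypothetical example data; adapt the loading step to your own dataset.

# Minimal sketch: write pairwise comparisons in the JSONL layout shown above.
# `pairs` is hypothetical example data; replace it with your own loading logic.
import json

pairs = [
    ("User: When did Virgin Australia start operating?\nAssistant: ",
     "31 August 2000",
     "I refuse to answer this question."),
    ("User: What is 6*10?\nAssistant: ",
     "60",
     "90"),
]

with open("test.jsonl", "w") as f:
    for prompt, good_response, bad_response in pairs:
        # For each prompt, the preferred response must come before the rejected one.
        f.write(json.dumps({"text": prompt + good_response}) + "\n")
        f.write(json.dumps({"text": prompt + bad_response}) + "\n")
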
For reference, here is a command for preprocessing the dataset:


python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input "test.jsonl" \
    --output-prefix "./output" \
    --tokenizer-model sp_tokenizer_256k.model \
    --tokenizer-library=sentencepiece \
    --json-keys text \
    --dataset-impl mmap \
    --workers 30 \
    --chunk_size=100 \
    --append-eod

This command generates the files output_document.bin and output_document.idx, which you can use for reward model training.

Reward Model Training

To launch reward model training, you must start with a pretrained or fine-tuned NeMo checkpoint. The file training_rm.yaml has default configurations for the 2B model, but you may use any model you like.

Here is an example command line to begin training:


cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& python -u examples/nlp/gpt/train_reward_model.py \
    --config-path=examples/nlp/gpt/conf \
    --config-name=training_rm \
    model.pretrained_checkpoint.restore_from_path='model.nemo' \
    "model.data.data_prefix={train: [${train_output_document}], validation: [${val_output_document}], test: [${test_output_document}]}"

Reward Model Evaluation

To learn how to serve the reward model for evaluation, see the section Launching the Reward Model Inference Server, below.

PPO Training

After you have fine-tuned a GPT model using Supervised Fine-Tuning (SFT) and trained a reward model as explained in the preceding sections, you can launch PPO jobs with NeMo-RLHF to fine-tune the SFT model using RLHF.

During PPO training, four different models interact with each other:

  1. The PPO Actor Network, also known as the Policy Network or “actor”: This is the model you are training. It is recommended to begin with a supervised fine-tuned model for RLHF training.

  2. The Reward Model Network (RM Network), also known as a Preference Model (PM): This model takes a prompt and a response as inputs and provides a single scalar value as output. The scalar value is the reward, which the PPO algorithm tries to maximize. The RM must be a model trained as described in the section Reward Model Training.

  3. The PPO Critic Network, also known as the Value Network or “critic”: Because PPO is an actor-critic algorithm, the actor needs a “critic” to help it learn more effectively. The critic provides value estimates for each token in the responses provided by the actor. You can think of these values as estimates of the amount of reward the actor will receive after generating all of the remaining tokens. The critic is loaded from the same RM you trained as described in Reward Model Training.

    Note

    The RM generates a single reward for the entire sequence, while the critic generates a value for each token.

  4. The Initial Policy Network, also known as the Reference/Anchor Model: The PPO algorithm uses this model to compute a Kullback-Leibler divergence (KL divergence) penalty term, which ensures that the actor does not diverge too much from where it started. This prevents the actor from overfitting to the rewards given by the RM, and ensures that it does not forget the knowledge it acquired during pretraining and SFT. This model is the same as the actor at initialization and is never updated. (A sketch of this penalty appears after this list.)

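As a concrete illustration of how the KL penalty is commonly combined with the RM reward, here is a minimal sketch of the per-token formulation used in InstructGPT-style PPO. NeMo-RLHF's exact implementation may differ, and the function and argument names are illustrative only.

# Sketch of an initial-policy KL penalty folded into per-token rewards.
# `beta` corresponds conceptually to rlhf.ppo.initial_policy_kl_penalty.
import torch

def kl_penalized_rewards(rm_reward: torch.Tensor,       # [batch] one scalar per response
                         actor_logprobs: torch.Tensor,  # [batch, tokens]
                         init_logprobs: torch.Tensor,   # [batch, tokens]
                         beta: float = 0.02) -> torch.Tensor:
    # Per-token KL estimate between the actor and the frozen initial policy.
    kl = actor_logprobs - init_logprobs
    per_token_reward = -beta * kl            # penalize divergence at every token
    per_token_reward[:, -1] += rm_reward     # RM reward is credited at the final token
    return per_token_reward
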
To launch a full PPO training job, you must launch the RM and the Initial Policy as inference servers. These two models are not trained, so they only need to perform inference and share their results with the actor. However, the actor and critic need to be trained.

The NeMo PPO architecture assumes that all four models are launched separately. Therefore, you must launch:

  • An inference server for the RM

  • A separate inference server for the initial policy

  • A server that can do inference and training (the PPO Critic)

  • A master job to do training (the PPO Actor).

The next section discusses how to launch each of these four jobs.

Launching the Reward Model Inference Server

To launch the Reward Model inference server in a Linux system, enter this command in the NeMo training container:


cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_reward_model.py \
    --config-path=examples/nlp/gpt/conf/ \
    --config-name=inference_rm \
    gpt_rm_model_file=/path/to/model.nemo \
    port=5555

This command launches the RM inference server on the local computer, using port 5555. You can modify all of the configuration parameters in the file inference_rm.yaml, or override them on the command line. Ensure that server is set to True in the configuration for this job to launch the inference server correctly.
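
Before launching jobs that depend on this server, you may want to confirm that it is reachable. The following is a minimal sketch that only checks TCP connectivity on the host and port used above; it does not exercise the RM inference API, and the localhost default only applies when everything runs on the same machine.

# Minimal reachability check for an inference server (defaults match the RM example above).
# This only verifies that something is listening on the port.
import socket
import time

def wait_for_server(host: str = "localhost", port: int = 5555, timeout_s: float = 300.0) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"Server at {host}:{port} is accepting connections.")
                return
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"No server reachable at {host}:{port} after {timeout_s} seconds.")

wait_for_server()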

Note

Data parallelism is not available for the inference servers, so only a single copy of the model is available.

Launching the Initial Policy Inference Server

To launch the Initial Policy inference server in a Linux system, enter this command in the container:


cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_initial_policy.py \
    --config-path=examples/nlp/gpt/conf/ \
    --config-name=inference_initial_policy \
    gpt_model_file=/path/to/model.nemo \
    port=5556

This command launches the Initial Policy inference server on the local computer, using port 5556. You can modify all of the configurations in inference_initial_policy.yaml, or override them on the command line. Ensure that server is set to True in the configuration for this job to launch the inference server correctly.

Note

Data parallelism is not available for the inference servers, so only a single copy of the model is available.

Launching the PPO Critic Training and Inference Server

The PPO Critic must perform both training and inference, and is designed to have both capabilities. To launch the PPO Critic server in a Linux system, enter this command in the container:


cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_ppo_critic.py \
    --config-path=examples/nlp/gpt/conf/ \
    --config-name=gpt_ppo_critic \
    model.pretrained_checkpoint.restore_from_path=/path/to/trained_rm.nemo \
    inference.port=5557

This command launches the PPO Critic server on the local computer, using port 5557. You can modify all of the configurations in gpt_ppo_critic.yaml, or override them on the command line. Ensure that inference.server is set to True in this job’s configuration to launch the server correctly.

Note

Data parallelism is not available for the servers, so only a single copy of the model is available.

Launching the PPO Actor Training

The PPO Actor training job contains the master HTTP controller that makes HTTP calls to all three servers when needed. To launch the PPO Actor server in a Linux system, enter this command in the container:


cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/train_gpt_ppo_actor.py \
    --config-path=examples/nlp/gpt/conf \
    --config-name=gpt_ppo_actor \
    "model.data.data_prefix={train: [/path/to/train_data], validation: [/path/to/val_data], test: [/path/to/test_data]}" \
    model.pretrained_checkpoint.restore_from_path=/path/to/model.nemo

This command launches the PPO Actor job on the local computer. You can modify all of the configurations in gpt_ppo_actor.yaml, or override them on the command line.

Launching Every Job at Once with SLURM

You can use heterogeneous jobs to launch all four jobs simultaneously on different nodes, using a script like the following:


#!/bin/bash
#SBATCH -N 1 --ntasks-per-node 1 -t 4:00:00 --exclusive
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 1 -t 4:00:00 --exclusive
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 1 -t 4:00:00 --exclusive
#SBATCH hetjob
#SBATCH -N 8 --ntasks-per-node 8 -t 4:00:00 --exclusive

RM_MODEL=/path/to/reward_model.nemo
ACTOR_MODEL=/path/to/sft_model.nemo

DIR=/opt/nemo-rlhf
CONTAINER=<<Latest-NeMo-Training-Container>>

# START HETEROGENEOUS JOB 0
read -r -d '' cmd_rm_inference <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_reward_model.py \
    --config-path=examples/nlp/gpt/conf/ \
    --config-name=inference_rm \
    gpt_rm_model_file=${RM_MODEL} \
    port=${RM_PORT=5555}
EOF

srun --het-group=0 --container-image=${CONTAINER} bash -c "${cmd_rm_inference}" &
# END HETEROGENEOUS JOB 0

####################################################

# START HETEROGENEOUS JOB 1
read -r -d '' cmd_init_policy_inference <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_initial_policy.py \
    --config-path=examples/nlp/gpt/conf/ \
    --config-name=inference_initial_policy \
    gpt_model_file=${ACTOR_MODEL} \
    port=${INIT_POLICY_PORT=5556}
EOF

srun --het-group=1 -o $INIT_POLICY_OUTFILE -e $INIT_POLICY_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_init_policy_inference}" &
# END HETEROGENEOUS JOB 1

sleep 30

######################################################

# START HETEROGENEOUS JOB 2
read -r -d '' cmd_critic_inference <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_ppo_critic.py \
    --config-path=examples/nlp/gpt/conf/ \
    --config-name=gpt_ppo_critic \
    model.pretrained_checkpoint.restore_from_path=${RM_MODEL} \
    inference.port=${CRITIC_PORT=5557}
EOF

srun --het-group=2 --container-image=${CONTAINER} bash -c "${cmd_critic_inference}" &
# END HETEROGENEOUS JOB 2

sleep 30

####################################################

# START HETEROGENEOUS JOB 3
TRAIN_DATA_PATH=/path/to/train_data
VALID_DATA_PATH=/path/to/val_data
TEST_DATA_PATH=/path/to/test_data

read -r -d '' cmd_ppo <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/train_gpt_ppo_actor.py \
    --config-path=examples/nlp/gpt/conf \
    --config-name=gpt_ppo_actor \
    trainer.num_nodes=8 \
    "model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${TEST_DATA_PATH}]}" \
    model.pretrained_checkpoint.restore_from_path=${ACTOR_MODEL} \
    model.rlhf.reward_model.ip=${SLURM_JOB_NODELIST_HET_GROUP_0} \
    model.rlhf.reward_model.port=${RM_PORT} \
    model.rlhf.initial_policy.ip=${SLURM_JOB_NODELIST_HET_GROUP_1} \
    model.rlhf.initial_policy.port=${INIT_POLICY_PORT} \
    model.rlhf.critic.ip=${SLURM_JOB_NODELIST_HET_GROUP_2} \
    model.rlhf.critic.port=${CRITIC_PORT}
EOF

srun --het-group=3 --container-image=${CONTAINER} bash -c "${cmd_ppo}" &
# END HETEROGENEOUS JOB 3

wait

It is important to launch each srun command with a trailing “&” so that it does not block the next command. The wait statement at the end of the script ensures that the script does not exit until all of the commands have finished.

PPO Hyper-Parameters

You can control all of the model-related parameters the same way as in other NeMo training jobs. However, you can also control the behavior of PPO during training by setting hyper-parameters in the model.rlhf section of the YAML configuration file located at /opt/nemo-rlhf/examples/nlp/gpt/conf/gpt_ppo_actor.yaml in the NeMo training container.

These are the available hyper-parameters:

  • rlhf.reward_model: Specifies the IP address and port that the Reward Model uses, enabling communication with it.

  • rlhf.critic: Specifies the IP address and port that the PPO Critic uses, enabling communication with it.

  • rlhf.initial_policy: Specifies the IP address and port that the Initial Policy uses, enabling communication with it.

  • rlhf.ppo.entropy_penalty: Controls the effect of the entropy term in PPO. This incentivizes the model to produce more diverse outputs.

  • rlhf.ppo.initial_policy_kl_penalty: Controls the effect of the initial policy KL divergence term in PPO. This controls how much the model is allowed to diverge from its starting point.

  • rlhf.ppo.epochs: Specifies the number of epochs the actor and critic perform over the data stored in the rollout buffer during each training stage.

  • rlhf.ppo.num_rollout_samples: Specifies the number of samples to be generated during the rollout stage before moving to the training stage.

  • rlhf.ppo.rollout_micro_batch_size: Specifies the micro batch size for the rollout phase. Each GPU loads this many prompts and generates responses for them.

  • rlhf.ppo.ratio_eps: Specifies the epsilon value for clipping the PPO ratio during training.

  • rlhf.ppo.discount: Specifies a discount factor when calculating returns and advantages.

  • rlhf.ppo.gae_lambda: Specifies a lambda value for the Generalized Advantage Estimation (GAE) calculation.

  • rlhf.ppo.normalize_advantage: A boolean that indicates whether to normalize the advantages to have a mean of zero and a standard deviation of one. (The sketch after this list illustrates how discount, gae_lambda, and normalize_advantage are typically used.)

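To make the roles of discount, gae_lambda, and normalize_advantage concrete, here is a minimal sketch of standard Generalized Advantage Estimation for a single response. NeMo-RLHF's internal implementation may differ in details, and the function name and signature are illustrative only.

# Sketch of standard GAE over the per-token rewards and critic values of one response.
import torch

def compute_gae(rewards: torch.Tensor,   # [tokens] per-token rewards
                values: torch.Tensor,    # [tokens] critic value estimates
                discount: float = 1.0,
                gae_lambda: float = 0.95,
                normalize_advantage: bool = True):
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # Value of the next token, or 0 past the end of the response.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + discount * next_value - values[t]
        last_gae = delta + discount * gae_lambda * last_gae
        advantages[t] = last_gae
    returns = advantages + values   # targets for the critic
    if normalize_advantage:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
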
During the rollout phase, you can also modify the sampling parameters for the model by using the parameters in model.sampling_params in /opt/nemo-rlhf/examples/nlp/gpt/conf/gpt_ppo_actor.yaml.

NVIDIA expects to make the following enhancements and improvements to the NeMo framework in future releases:

  • Add support for training with datasets that have more than two responses per prompt. (The reward model training only supports datasets with two responses per prompt.)

  • Greatly increase the throughput of PPO.

  • Further improve the stability of the PPO learning process.
