Video2World LoRA Post-training with Cosmos-NeMo-Assets#
This guide provides instructions on running LoRA (Low-Rank Adaptation) post-training with the Cosmos-Predict2.5 Video2World 2B model.
What is LoRA?#
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that allows you to adapt large pre-trained models to specific domains or tasks by training only a small number of additional parameters.
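To build intuition, the sketch below shows the core idea applied to a single linear layer: the pretrained weight is frozen and a low-rank update, scaled by alpha / rank, is learned instead. This is a minimal, self-contained illustration in plain PyTorch with hypothetical defaults, not the actual Cosmos-Predict2.5 implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal illustration of LoRA on one linear layer (not the Cosmos implementation)."""

    def __init__(self, base: nn.Linear, rank: int = 32, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.kaiming_uniform_(self.lora_a.weight)
        nn.init.zeros_(self.lora_b.weight)          # zero init so training starts from the base model
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / rank) * B(A(x)); only A and B receive gradients
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

Only the two small matrices A and B are trained, which is where the memory and storage savings listed below come from.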
Key Benefits of LoRA Post-Training#
Memory Efficiency: Only trains ~1-2% of total model parameters (see the rough estimate after this list)
Faster Training: Significantly reduced training time per iteration
Storage Efficiency: LoRA checkpoints are much smaller than full model checkpoints
Flexibility: Can maintain multiple LoRA adapters for different domains
Preserved Base Capabilities: Retains the original model’s capabilities while adding domain-specific improvements
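The rough estimate below illustrates where the ~1-2% figure comes from. All of the dimensions used here (hidden size, MLP width, block count) are illustrative assumptions rather than the actual Cosmos-Predict2.5 2B architecture; the point is only that rank-32 adapters on the attention and MLP projections amount to a few tens of millions of parameters against a ~2B base.

def estimate_lora_fraction(d_model: int = 2048, mlp_hidden: int = 8192,
                           n_blocks: int = 28, rank: int = 32,
                           base_params: float = 2e9) -> float:
    """Back-of-the-envelope LoRA parameter fraction; all sizes are assumed, not the real model."""
    # q_proj, k_proj, v_proj, output_proj: each adds A (d x r) and B (r x d)
    attn = 4 * 2 * d_model * rank
    # mlp.layer1 (d -> hidden) and mlp.layer2 (hidden -> d): A and B for each
    mlp = 2 * rank * (d_model + mlp_hidden)
    return n_blocks * (attn + mlp) / base_params

print(f"{estimate_lora_fraction():.1%}")  # ~1.7% with these assumed sizes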
When to Use LoRA vs Full Fine-tuning#
Use LoRA when:
You have limited compute resources
You want to create domain-specific adapters
You need to preserve the base model’s general capabilities
You’re working with smaller datasets
Use full fine-tuning when:
You need maximum model adaptation
You have sufficient compute and storage
You’re making fundamental changes to model behavior
Preparing Data#
Downloading Cosmos-NeMo-Assets#
The first step is to download a dataset of videos.
You must provide a folder containing a collection of MP4 videos, preferably at 720p resolution. The videos should keep the subject in frame throughout, so that every video chunk contains the subject. You can use nvidia/Cosmos-NeMo-Assets for post-training.
To download the dataset, run the following commands:
mkdir -p datasets/cosmos_nemo_assets/
# This command will download the videos for physical AI
hf download nvidia/Cosmos-NeMo-Assets \
--repo-type dataset \
--local-dir datasets/cosmos_nemo_assets/ \
--include "*.mp4*"
mv datasets/cosmos_nemo_assets/nemo_diffusion_example_data datasets/cosmos_nemo_assets/videos
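After the download and rename, it is worth a quick check that the MP4s landed in datasets/cosmos_nemo_assets/videos. The snippet below is an optional, hypothetical sanity check, not part of the repository:

from pathlib import Path

# Optional sanity check: confirm the downloaded MP4s are where the training config expects them.
video_dir = Path("datasets/cosmos_nemo_assets/videos")
videos = sorted(video_dir.glob("*.mp4"))
print(f"Found {len(videos)} MP4 files in {video_dir}")
for video in videos[:5]:
    print("  ", video.name)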
Preprocessing the Data#
Cosmos-NeMo-Assets comes with a single caption for its four long videos.
Creating Prompt Files#
To generate text prompt files for each video in the dataset, use the provided preprocessing script:
# Create prompt files for all videos with a custom prompt
python -m scripts.create_prompts_for_nemo_assets \
--dataset_path datasets/cosmos_nemo_assets \
--prompt "A video of sks teal robot."
Dataset folder format:
datasets/cosmos_nemo_assets/
├── metas/
│ └── *.txt
└── videos/
└── *.mp4
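Conceptually, the prompt-creation step writes one metas/<name>.txt per videos/<name>.mp4, each containing the caption passed via --prompt. The sketch below mirrors that layout; it is a simplified stand-in, not the actual scripts.create_prompts_for_nemo_assets implementation:

from pathlib import Path

# Simplified stand-in for the prompt-creation step: one caption file per video.
dataset_path = Path("datasets/cosmos_nemo_assets")
prompt = "A video of sks teal robot."

metas_dir = dataset_path / "metas"
metas_dir.mkdir(parents=True, exist_ok=True)
for video in (dataset_path / "videos").glob("*.mp4"):
    (metas_dir / f"{video.stem}.txt").write_text(prompt)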
LoRA Post-training#
Understanding the LoRA Configuration#
The LoRA configuration is defined in cosmos_predict2/experiments/base/cosmos_nemo_assets_lora.py. Key LoRA parameters include:
model=dict(
    config=dict(
        use_lora=True,
        lora_rank=32,  # Rank of LoRA adaptation matrices
        lora_alpha=32,  # LoRA scaling parameter
        lora_target_modules="q_proj,k_proj,v_proj,output_proj,mlp.layer1,mlp.layer2",
        init_lora_weights=True,  # Properly initialize LoRA weights
    ),
),
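With use_lora=True, only the adapter weights injected into the listed target modules should be trainable. The helper below is a hypothetical check (the exact parameter names depend on the implementation) that you could run on the constructed model to confirm how few parameters receive gradients:

def summarize_trainable_params(model) -> None:
    """Hypothetical helper: report how many parameters are trainable after enabling LoRA."""
    trainable = [(name, p.numel()) for name, p in model.named_parameters() if p.requires_grad]
    total = sum(p.numel() for p in model.parameters())
    n_trainable = sum(count for _, count in trainable)
    print(f"trainable: {n_trainable:,} / {total:,} ({100.0 * n_trainable / total:.2f}%)")
    for name, _ in trainable[:10]:  # a few adapter parameter names for spot-checking
        print("  ", name)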
Running LoRA Post-training#
Run the following command to execute LoRA post-training with the cosmos_nemo_assets data:
torchrun --nproc_per_node=8 scripts/train.py \
--config=cosmos_predict2/_src/predict2/configs/video2world/config.py -- \
experiment=predict2_video2world_lora_training_2b_cosmos_nemo_assets
The model will be post-trained using the cosmos_nemo_assets dataset. Refer to the cosmos_predict2/experiments/base/cosmos_nemo_assets_lora.py config to understand how the dataloader is defined.
# Cosmos-NeMo-Assets video2world dataset and dataloader
example_video_dataset_cosmos_nemo_assets_lora = L(VideoDataset)(
    dataset_dir="datasets/cosmos_nemo_assets",
    num_frames=93,
    video_size=(704, 1280),
)

dataloader_train_cosmos_nemo_assets_lora = L(get_generic_dataloader)(
    dataset=example_video_dataset_cosmos_nemo_assets_lora,
    sampler=L(get_sampler)(dataset=example_video_dataset_cosmos_nemo_assets_lora),
    batch_size=1,
    drop_last=True,
    num_workers=4,
    pin_memory=True,
)
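The dataset config samples 93-frame clips at 704x1280, so every source video should contain at least 93 frames. The optional check below uses OpenCV (opencv-python) to report frame counts and resolutions; it is not part of the repository, and frame counts reported this way are approximate:

from pathlib import Path

import cv2  # pip install opencv-python

# Optional check that each source video can supply a 93-frame clip.
for video in sorted(Path("datasets/cosmos_nemo_assets/videos").glob("*.mp4")):
    cap = cv2.VideoCapture(str(video))
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    cap.release()
    status = "ok" if n_frames >= 93 else "too short"
    print(f"{video.name}: {n_frames} frames, {width}x{height} ({status})")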
Checkpoints are saved to ${IMAGINAIRE_OUTPUT_ROOT}/PROJECT/GROUP/NAME/checkpoints. By default, IMAGINAIRE_OUTPUT_ROOT is /tmp/imaginaire4-output. We strongly recommend setting IMAGINAIRE_OUTPUT_ROOT to a location with sufficient storage space for your checkpoints.
In the above example, PROJECT is cosmos_predict_v2p5, GROUP is video2world_lora, and NAME is 2b_cosmos_nemo_assets.
See the job config to understand how they are determined.
predict2_video2world_lora_training_2b_cosmos_nemo_assets = dict(
    dict(
        ...
        job=dict(
            project="cosmos_predict_v2p5",
            group="video2world_lora",
            name="2b_cosmos_nemo_assets",
        ),
        ...
    )
)
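Putting the pieces together, the checkpoint directory is simply the output root joined with the job fields above. A small illustration (mirroring the shell variables used in the conversion step below):

import os

# How the checkpoint location is assembled from IMAGINAIRE_OUTPUT_ROOT and the job config.
output_root = os.environ.get("IMAGINAIRE_OUTPUT_ROOT", "/tmp/imaginaire4-output")
project, group, name = "cosmos_predict_v2p5", "video2world_lora", "2b_cosmos_nemo_assets"
checkpoints_dir = os.path.join(output_root, project, group, name, "checkpoints")
print(checkpoints_dir)  # e.g. /tmp/imaginaire4-output/cosmos_predict_v2p5/video2world_lora/2b_cosmos_nemo_assets/checkpoints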
Inference with the LoRA Post-trained Checkpoint#
Converting DCP Checkpoint to Consolidated PyTorch Format#
Since the checkpoints are saved in DCP format during training, you need to convert them to consolidated PyTorch format (.pt) for inference. Use the convert_distcp_to_pt.py script:
# Get path to the latest checkpoint
CHECKPOINTS_DIR=${IMAGINAIRE_OUTPUT_ROOT:-/tmp/imaginaire4-output}/cosmos_predict_v2p5/video2world_lora/2b_cosmos_nemo_assets/checkpoints
CHECKPOINT_ITER=$(cat $CHECKPOINTS_DIR/latest_checkpoint.txt)
CHECKPOINT_DIR=$CHECKPOINTS_DIR/$CHECKPOINT_ITER
# Convert DCP checkpoint to PyTorch format
python scripts/convert_distcp_to_pt.py $CHECKPOINT_DIR/model $CHECKPOINT_DIR
This conversion will create three files:
model.pt: Full checkpoint containing both regular and EMA weights
model_ema_fp32.pt: EMA weights only, in float32 precision
model_ema_bf16.pt: EMA weights only, in bfloat16 precision (recommended for inference)
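If you want to confirm the conversion before running inference, you can load the consolidated checkpoint and peek at a few entries. This is an optional inspection sketch; the exact key layout depends on the converter script:

import torch

# Optional inspection of the converted EMA checkpoint.
ckpt = torch.load("model_ema_bf16.pt", map_location="cpu", weights_only=False)
state_dict = ckpt if isinstance(ckpt, dict) else ckpt.state_dict()
print(f"{len(state_dict)} entries")
for key, value in list(state_dict.items())[:5]:
    if torch.is_tensor(value):
        print(f"  {key}: {tuple(value.shape)} {value.dtype}")
    else:
        print(f"  {key}: {type(value).__name__}")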
Running Inference#
After converting the checkpoint, you can run inference with your post-trained model using a JSON configuration file that specifies the inference parameters (see assets/video2world_cosmos_nemo_assets/nemo_image2world.json for an example).
torchrun --nproc_per_node=8 examples/inference.py \
assets/video2world_cosmos_nemo_assets/nemo_image2world.json \
outputs/cosmos_nemo_posttraining \
--checkpoint-path $CHECKPOINT_DIR/model_ema_bf16.pt \
--experiment predict2_video2world_lora_training_2b_cosmos_nemo_assets
Generated videos will be saved to the output directory (e.g., outputs/cosmos_nemo_posttraining/).