# Sampling Control
NVIDIA NIM for Cosmos WFM (World Foundation Models) exposes a suite of sampling parameters for fine-grained control over the model's generation behavior. Below is a complete reference for configuring the sampling parameters of an inference request.
## Sampling Parameters
Text-to-video requests (required input: `prompt`):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| prompt | str | — | Yes | The text prompt to use for output generation. Prompts up to 250 words in length are supported. |
| negative_prompt | str | "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." | No | A negative text prompt that specifies elements to avoid in the video generation. |
| prompt_upsampling | bool | True | No | Whether to use prompt upsampling before generation to enhance prompt understanding. |
| seed | int | None | No | The seed that governs generation. Changing the seed with other inputs fixed results in different outputs. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance_scale | float | 7.0 | No | The guidance (Classifier-Free Guidance) parameter that controls the balance between following the learned distribution of the model and amplifying the prompt influence during sample generation. Higher guidance values increase the weight of the prompt, resulting in sharper, more detailed, and highly aligned samples, but may reduce diversity and introduce artifacts. Lower guidance values produce more diverse samples closer to the learned distribution of the model, but may reduce alignment with the prompt. Adjusting this value allows you to fine-tune the trade-off between creativity and prompt adherence in the generated outputs. This value must be between 1.0 and 10.0. |
| steps | int | 35 | No | The number of diffusion sampling steps. Higher values generally produce better quality at the cost of longer generation time. This value must be in the range [1, 50]. |
| video_params | dict | `{"height": 704, "width": 1280, "frames_count": 121, "frames_per_sec": 24}` | No | The resolution and timing parameters of the generated video. Supported resolutions include: 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate can be adjusted within a range of 12 to 40 FPS. |
Image-to-video requests (required input: `image`):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| image | str | — | Yes | URL to image location OR base64-encoded image. |
| prompt | str | — | No | The text prompt to use for output generation. Prompts up to 250 words in length are supported. If a text prompt is not provided, an image caption is automatically generated and used as the text prompt. |
| negative_prompt | str | "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." | No | A negative text prompt that specifies elements to avoid in the video generation. |
| seed | int | None | No | The seed that governs generation. Changing the seed with other inputs fixed results in different outputs. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance_scale | float | 7.0 | No | The guidance (Classifier-Free Guidance) parameter that controls the balance between following the learned distribution of the model and amplifying the prompt influence during sample generation. Higher guidance values increase the weight of the prompt, resulting in sharper, more detailed, and highly aligned samples, but may reduce diversity and introduce artifacts. Lower guidance values produce more diverse samples closer to the learned distribution of the model, but may reduce alignment with the prompt. Adjusting this value allows you to fine-tune the trade-off between creativity and prompt adherence in the generated outputs. This value must be between 1.0 and 10.0. |
| steps | int | 35 | No | The number of diffusion sampling steps. Higher values generally produce better quality at the cost of longer generation time. This value must be in the range [1, 50]. |
| video_params | dict | `{"height": 704, "width": 1280, "frames_count": 121, "frames_per_sec": 24}` | No | The resolution and timing parameters of the generated video. Supported resolutions include: 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate can be adjusted within a range of 12 to 40 FPS. |
Video-to-video requests (required input: `video`):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| video | str | — | Yes | URL to video location OR base64-encoded video. |
| prompt | str | — | No | The text prompt to use for output generation. Prompts up to 250 words in length are supported. If a text prompt is not provided, a video caption is automatically generated and used as the text prompt. |
| negative_prompt | str | "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." | No | A negative text prompt that specifies elements to avoid in the video generation. |
| seed | int | None | No | The seed that governs generation. Changing the seed with other inputs fixed results in different outputs. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance_scale | float | 7.0 | No | The guidance (Classifier-Free Guidance) parameter that controls the balance between following the learned distribution of the model and amplifying the prompt influence during sample generation. Higher guidance values increase the weight of the prompt, resulting in sharper, more detailed, and highly aligned samples, but may reduce diversity and introduce artifacts. Lower guidance values produce more diverse samples closer to the learned distribution of the model, but may reduce alignment with the prompt. Adjusting this value allows you to fine-tune the trade-off between creativity and prompt adherence in the generated outputs. This value must be between 1.0 and 10.0. |
| steps | int | 35 | No | The number of diffusion sampling steps. Higher values generally produce better quality at the cost of longer generation time. This value must be in the range [1, 50]. |
| video_params | dict | `{"height": 704, "width": 1280, "frames_count": 121, "frames_per_sec": 24}` | No | The resolution and timing parameters of the generated video. Supported resolutions include: 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate can be adjusted within a range of 12 to 40 FPS. |
Video-to-video transfer requests with control inputs (required input: `video`, plus at least one of edge, depth, vis, or seg):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| video | str | — | Yes | Base64-encoded video or URL to the input video file. Supported container: MP4. Supported codecs: Raw (uncompressed), VP9, VP8, H.264 (AVC), H.265 (HEVC), AV1, MPEG-1, MPEG-2, MPEG-4. The input video must have between 93 and 480 frames (num_frames = duration * frame_rate). |
| prompt | str | — | Yes | Text prompt describing the desired output video style and content. |
| negative_prompt | str | "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The lighting looks very fake. The textures are very raw and basic. The geometries are very primitive. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic at all." | No | Negative prompt to specify undesired characteristics in the output video. |
| image_context | str | None | No | Base64-encoded image or URL to an optional image context to condition the generation. |
| seed | int | None | No | Random seed for reproducible generation. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance | int | 3 | No | Classifier-free guidance scale. Higher values increase prompt adherence. This value must be between 0 and 7. |
| num_steps | int | 35 | No | Number of diffusion sampling steps. More steps generally improve quality but increase runtime. This value must be at least 1. |
| resolution | str ("256", "480", "512", or "720") | "480" | No | Processing resolution used internally by the model. Input videos of any resolution are automatically resampled to this resolution for processing. The output video resolution matches the input video resolution. |
| num_conditional_frames | int (0, 1, or 2) | 1 | No | Number of frames to use as conditioning from the input video. |
| num_video_frames_per_chunk | int | 93 | No | Number of frames to process per chunk for memory efficiency. Must be at least 1. |
| sigma_max | float | None | No | Maximum noise level for the diffusion process (advanced parameter). |
| edge | dict | None | At least one control type must be provided | Edge detection control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for edge control), control_weight (0.0-1.0; default 1.0), preset_edge_threshold (one of: very_low, low, medium, high, very_high; default: medium). |
| depth | dict | None | At least one control type must be provided | Depth estimation control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for depth control), control_weight (0.0-1.0; default 1.0). |
| vis | dict | None | At least one control type must be provided | Visual/blur control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for visual control), control_weight (0.0-1.0; default 1.0), preset_blur_strength (one of: very_low, low, medium, high, very_high; default: medium). |
| seg | dict | None | At least one control type must be provided | Segmentation control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for segmentation control), control_weight (0.0-1.0; default 1.0), control_prompt (text prompt for on-the-fly segmentation using SAM2+GroundingDINO; describes what objects to segment, e.g. 'car building tree'; if not provided, the first 128 words of the main prompt are used as a fallback; only used when control is not provided). |
Note
Understanding control_weight: The control_weight parameter (0.0-1.0) controls the strength of each control input’s influence on the generated video. Higher values (closer to 1.0) enforce stricter adherence to the control structure, ideal for precision applications like autonomous driving. Lower values (closer to 0.0) allow more creative freedom and deviation from the control input, suitable for artistic applications. When multiple controls are combined (e.g., edge=1.0, depth=0.5, seg=0.8), weights exceeding a sum of 1.0 are automatically scaled down proportionally to maintain balanced influence while preserving their relative ratios.
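To make the scaling concrete, here is a small Python sketch of that rule. It illustrates the arithmetic only, assuming "scaled down proportionally" means normalizing the weights so they sum to 1.0; it is not the NIM's internal implementation.

```python
def scale_control_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights down proportionally when their sum exceeds 1.0,
    preserving their relative ratios (per the note above)."""
    total = sum(weights.values())
    if total <= 1.0:
        return dict(weights)  # already balanced; leave unchanged
    return {name: w / total for name, w in weights.items()}

# The combination from the note: edge=1.0, depth=0.5, seg=0.8 (sum 2.3)
print(scale_control_weights({"edge": 1.0, "depth": 0.5, "seg": 0.8}))
# -> {'edge': 0.4348, 'depth': 0.2174, 'seg': 0.3478} (approximately)
```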
Note
Auto-generated vs. explicit control inputs: To have the NIM auto-generate control inputs from the source video, omit the control field entirely—do not set it to null or an empty string, as this will cause errors.
Auto-generation (control extracted from the input video):

```json
{"edge": {"control_weight": 1.0}}
```

Explicit control (provide your own control video):

```json
{"edge": {"control_weight": 1.0, "control": "https://example.com/edge_video.mp4"}}
```
Tip
Timeout errors with URL-based videos: If you encounter timeout errors when providing video URLs (especially for larger files), use base64-encoded videos instead. URL downloads have an internal timeout, and large files may not download in time on the first request. Base64-encoded videos bypass this limitation. See the Using a Local Video File section for encoding examples.
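For illustration, the following Python sketch builds a Transfer request payload with an inline base64 video in place of a URL. The file path `input.mp4` and the payload values are placeholders; see the Using a Local Video File section for the authoritative encoding steps.

```python
import base64
import json

# Read a local video (hypothetical path) and inline it as base64,
# avoiding the URL-download timeout described in the tip above.
with open("input.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion.",
    "video": video_b64,                # base64 string instead of a URL
    "resolution": "480",
    "edge": {"control_weight": 1.0},   # edge control auto-generated from the input video
}

# Writing the payload to disk lets you send it with `curl -d @payload.json`,
# which sidesteps shell argument-length limits for large videos.
with open("payload.json", "w") as f:
    json.dump(payload, f)
```

The payload can then be sent with `curl -X POST http://0.0.0.0:8000/v1/infer -H 'Content-Type: application/json' -d @payload.json`.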
## Examples
The following examples demonstrate how to use the NIM for Cosmos WFM models via API calls. Note that generation typically takes several minutes, depending on the hardware used.
Text-to-video:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The teal robot is cooking food in a kitchen. Steam rises from a simmering pot as the robot chops vegetables on a worn wooden cutting board. Copper pans hang from an overhead rack, catching glints of afternoon light, while a well-loved cast iron skillet sits on the stovetop next to scattered measuring spoons and a half-empty bottle of olive oil.",
    "negative_prompt": "blurry, low quality, artifacts, people",
    "prompt_upsampling": true,
    "seed": 42,
    "guidance_scale": 7.5,
    "steps": 35,
    "video_params": {
      "height": 704,
      "width": 1280,
      "frames_count": 121,
      "frames_per_sec": 24
    }
  }'
```
Image-to-video:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The video is a wide shot of a large industrial facility, likely a chemical plant or factory, situated in a rural or semi-industrial area. The scene is set during a partly cloudy day, with the sky showing patches of blue and white clouds. The facility is surrounded by a vast expanse of green fields, indicating its location in a countryside or suburban area. The factory itself is a large, rectangular building with a flat roof, constructed from concrete and metal. It features several large cylindrical tanks and pipes, suggesting the processing of chemicals or liquids. The tanks are arranged in a linear fashion along the side of the building, and there are several smaller structures and equipment scattered around the premises. The camera remains static throughout the video, capturing the entire facility from a distance, allowing viewers to observe the layout and scale of the operations. The lighting is natural, with sunlight casting shadows on the ground, enhancing the details of the industrial setup. There are no visible human activities or movements, indicating that the video might be a documentary or an informational piece about industrial processes.",
    "negative_prompt": "blurry, low quality, artifacts, people",
    "image": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos/industry_01_prompt.jpg",
    "seed": 42,
    "guidance_scale": 7.5,
    "steps": 35,
    "video_params": {
      "height": 704,
      "width": 1280,
      "frames_count": 121,
      "frames_per_sec": 24
    }
  }'
```
Video-to-video:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "A first-person view from the perspective of a human-sized robot as it works in a chemical plant. The robot has many boxes and supplies nearby on the industrial shelves. The camera is moving forward, at a height of 1m above the floor. Photorealistic",
    "negative_prompt": "blurry, low quality, artifacts, people",
    "video": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos/ar_result_default_robot.mp4",
    "seed": 42,
    "guidance_scale": 7.5,
    "steps": 35,
    "video_params": {
      "height": 704,
      "width": 1280,
      "frames_count": 121,
      "frames_per_sec": 24
    }
  }'
```
Video-to-video transfer with an explicit edge control:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion in a neutral lab setting.",
    "video": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/robot_input.mp4",
    "resolution": "480",
    "edge": {
      "control_weight": 1.0,
      "control": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/edge/robot_edge.mp4"
    }
  }'
```
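The same requests can also be issued from Python. Below is a minimal sketch of the Transfer example using the third-party requests library; because generation can take several minutes, the read timeout is set generously. The response schema is not covered in this section, so the sketch only inspects the Content-Type and writes the raw body to disk.

```python
import requests  # third-party: pip install requests

payload = {
    "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion in a neutral lab setting.",
    "video": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/robot_input.mp4",
    "resolution": "480",
    "edge": {"control_weight": 1.0},  # auto-generate the edge control from the input video
}

# (connect timeout, read timeout) in seconds; generation may take several minutes.
resp = requests.post("http://0.0.0.0:8000/v1/infer", json=payload, timeout=(10, 1800))
resp.raise_for_status()

# Check the Content-Type before deciding how to parse the body,
# then keep the raw response for further processing.
print(resp.headers.get("Content-Type"))
with open("response.bin", "wb") as f:
    f.write(resp.content)
```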