# Sampling Control
NVIDIA NIM for Cosmos WFM (World Foundation Models) exposes a suite of sampling parameters for fine-grained control over the model's generation behavior. Below is a complete reference for configuring the sampling parameters of an inference request.
## Sampling Parameters
Text-to-video requests (required input: `prompt`):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| prompt | str | — | Yes | The text prompt to use for output generation. Prompts up to 250 words in length are supported. |
| negative_prompt | str | "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." | No | A negative text prompt that specifies elements to avoid in the video generation. |
| prompt_upsampling | bool | True | No | Whether to use prompt upsampling before generation to enhance prompt understanding. |
| seed | int | None | No | The seed that governs generation. Changing the seed with other inputs fixed results in different outputs. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance_scale | float | 7.0 | No | The guidance (Classifier-Free Guidance) parameter that controls the balance between following the learned distribution of the model and amplifying the prompt influence during sample generation. Higher guidance values increase the weight of the prompt, resulting in sharper, more detailed, and highly aligned samples, but may reduce diversity and introduce artifacts. Lower guidance values produce more diverse samples closer to the learned distribution of the model, but may reduce alignment with the prompt. Adjusting this value allows you to fine-tune the trade-off between creativity and prompt adherence in the generated outputs. This value must be between 1.0 and 10.0. |
| steps | int | 35 | No | The number of diffusion sampling steps. Higher values generally produce better quality at the cost of longer generation time. This value must be in the range [1, 50]. |
| video_params | dict | `{"height": 704, "width": 1280, "frames_count": 121, "frames_per_sec": 24}` | No | The resolution and timing parameters of the generated video. Supported resolutions include: 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate can be adjusted within a range of 12 to 40 FPS. |
Image-to-video requests (required input: `image`):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| image | str | — | Yes | URL to image location OR base64-encoded image. |
| prompt | str | — | No | The text prompt to use for output generation. Prompts up to 250 words in length are supported. If a text prompt is not provided, an image caption is automatically generated and used as the text prompt. |
| negative_prompt | str | "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." | No | A negative text prompt that specifies elements to avoid in the video generation. |
| seed | int | None | No | The seed that governs generation. Changing the seed with other inputs fixed results in different outputs. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance_scale | float | 7.0 | No | The guidance (Classifier-Free Guidance) parameter that controls the balance between following the learned distribution of the model and amplifying the prompt influence during sample generation. Higher guidance values increase the weight of the prompt, resulting in sharper, more detailed, and highly aligned samples, but may reduce diversity and introduce artifacts. Lower guidance values produce more diverse samples closer to the learned distribution of the model, but may reduce alignment with the prompt. Adjusting this value allows you to fine-tune the trade-off between creativity and prompt adherence in the generated outputs. This value must be between 1.0 and 10.0. |
| steps | int | 35 | No | The number of diffusion sampling steps. Higher values generally produce better quality at the cost of longer generation time. This value must be in the range [1, 50]. |
| video_params | dict | `{"height": 704, "width": 1280, "frames_count": 121, "frames_per_sec": 24}` | No | The resolution and timing parameters of the generated video. Supported resolutions include: 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate can be adjusted within a range of 12 to 40 FPS. |
Video-to-video requests (required input: `video`):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| video | str | — | Yes | URL to video location OR base64-encoded video. |
| prompt | str | — | No | The text prompt to use for output generation. Prompts up to 250 words in length are supported. If a text prompt is not provided, a video caption is automatically generated and used as the text prompt. |
| negative_prompt | str | "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." | No | A negative text prompt that specifies elements to avoid in the video generation. |
| seed | int | None | No | The seed that governs generation. Changing the seed with other inputs fixed results in different outputs. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance_scale | float | 7.0 | No | The guidance (Classifier-Free Guidance) parameter that controls the balance between following the learned distribution of the model and amplifying the prompt influence during sample generation. Higher guidance values increase the weight of the prompt, resulting in sharper, more detailed, and highly aligned samples, but may reduce diversity and introduce artifacts. Lower guidance values produce more diverse samples closer to the learned distribution of the model, but may reduce alignment with the prompt. Adjusting this value allows you to fine-tune the trade-off between creativity and prompt adherence in the generated outputs. This value must be between 1.0 and 10.0. |
| steps | int | 35 | No | The number of diffusion sampling steps. Higher values generally produce better quality at the cost of longer generation time. This value must be in the range [1, 50]. |
| video_params | dict | `{"height": 704, "width": 1280, "frames_count": 121, "frames_per_sec": 24}` | No | The resolution and timing parameters of the generated video. Supported resolutions include: 1:1 (960x960 pixels), 4:3 (960x704 pixels), 3:4 (704x960 pixels), 16:9 (1280x704 pixels), and 9:16 (704x1280 pixels). The frame rate can be adjusted within a range of 12 to 40 FPS. |
Video-to-video transfer requests with control inputs (required input: `video`, plus at least one of edge, depth, vis, or seg):

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| video | str | — | Yes | Base64-encoded video or URL to the input video file. Supported container: MP4. Supported codecs: Raw (uncompressed), VP9, VP8, H.264 (AVC), H.265 (HEVC), AV1, MPEG-1, MPEG-2, MPEG-4. The input video must have between 93 and 480 frames (num_frames = duration * frame_rate). |
| prompt | str | — | Yes | Text prompt describing the desired output video style and content. |
| negative_prompt | str | "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The lighting looks very fake. The textures are very raw and basic. The geometries are very primitive. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic at all." | No | Negative prompt to specify undesired characteristics in the output video. |
| image_context | str | None | No | Base64-encoded image or URL to an optional image context to condition the generation. |
| seed | int | None | No | Random seed for reproducible generation. Valid seed values are in the range [0, 4294967295]. When not specified, a random seed is used. |
| guidance | int | 3 | No | Classifier-free guidance scale. Higher values increase prompt adherence. This value must be between 0 and 7. |
| num_steps | int | 35 | No | Number of diffusion sampling steps. More steps generally improve quality but increase runtime. This value must be at least 1. |
| resolution | str ("256", "480", "512", or "720") | "480" | No | Processing resolution used internally by the model. Input videos of any resolution are automatically resampled to this resolution for processing. The output video resolution matches the input video resolution. |
| num_conditional_frames | int (0, 1, or 2) | 1 | No | Number of frames to use as conditioning from the input video. |
| num_video_frames_per_chunk | int | 93 | No | Number of frames to process per chunk for memory efficiency. Must be at least 1. |
| sigma_max | float | None | No | Maximum noise level for the diffusion process (advanced parameter). |
| edge | dict | None | At least one control type must be provided | Edge detection control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for edge control), control_weight (0.0-1.0; default 1.0), preset_edge_threshold (one of: very_low, low, medium, high, very_high; default: medium). |
| depth | dict | None | At least one control type must be provided | Depth estimation control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for depth control), control_weight (0.0-1.0; default 1.0). |
| vis | dict | None | At least one control type must be provided | Visual/blur control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for visual control), control_weight (0.0-1.0; default 1.0), preset_blur_strength (one of: very_low, low, medium, high, very_high; default: medium). |
| seg | dict | None | At least one control type must be provided | Segmentation control parameters. Contains: control (optional; base64-encoded or URL; if omitted, auto-generated from the source video), mask (base64-encoded or URL to the mask for segmentation control), control_weight (0.0-1.0; default 1.0), control_prompt (text prompt for on-the-fly segmentation using SAM2+GroundingDINO; describes what objects to segment, e.g. 'car building tree'; if not provided, the first 128 words of the main prompt are used as a fallback; only used when control is not provided). |
Note
Understanding control_weight: The control_weight parameter (0.0-1.0) controls the strength of each control input’s influence on the generated video. Higher values (closer to 1.0) enforce stricter adherence to the control structure, ideal for precision applications like autonomous driving. Lower values (closer to 0.0) allow more creative freedom and deviation from the control input, suitable for artistic applications. When multiple controls are combined (e.g., edge=1.0, depth=0.5, seg=0.8), weights exceeding a sum of 1.0 are automatically scaled down proportionally to maintain balanced influence while preserving their relative ratios.
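To make the scaling concrete, here is a small Python sketch of that rule. It illustrates the arithmetic only, assuming "scaled down proportionally" means normalizing the weights so they sum to 1.0; it is not the NIM's internal implementation.

```python
def scale_control_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights down proportionally when their sum exceeds 1.0,
    preserving their relative ratios (per the note above)."""
    total = sum(weights.values())
    if total <= 1.0:
        return dict(weights)  # already balanced; leave unchanged
    return {name: w / total for name, w in weights.items()}

# The combination from the note: edge=1.0, depth=0.5, seg=0.8 (sum 2.3)
print(scale_control_weights({"edge": 1.0, "depth": 0.5, "seg": 0.8}))
# -> {'edge': 0.4348, 'depth': 0.2174, 'seg': 0.3478} (approximately)
```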
Note
Auto-generated vs. explicit control inputs: To have the NIM auto-generate control inputs from the source video, omit the control field entirely—do not set it to null or an empty string, as this will cause errors.
Auto-generation (control extracted from the input video):

```json
{"edge": {"control_weight": 1.0}}
```

Explicit control (provide your own control video):

```json
{"edge": {"control_weight": 1.0, "control": "https://example.com/edge_video.mp4"}}
```
Tip
Timeout errors with URL-based videos: If you encounter timeout errors when providing video URLs (especially for larger files), use base64-encoded videos instead. URL downloads have an internal timeout, and large files may not download in time on the first request. Base64-encoded videos bypass this limitation. See the Using a Local Video File section for encoding examples.
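For illustration, the following Python sketch builds a Transfer request payload with an inline base64 video in place of a URL. The file path `input.mp4` and the payload values are placeholders; see the Using a Local Video File section for the authoritative encoding steps.

```python
import base64
import json

# Read a local video (hypothetical path) and inline it as base64,
# avoiding the URL-download timeout described in the tip above.
with open("input.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion.",
    "video": video_b64,                # base64 string instead of a URL
    "resolution": "480",
    "edge": {"control_weight": 1.0},   # edge control auto-generated from the input video
}

# Writing the payload to disk lets you send it with `curl -d @payload.json`,
# which sidesteps shell argument-length limits for large videos.
with open("payload.json", "w") as f:
    json.dump(payload, f)
```

The payload can then be sent with `curl -X POST http://0.0.0.0:8000/v1/infer -H 'Content-Type: application/json' -d @payload.json`.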
## Examples
The following examples demonstrate how to use the NIM for Cosmos WFM models via API calls. Note that generation typically takes several minutes, depending on the hardware used.
Text-to-video:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The teal robot is cooking food in a kitchen. Steam rises from a simmering pot as the robot chops vegetables on a worn wooden cutting board. Copper pans hang from an overhead rack, catching glints of afternoon light, while a well-loved cast iron skillet sits on the stovetop next to scattered measuring spoons and a half-empty bottle of olive oil.",
    "negative_prompt": "blurry, low quality, artifacts, people",
    "prompt_upsampling": true,
    "seed": 42,
    "guidance_scale": 7.5,
    "steps": 35,
    "video_params": {
      "height": 704,
      "width": 1280,
      "frames_count": 121,
      "frames_per_sec": 24
    }
  }'
```
Image-to-video:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The video is a wide shot of a large industrial facility, likely a chemical plant or factory, situated in a rural or semi-industrial area. The scene is set during a partly cloudy day, with the sky showing patches of blue and white clouds. The facility is surrounded by a vast expanse of green fields, indicating its location in a countryside or suburban area. The factory itself is a large, rectangular building with a flat roof, constructed from concrete and metal. It features several large cylindrical tanks and pipes, suggesting the processing of chemicals or liquids. The tanks are arranged in a linear fashion along the side of the building, and there are several smaller structures and equipment scattered around the premises. The camera remains static throughout the video, capturing the entire facility from a distance, allowing viewers to observe the layout and scale of the operations. The lighting is natural, with sunlight casting shadows on the ground, enhancing the details of the industrial setup. There are no visible human activities or movements, indicating that the video might be a documentary or an informational piece about industrial processes.",
    "negative_prompt": "blurry, low quality, artifacts, people",
    "image": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos/industry_01_prompt.jpg",
    "seed": 42,
    "guidance_scale": 7.5,
    "steps": 35,
    "video_params": {
      "height": 704,
      "width": 1280,
      "frames_count": 121,
      "frames_per_sec": 24
    }
  }'
```
Video-to-video:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "A first-person view from the perspective of a human-sized robot as it works in a chemical plant. The robot has many boxes and supplies nearby on the industrial shelves. The camera is moving forward, at a height of 1m above the floor. Photorealistic",
    "negative_prompt": "blurry, low quality, artifacts, people",
    "video": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos/ar_result_default_robot.mp4",
    "seed": 42,
    "guidance_scale": 7.5,
    "steps": 35,
    "video_params": {
      "height": 704,
      "width": 1280,
      "frames_count": 121,
      "frames_per_sec": 24
    }
  }'
```
Video-to-video transfer with an explicit edge control:

```sh
curl -X POST \
  'http://0.0.0.0:8000/v1/infer' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion in a neutral lab setting.",
    "video": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/robot_input.mp4",
    "resolution": "480",
    "edge": {
      "control_weight": 1.0,
      "control": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/edge/robot_edge.mp4"
    }
  }'
```
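The same requests can also be issued from Python. Below is a minimal sketch of the Transfer example using the third-party requests library; because generation can take several minutes, the read timeout is set generously. The response schema is not covered in this section, so the sketch only inspects the Content-Type and writes the raw body to disk.

```python
import requests  # third-party: pip install requests

payload = {
    "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion in a neutral lab setting.",
    "video": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/robot_input.mp4",
    "resolution": "480",
    "edge": {"control_weight": 1.0},  # auto-generate the edge control from the input video
}

# (connect timeout, read timeout) in seconds; generation may take several minutes.
resp = requests.post("http://0.0.0.0:8000/v1/infer", json=payload, timeout=(10, 1800))
resp.raise_for_status()

# Check the Content-Type before deciding how to parse the body,
# then keep the raw response for further processing.
print(resp.headers.get("Content-Type"))
with open("response.bin", "wb") as f:
    f.write(resp.content)
```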