Frequently Asked Questions
Can Cosmos WFMs capture brief dynamic actions?
Yes. You can prompt the Text2World models to generate videos that match specific dynamics, including brief actions.
What perspective do videos generated by the Cosmos Predict1 models use?
The perspective of generated videos can be specified using video or text prompts. If you post-train a model on videos from a particular autonomous vehicle (AV) or robot model, it will preferentially generate videos from that platform's perspective (including multi-camera views).
Do Cosmos WFMs generate videos with collisions or other unsafe scenarios?
Cosmos models don't deliberately avoid generating collision videos. In fact, out-of-distribution (OOD) data like this can help the downstream AV system evaluate and fine-tune for corner cases.
Videos generated by Cosmos Predict1 are only 5 seconds in length. Can I generate longer videos?
You can use Text2World to generate the first 5-second clip, then use Video2World, conditioned on that clip, to generate the next 5-second clip, and stitch the results together.
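As a rough illustration, the following Python sketch stitches two generated clips into one longer video. The `run_text2world` and `run_video2world` calls are hypothetical placeholders for however you invoke the Predict1 inference scripts in your environment; the concatenation itself uses the imageio library (with its ffmpeg plugin), and the frame rate should be set to match your generated clips.

```python
import imageio

def concatenate_videos(paths, out_path, fps=24):
    """Append the frames of each input clip, in order, into one output video."""
    writer = imageio.get_writer(out_path, fps=fps)
    for path in paths:
        reader = imageio.get_reader(path)
        for frame in reader:  # frames are numpy arrays of shape (H, W, 3)
            writer.append_data(frame)
        reader.close()
    writer.close()

# Hypothetical wrappers around the Text2World / Video2World inference commands:
# run_text2world(prompt="...", output="clip1.mp4")
# run_video2world(prompt="...", input_video="clip1.mp4", output="clip2.mp4")
concatenate_videos(["clip1.mp4", "clip2.mp4"], "combined_10s.mp4", fps=24)
```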
Can videos generated by Cosmos Predict1 be in higher resolution?
You can use the Transfer1 4K Upscaler to upscale generated videos to 4K resolution.
Is human intervention required for video labeling?
No, the video-labeling pipeline is fully automated. The Cosmos team has fine-tuned the filtering and annotation models used in the pipeline so that the labels are accurate and consistent.
Can I use assets from 3D engines like Unreal Engine and Omniverse?
Yes, you can use Transfer1 to convert videos from any simulation engine into a realistic style.
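If your simulation engine renders individual frames rather than an encoded video, a small packaging step can turn them into an MP4 suitable as Transfer1 input. The sketch below assumes sequentially named PNG frames in a directory (the paths and frame rate are placeholders) and uses the imageio library for encoding.

```python
from pathlib import Path
import imageio

def frames_to_video(frame_dir, out_path, fps=24):
    """Encode a directory of sequentially named PNG frames as an MP4 clip."""
    writer = imageio.get_writer(out_path, fps=fps)
    for frame_path in sorted(Path(frame_dir).glob("*.png")):
        writer.append_data(imageio.imread(frame_path))
    writer.close()

# Placeholder paths; point these at your renderer's output.
frames_to_video("renders/warehouse_scene", "sim_input.mp4")
```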
How long can input videos be for Predict1-Video2World inference?
The Cosmos-Predict1-7B/14B-Video2World models currently accept up to 9 input video frames. If the input video is longer than this, only the last 9 frames are used for inference.
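If you want to control exactly which frames are used, you can trim the input video to its final 9 frames yourself before running inference. This is a minimal sketch using the imageio library; the file names and frame rate are placeholders.

```python
from collections import deque
import imageio

def keep_last_frames(in_path, out_path, num_frames=9, fps=24):
    """Write a new clip containing only the final `num_frames` frames."""
    reader = imageio.get_reader(in_path)
    tail = deque(maxlen=num_frames)  # keeps only the most recent frames seen
    for frame in reader:
        tail.append(frame)
    reader.close()

    writer = imageio.get_writer(out_path, fps=fps)
    for frame in tail:
        writer.append_data(frame)
    writer.close()

# Placeholder paths; the trimmed clip can then be passed to Video2World.
keep_last_frames("long_input.mp4", "conditioning_frames.mp4")
```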