Cosmos-Transfer1#

Cosmos-Transfer1 is a set of pre-trained, diffusion-based conditional world models designed for multimodal, controllable world generation. These models create world simulations from multiple spatial control inputs across modalities such as segmentation, depth, and edge maps. Cosmos-Transfer1 can weight each conditional input differently at different spatial locations and points in time, enabling highly customizable world generation. This capability is particularly useful for world-to-world transfer applications, including Sim2Real.

The architecture of Cosmos-Transfer1 is shown in the following figure:

../_images/transfer1_diagram.png

Cosmos-Transfer1 includes the following components:

  • ControlNet-based single-modality conditional world generation: Generates a visual simulation based on one of the following modalities: segmentation video, depth video, edge video, blur video, LiDAR video, or HDMap video. Cosmos-Transfer1 generates a video based on the single-modality conditional input, a user text prompt, and, optionally, an input RGB video frame prompt (which could come from the last video generation result when operating in the autoregressive setting). We use Cosmos-Transfer1-7B [Modality] to refer to the model operating in this setting. For example, Cosmos-Transfer1-7B [Depth] refers to a depth ControlNet model.

  • MultiControlNet-based multimodal conditional world generation: Generates a visual simulation based on any combination of segmentation video, depth video, edge video, and blur video (or LiDAR video and HDMap video in the AV sample), with a spatiotemporal control map that controls the strength of each modality across space and time. Cosmos-Transfer1 generates a video based on the multimodal conditional inputs, a user text prompt, and, optionally, an input RGB video frame prompt (which could come from the last video generation result when operating in the autoregressive setting). This is the preferred mode of Cosmos-Transfer1. We refer to it as Cosmos-Transfer1-7B.

  • 4KUpscaler: Upscales 720p-resolution video to 4K resolution.
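
To make the idea of a spatiotemporal control map more concrete, the following is a minimal sketch in plain Python, not the Cosmos-Transfer1 implementation. It assumes each modality has one weight per frame, row, and column, and that weights are rescaled to sum to 1 wherever at least one modality is active. The helper names (`constant_map`, `normalize_control_maps`) and the normalization rule are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical layout: weights indexed as [frame][row][column].
Grid = List[List[List[float]]]


def constant_map(value: float, frames: int, height: int, width: int) -> Grid:
    """Build a control map that applies the same weight everywhere."""
    return [[[value] * width for _ in range(height)] for _ in range(frames)]


def normalize_control_maps(maps: Dict[str, Grid]) -> Dict[str, Grid]:
    """Rescale weights to sum to 1 at every location with a positive total.

    This normalization rule is an assumption made for this sketch.
    Locations where every modality has weight 0 are left at 0.
    """
    if not maps:
        return {}
    names = list(maps)
    first = maps[names[0]]
    frames, height, width = len(first), len(first[0]), len(first[0][0])
    out = {name: constant_map(0.0, frames, height, width) for name in names}
    for t in range(frames):
        for y in range(height):
            for x in range(width):
                total = sum(maps[name][t][y][x] for name in names)
                if total > 0:
                    for name in names:
                        out[name][t][y][x] = maps[name][t][y][x] / total
    return out


# Two frames of a tiny 2x4 video: depth guides the whole frame,
# while edge is active only in the right half.
depth = constant_map(0.5, frames=2, height=2, width=4)
edge = constant_map(0.0, frames=2, height=2, width=4)
for t in range(2):
    for y in range(2):
        for x in (2, 3):
            edge[t][y][x] = 1.5

weights = normalize_control_maps({"depth": depth, "edge": edge})
print(weights["depth"][0][0])  # [1.0, 1.0, 0.25, 0.25]
print(weights["edge"][0][0])   # [0.0, 0.0, 0.75, 0.75]
```

In this toy example, depth alone drives the left half of every frame, while the right half is dominated by the edge input. Changing the weights from frame to frame would vary the balance over time in the same way.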

Examples#

Cosmos-Transfer1-7B#

Cosmos-Transfer1-7B [LiDAR|HDMap]#