Cosmos-Transfer2.5#
Cosmos-Transfer2.5 is a multi-controlnet model designed to accept structured input from multiple video modalities, including RGB, depth, segmentation, and more. Users can configure generation using JSON-based controlnet_specs and run inference with just a few commands. It supports single-video inference, automatic control map generation, and multi-GPU setups.
Cosmos-Transfer2.5 can be used to generate training data via two data augmentation workflows: Simulations to Photorealism and Scale World State Diversity.
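To make the JSON-based controlnet_specs idea more concrete, here is a minimal Python sketch that assembles a spec and serializes it to JSON. The key names (`prompt`, `video_path`, `control_path`, `control_weight`), the modality identifiers, and the helper `build_controlnet_spec` are all hypothetical and chosen only for illustration; the actual schema is defined by the Cosmos-Transfer2.5 inference documentation and may differ.

```python
import json

# Hypothetical modality identifiers; the text above only names RGB, depth,
# and segmentation, so these strings are an assumption.
ALLOWED_MODALITIES = {"rgb", "depth", "segmentation"}


def build_controlnet_spec(prompt, video_path, controls):
    """Assemble an illustrative spec dictionary.

    `controls` maps a modality name to a dict with a "control_weight" and an
    optional "control_path". Every key name here is an assumption, not the
    documented Cosmos-Transfer2.5 schema.
    """
    spec = {"prompt": prompt, "video_path": video_path}
    for modality, settings in controls.items():
        if modality not in ALLOWED_MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        weight = float(settings.get("control_weight", 1.0))
        if weight < 0:
            raise ValueError(f"control_weight must be non-negative: {weight}")
        entry = {"control_weight": weight}
        # In this sketch, leaving out control_path stands in for
        # "let the pipeline compute the control map automatically".
        if "control_path" in settings:
            entry["control_path"] = settings["control_path"]
        spec[modality] = entry
    return spec


spec = build_controlnet_spec(
    prompt="Two robotic arms interacting with a piece of blue fabric.",
    video_path="input.mp4",  # placeholder path
    controls={
        "depth": {"control_weight": 0.5},
        "segmentation": {"control_path": "seg.mp4", "control_weight": 1.0},
    },
)
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Keeping one entry per modality mirrors the multi-controlnet design described above: each control signal carries its own weight, and a file written from `spec_json` could then be passed to whatever inference command the official documentation specifies.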
Simulations to Photorealism#
This workflow minimizes the need to achieve high fidelity in 3D simulation.
Input prompt#
The video is a demonstration of robotic manipulation, likely in a laboratory or testing environment. It features two robotic arms interacting with a piece of blue fabric.
The setting is a room with a beige couch in the background, providing a neutral backdrop for the robotic activity. The robotic arms are positioned on either side of the fabric, which is placed on a yellow cushion. The left robotic arm is white with a black gripper, while the right arm is black with a more complex, articulated gripper. At the beginning, the fabric is laid out on the cushion. The left robotic arm approaches the fabric, its gripper opening and closing as it positions itself. The right arm remains stationary initially, poised to assist. As the video progresses, the left arm grips the fabric, lifting it slightly off the cushion. The right arm then moves in, its gripper adjusting to grasp the opposite side of the fabric. Both arms work in coordination, lifting and holding the fabric between them. The fabric is manipulated with precision, showcasing the dexterity and control of the robotic arms. The camera remains static throughout, focusing on the interaction between the robotic arms and the fabric, allowing viewers to observe the detailed movements and coordination involved in the task.
Input Video#
Computed Control#
Output Video#
Scale World State Diversity#
This workflow leverages sensor-captured RGB or ground-truth augmentations.
Input prompt#
The video is a driving scene through a modern urban environment, likely captured from a dashcam or a similar fixed camera setup inside a vehicle.
The scene unfolds on a wide, multi-lane road flanked by tall, modern buildings with glass facades. The road is relatively empty, with only a few cars visible, including a black car directly ahead of the camera, maintaining a steady pace. The camera remains static, providing a consistent view of the road and surroundings as the vehicle moves forward. On the left side of the road, there are several trees lining the sidewalk, providing a touch of greenery amidst the urban setting. Pedestrians are visible on the sidewalks, some walking leisurely, while others stand near the buildings. The buildings are a mix of architectural styles, with some featuring large glass windows and others having more traditional concrete exteriors. A few commercial signs and logos are visible on the buildings, indicating the presence of businesses and offices. Traffic cones are placed on the road ahead, suggesting some form of roadwork or lane closure, guiding the vehicles to merge or change lanes. The road markings are clear, with white arrows indicating the direction of travel. The sky is clear, suggesting a sunny day, which enhances the visibility of the scene. Throughout the video, the vehicle maintains a steady speed, and the camera captures the gradual approach towards the intersection, where the road splits into different directions. The overall atmosphere is calm and orderly, typical of a city during non-peak hours.