DreamFusion

DreamFusion performs text-to-3D synthesis using a pretrained 2D text-to-image diffusion model. It introduces a loss based on probability density distillation that allows the 2D diffusion model to act as a prior for optimizing a parametric image generator.

Using this loss in a DeepDream-like procedure, the model optimizes a randomly-initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment. This approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.
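Concretely, the distillation gradient is computed with Score Distillation Sampling (SDS): a rendering is noised at a random timestep, the frozen diffusion model predicts that noise, and the weighted residual between predicted and injected noise is backpropagated through the renderer only, skipping the diffusion model's Jacobian. Below is a minimal PyTorch sketch of this update on a toy 2D image; the renderer and noise predictor are placeholder lambdas, not the actual NeMo or Stable Diffusion APIs.

```python
# Minimal, self-contained sketch of the Score Distillation Sampling (SDS)
# update. `render` and `eps_phi` are toy stand-ins (assumptions) for the
# differentiable NeRF renderer and the frozen diffusion model's noise
# predictor; they are not the NeMo API.
import torch

image_params = torch.randn(1, 3, 64, 64, requires_grad=True)  # stands in for NeRF weights
render = lambda p: torch.sigmoid(p)            # toy differentiable "renderer"
eps_phi = lambda x_t, t, y: x_t - y            # toy frozen noise predictor

alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # toy noise schedule
text_emb = torch.zeros(1, 3, 64, 64)                # dummy text conditioning

opt = torch.optim.Adam([image_params], lr=1e-2)
for _ in range(100):
    x = render(image_params)                   # rendering (random camera in the real model)
    t = torch.randint(20, 980, (1,))           # random diffusion timestep
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps  # forward-diffuse the rendering
    with torch.no_grad():
        eps_hat = eps_phi(x_t, t, text_emb)    # frozen model predicts the noise
    w = 1 - a                                  # timestep weighting w(t)
    # SDS gradient: w(t) * (eps_hat - eps), injected through the renderer
    # while skipping the diffusion model's Jacobian.
    opt.zero_grad()
    x.backward(gradient=w * (eps_hat - eps))
    opt.step()
```

In the real pipeline, the rendering comes from the NeRF at a random camera pose and the noise predictor is the frozen Stable Diffusion UNet conditioned on the text embedding.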


  • Notable differences from the paper:

    • We use Stable Diffusion as the guidance model, while the paper uses Imagen.

    • The NeRF model is trained in latent space for the first 20,000 iterations, then in RGB space for the remainder of the training run; a sketch of this schedule appears after this list.

    • The NeRF and renderer implementations differ from those in the paper; we provide multiple backends for each (see the feature table below).

    • The training schedule, learning rates, optimizer, and other hyperparameters also differ from those used in the paper.

  • This model builds on a number of research papers and open-source projects, including the DreamFusion paper, Stable Diffusion, and the renderer and NeRF backends listed in the feature table below.
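To make the latent-to-RGB schedule from the list above concrete, the sketch below shows one way the guidance input could switch spaces at iteration 20,000. The function and argument names are hypothetical placeholders, not NeMo configuration keys or APIs.

```python
LATENT_ITERS = 20_000  # stage boundary described in the list above

def guidance_input(iteration, render_latent, render_rgb, vae_encode):
    """Pick the tensor the diffusion guidance sees at this iteration.

    All three callables are hypothetical placeholders for the real
    renderer and VAE encoder; the actual NeMo names may differ.
    """
    if iteration < LATENT_ITERS:
        # Stage 1: the NeRF emits 4-channel features that are treated
        # directly as Stable Diffusion latents, so no VAE encode is needed.
        return render_latent()
    # Stage 2: the NeRF emits RGB renderings, which must be encoded into
    # the diffusion model's latent space before computing the loss.
    return vae_encode(render_rgb())
```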



Feature                               Support
Data parallelism                      Yes
Tensor parallelism                    No
Sequence parallelism                  No
Activation checkpointing              Yes
FP32/TF32                             Yes
AMP/BF16                              Yes
BF16 O2                               No
TransformerEngine/FP8                 No
Multi-GPU                             Yes
Multi-Node                            No
Inference deployment                  N/A
SW stack support                      Slurm DeepOps / Base Command Manager / Base Command Platform
NVfuser                               No
Distributed Optimizer                 No
TorchInductor                         Yes
Flash Attention                       Yes
TorchNGP renderer                     Yes
NerfAcc renderer                      Yes
TCNN NeRF backend                     Yes
HuggingFace Stable Diffusion backend  Yes
NeMo Stable Diffusion backend         Yes
NeMo-TRT Stable Diffusion backend     Yes