InstructPix2Pix

InstructPix2Pix is a method for editing images from human-written instructions. Given an input image and a text instruction describing the desired change, the model produces an edited version of that image.

NeMo Multimodal provides a training pipeline for conditional diffusion models using the edit dataset. It also provides an inference tool that generates edited images from user-written instructions.
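At inference time, InstructPix2Pix-style models are typically sampled with classifier-free guidance over both conditions, using one scale for the input image and one for the text instruction. The following is a minimal NumPy sketch of that combination step only. It is not NeMo's implementation: the function name `combine_guidance`, its parameters, and the default scale values are illustrative assumptions, and the three noise predictions are assumed to come from a denoiser evaluated with no conditioning, image conditioning only, and image plus text conditioning.

```python
import numpy as np


def combine_guidance(eps_uncond, eps_image, eps_full,
                     image_scale=1.5, text_scale=7.5):
    """Combine three noise predictions with separate guidance scales.

    Hypothetical helper, for illustration only.
    eps_uncond: prediction with neither image nor text conditioning
    eps_image:  prediction with image conditioning only
    eps_full:   prediction with both image and text conditioning
    """
    eps_uncond = np.asarray(eps_uncond, dtype=np.float64)
    eps_image = np.asarray(eps_image, dtype=np.float64)
    eps_full = np.asarray(eps_full, dtype=np.float64)

    # Push toward the input image, then toward the instruction.
    image_term = image_scale * (eps_image - eps_uncond)
    text_term = text_scale * (eps_full - eps_image)
    return eps_uncond + image_term + text_term


# Toy predictions standing in for real denoiser outputs (assumed shapes).
shape = (4, 8, 8)
guided = combine_guidance(np.zeros(shape), np.ones(shape), 2 * np.ones(shape))
```

A larger `image_scale` keeps the result closer to the input image, while a larger `text_scale` applies the instruction more strongly. With both scales set to 1, the result reduces to the fully conditioned prediction.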

Feature	Training	Inference
Data parallelism	Yes	N/A
Tensor parallelism	No	No
Pipeline parallelism	No	No
Sequence parallelism	No	No
Activation checkpointing	No	No
FP32/TF32	Yes	Yes (FP16 enabled by default)
AMP/FP16	Yes	Yes
AMP/BF16	Yes	No
BF16 O2	No	No
TransformerEngine/FP8	No	No
Multi-GPU	Yes	Yes
Multi-Node	Yes	Yes
Inference deployment	N/A	NVIDIA Triton supported
SW stack support	Slurm DeepOps/Base Command Manager/Base Command Platform	Slurm DeepOps/Base Command Manager/Base Command Platform
NVfuser	No	N/A
Distributed Optimizer	No	N/A
TorchInductor	Yes	N/A
Flash Attention	Yes	N/A