InstructPix2Pix

InstructPix2Pix [MM-MODELS-INSP2P1] edits images by following human-written instructions. Given an input image and a text instruction, the model modifies the image accordingly. NeMo Multimodal provides a training pipeline for this conditional diffusion model, using a dataset generated by combining two large pretrained models: a language model (GPT-3) and a text-to-image model (Stable Diffusion). InstructPix2Pix edits images within seconds and does not require per-example fine-tuning or inversion. The original paper reports compelling results across a wide variety of input images and written instructions.

NeMo’s InstructPix2Pix is built on the Stable Diffusion framework and shares its architecture (refer to Stable Diffusion). It differs in its training dataset and in combining guidance from both the image and the text prompt. Specifically, the InstructPix2Pix class nemo.collections.multimodal.models.instruct_pix2pix.ldm.ddpm_edit.MegatronLatentDiffusionEdit is derived directly from Stable Diffusion’s class nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion, with changes to accommodate the dataset and to support dual guidance.
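The dual guidance mentioned above can be sketched as follows. This is a minimal illustration of the classifier-free guidance rule described in the InstructPix2Pix paper, not NeMo’s implementation. The function name dual_guidance and its argument names are hypothetical, and the three noise predictions are assumed to come from UNet passes with no conditioning, image-only conditioning, and image-plus-text conditioning.

```python
def dual_guidance(eps_uncond, eps_image, eps_full, image_scale, text_scale):
    """Combine three noise predictions using separate image and text scales.

    eps_uncond: prediction with neither image nor text conditioning
    eps_image:  prediction with image conditioning only
    eps_full:   prediction with both image and text conditioning
    (Hypothetical names; any type supporting +, - and * works, e.g. tensors.)
    """
    image_term = image_scale * (eps_image - eps_uncond)  # pull toward the input image
    text_term = text_scale * (eps_full - eps_image)      # pull toward the instruction
    return eps_uncond + image_term + text_term


# Scalars stand in for tensors to keep the sketch dependency-free.
guided = dual_guidance(1.0, 2.0, 4.0, image_scale=1.5, text_scale=7.5)
print(guided)
```

With both scales set to 1, the expression reduces to the fully conditioned prediction. Raising the image scale keeps the output closer to the input image, while raising the text scale strengthens the edit.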

The dataset for NeMo’s InstructPix2Pix model stands out among NeMo multimodal models, as it doesn’t mandate data storage in the webdataset format. Users are advised to verify the dataset’s content, assess the relevant licenses, and ensure its appropriateness for their use. Before downloading, it’s essential to review any links associated with the dataset.

For instructions on downloading and preparing the custom dataset for training InstructPix2Pix, refer to the official Instruct-Pix2Pix Repository.

Data Configuration

    data:
      data_path: ???
      num_workers: 2

  • data_path: Path to the instruct-pix2pix dataset. Users are required to specify this path. Further details on the dataset are available at Instruct-Pix2Pix Repository.

  • num_workers: Number of worker subprocesses used for data loading.
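In the OmegaConf-style YAML that NeMo configurations use, ??? marks a mandatory value that must be filled in before training. The sketch below, using only the standard library, shows one way to detect placeholders that are still unset. The helper find_missing and the plain-dict representation of the config are assumptions made for illustration; they are not part of NeMo.

```python
def find_missing(cfg, prefix=""):
    """Return dotted keys whose value is still the '???' placeholder."""
    missing = []
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as `data:`.
            missing.extend(find_missing(value, prefix=f"{path}."))
        elif value == "???":
            missing.append(path)
    return missing


# The data section from above, written as a plain dictionary.
config = {"data": {"data_path": "???", "num_workers": 2}}
missing = find_missing(config)
print(missing)
```

Once data_path is set to a real dataset location, the same check returns an empty list.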

Essential Model Configuration

    model:
      first_stage_key: edited
      cond_stage_key: edit # txt for cifar, caption for pbss
      unet_config:
        _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel
        from_pretrained:
        in_channels: 8

  • first_stage_key: Key for the model’s initial processing stage. Set to edited for InstructPix2Pix.

  • cond_stage_key: Key for the model’s conditional stage. Set to edit for InstructPix2Pix.

  • unet_config: Configuration parameters for the UNet model within the NeMo collection.

      - _target_: Designates the target module for the UNet model in the NeMo collection.

      - from_pretrained: (Value not provided) Generally indicates the path or identifier of a pretrained model.

      - in_channels: Specifies the number of input channels for the UNet model. Here, the value is set to 8, with the initial 4 channels dedicated to image guidance.

    Additional model configurations align with Stable Diffusion (refer to Stable Diffusion).
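The in_channels value of 8 follows from concatenating two 4-channel latents along the channel dimension. The sketch below only tracks shapes and is not NeMo code. The helper concat_channel_shape, the batch size of 2, and the 32×32 latent resolution are assumptions for illustration, and the channel order simply follows the description above.

```python
def concat_channel_shape(*shapes):
    """Return the NCHW shape produced by concatenating inputs along dim 1."""
    batch, _, height, width = shapes[0]
    for shape in shapes:
        if (shape[0], shape[2], shape[3]) != (batch, height, width):
            raise ValueError(f"incompatible shape for channel concat: {shape}")
    # Only the channel counts add up; batch and spatial sizes are unchanged.
    return (batch, sum(shape[1] for shape in shapes), height, width)


image_guidance = (2, 4, 32, 32)  # encoded input image (assumed sizes)
noisy_latent = (2, 4, 32, 32)    # latent being denoised (assumed sizes)
unet_input = concat_channel_shape(image_guidance, noisy_latent)
print(unet_input)
```

The second element of the resulting shape is the channel count the UNet's first layer must accept, which is why in_channels is 8 rather than the 4 used by text-to-image Stable Diffusion.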

| Feature | Description | To Enable |
|---|---|---|
| Data parallelism | Dataset read concurrently | Automatically when training on multi GPUs/nodes |
| Activation Checkpointing | Reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass | model.unet_config.use_checkpoint=True |
| Bfloat16 Training | Training in Bfloat16 precision | trainer.precision=bf16 |
| Flash Attention | Fast and Memory-Efficient Exact Attention with IO-Awareness | model.unet_config.use_flash_attention=True |
| Channels Last | Ordering NCHW tensors in memory preserving dimensions ordering | model.channels_last=True |
| Inductor | TorchInductor compiler | model.inductor=True |
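The settings in the last column are written as key=value overrides. Assuming the training script is launched through Hydra, which accepts such overrides on the command line, the sketch below shows how a set of chosen options could be turned into an argument list. The helper to_overrides is hypothetical, and no script name is shown because the source does not give one.

```python
def to_overrides(settings):
    """Format a mapping of dotted config keys into key=value override strings."""
    return [f"{key}={value}" for key, value in settings.items()]


# A few of the optimizations listed above (values copied from the listing).
optimizations = {
    "model.unet_config.use_checkpoint": True,
    "trainer.precision": "bf16",
    "model.unet_config.use_flash_attention": True,
}
args = to_overrides(optimizations)
print(" ".join(args))
```

The resulting strings would be appended after the training script name when launching a run.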

[MM-MODELS-INSP2P1] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. 2022. arXiv:2211.09800.

Last updated on Jun 24, 2024.