InstructPix2Pix

Model Introduction

InstructPix2Pix [MM-MODELS-INSP2P1] offers a unique approach to image editing using human-written instructions. Given an input image and a textual directive, the model adjusts the image according to the provided instructions. NeMo Multimodal presents a training pipeline for this conditional diffusion model, utilizing a dataset generated by harnessing the strengths of two prominent pretrained models: a language model (GPT-3) and a text-to-image model (Stable Diffusion). The InstructPix2Pix model operates swiftly, editing images within seconds, without requiring per-example fine-tuning or inversion. It has demonstrated remarkable results across a wide variety of input images and written instructions.

Built upon the Stable Diffusion framework, NeMo's InstructPix2Pix shares its architecture with Stable Diffusion (refer to Stable Diffusion). What sets it apart is its training dataset and the combined guidance from both image and text prompts. Specifically, the InstructPix2Pix class nemo.collections.multimodal.models.instruct_pix2pix.ldm.ddpm_edit.MegatronLatentDiffusionEdit is derived directly from Stable Diffusion's nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion, with alterations to accommodate the dataset and to support dual guidance.
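At sampling time, the image condition and the text instruction each receive their own classifier-free guidance scale. Below is a minimal sketch of how the two guidance terms combine, following the formulation in the InstructPix2Pix paper; the function and tensor names are illustrative, not NeMo's API, and the default scales are typical values from the paper rather than NeMo settings.

import torch

def dual_guidance(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=7.5):
    # eps_uncond: UNet noise prediction with neither image nor text conditioning
    # eps_img:    prediction conditioned on the input image only
    # eps_full:   prediction conditioned on both the image and the edit instruction
    # s_img, s_txt: image and text guidance scales (illustrative defaults)
    return (
        eps_uncond
        + s_img * (eps_img - eps_uncond)
        + s_txt * (eps_full - eps_img)
    )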

Training Dataset

The dataset for NeMo's InstructPix2Pix model stands out among NeMo multimodal models, as it doesn't mandate data storage in the webdataset format. Before downloading, users should review any links associated with the dataset, verify its content, assess the relevant licenses, and ensure it is appropriate for their use.

For instructions on downloading and preparing the custom dataset for training InstructPix2Pix, refer to the official Instruct-Pix2Pix repository.
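Whatever the on-disk layout produced by that repository, each training example pairs an original image with its edited counterpart and the editing instruction. Below is a hypothetical sketch of one example; the field names are chosen to mirror the first_stage_key and cond_stage_key settings shown in the model configuration further down, and the exact schema and filenames are defined by the official repository, not by this sketch.

from PIL import Image

# Hypothetical training example (filenames and dict layout are assumptions).
example = {
    "image": Image.open("before.jpg"),     # original input image
    "edited": Image.open("after.jpg"),     # target image after applying the edit
    "edit": "turn the sky into a sunset",  # human-written instruction
}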

Model Configuration

Data Configuration

data:
  data_path: ???
  num_workers: 2

  • data_path: Path to the instruct-pix2pix dataset. Users must set this value; further details on the dataset are available in the Instruct-Pix2Pix repository.

  • num_workers: The number of worker subprocesses used for data loading.
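NeMo configurations are based on OmegaConf/Hydra, and ??? is OmegaConf's mandatory-value marker: training aborts unless data_path is supplied. Below is a minimal sketch of filling it programmatically; the config filename and dataset path are hypothetical placeholders.

from omegaconf import OmegaConf

# Load the training config and fill in the mandatory data path.
cfg = OmegaConf.load("instruct_pix2pix_config.yaml")  # hypothetical filename
cfg.data.data_path = "/datasets/instruct-pix2pix"     # replaces the ??? marker
cfg.data.num_workers = 4                              # tune for your machine
print(OmegaConf.to_yaml(cfg.data))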

Essential Model Configuration

model:
  first_stage_key: edited
  cond_stage_key: edit # txt for cifar, caption for pbss

  unet_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel
    from_pretrained:
    in_channels: 8

  • first_stage_key: Key for the model's first processing stage. Set to edited for InstructPix2Pix.

  • cond_stage_key: Key for the model's conditioning stage. Set to edit for InstructPix2Pix.

  • unet_config: Configuration for the UNet model within the NeMo collection.

    - _target_: Designates the target module for the UNet model in the NeMo collection.

    - from_pretrained: (Value not provided) Generally indicates the path or identifier of a pretrained checkpoint.

    - in_channels: The number of input channels for the UNet. Here the value is set to 8, with the initial 4 channels dedicated to image guidance, as illustrated in the sketch after the configuration notes.

Additional model configuration options align with Stable Diffusion (refer to Stable Diffusion).
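To make the in_channels: 8 setting concrete, here is a minimal sketch of the channel concatenation, assuming the usual 4-channel Stable Diffusion latent space. Tensor names and shapes are illustrative, and the ordering follows the note above that the first 4 channels carry the image guidance.

import torch

# VAE-encoded guidance image and the noisy diffusion latent, each with the
# usual 4 Stable Diffusion latent channels (shapes are illustrative).
image_latent = torch.randn(1, 4, 64, 64)
noisy_latent = torch.randn(1, 4, 64, 64)

# Concatenating along the channel dimension yields the 8-channel UNet input,
# matching unet_config.in_channels above.
unet_input = torch.cat([image_latent, noisy_latent], dim=1)
assert unet_input.shape[1] == 8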

References

MM-MODELS-INSP2P1

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. 2022. arXiv:2211.09800.