InstructPix2Pix#
Model Introduction#
InstructPix2Pix [MM-MODELS-INSP2P1] offers a unique approach to image editing using human-written instructions. Given an input image and a textual directive, the model adjusts the image according to the provided instructions. NeMo Multimodal presents a training pipeline for this conditional diffusion model, utilizing a dataset generated by harnessing the strengths of two prominent pretrained models: a language model (GPT-3) and a text-to-image model (Stable Diffusion). The InstructPix2Pix model operates swiftly, editing images within seconds, eliminating the need for per-example fine-tuning or inversion. It has demonstrated remarkable results across a wide variety of input images and written instructions.
Built upon the Stable Diffusion framework, NeMo’s InstructPix2Pix shares a similar architecture with Stable Diffusion (refer to Stable Diffusion). What sets it apart is its unique training dataset and the combined guidance from both image and text prompts. Specifically, InstructPix2Pix (nemo.collections.multimodal.models.instruct_pix2pix.ldm.ddpm_edit.MegatronLatentDiffusionEdit) is derived directly from Stable Diffusion’s nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion, with alterations to accommodate the dataset and provide support for dual guidance.
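For reference, dual guidance as described in the InstructPix2Pix paper [MM-MODELS-INSP2P1] applies classifier-free guidance with respect to both the image conditioning and the text conditioning, roughly:

\tilde{e}_\theta(z_t, c_I, c_T) = e_\theta(z_t, \varnothing, \varnothing) + s_I \bigl( e_\theta(z_t, c_I, \varnothing) - e_\theta(z_t, \varnothing, \varnothing) \bigr) + s_T \bigl( e_\theta(z_t, c_I, c_T) - e_\theta(z_t, c_I, \varnothing) \bigr)

where z_t is the noisy latent, c_I and c_T are the image and text conditionings, and s_I and s_T are the corresponding image and text guidance scales.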
Training Dataset#
The dataset for NeMo’s InstructPix2Pix model stands out among NeMo multimodal models, as it doesn’t mandate data storage in the webdataset format. Users are advised to verify the dataset’s content, assess the relevant licenses, and ensure its appropriateness for their use. Before downloading, it’s essential to review any links associated with the dataset.
For instructions on downloading and preparing the custom dataset for training InstructPix2Pix, refer to the official Instruct-Pix2Pix Repository.
Model Configuration#
Data Configuration#
data:
  data_path: ???
  num_workers: 2
data_path: Path to the instruct-pix2pix dataset. Users are required to specify this path. Further details on the dataset are available at the Instruct-Pix2Pix Repository.
num_workers: Number of worker subprocesses used by the data loader.
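For illustration, a filled-in data section might look like the following; the dataset path is hypothetical and should point to wherever the prepared instruct-pix2pix data resides:

data:
  data_path: /path/to/instruct-pix2pix-dataset  # hypothetical location of the prepared dataset
  num_workers: 2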
Essential Model Configuration#
model:
  first_stage_key: edited
  cond_stage_key: edit # txt for cifar, caption for pbss
  unet_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel
    from_pretrained:
    in_channels: 8
first_stage_key: Key for the model’s first processing stage. Set to edited for InstructPix2Pix.
cond_stage_key: Key for the model’s conditioning stage. Set to edit for InstructPix2Pix.
unet_config: Configuration parameters for the UNet model within the NeMo collection.
_target_: Target module (class path) of the UNet model in the NeMo collection.
from_pretrained: (No value provided here.) Path or identifier of a pretrained model to initialize from.
in_channels: Number of input channels for the UNet model. Here the value is 8, with the first 4 channels dedicated to image guidance.
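As a sketch, the same model block with from_pretrained filled in could look like the following; the checkpoint path is purely illustrative and should be replaced with the Stable Diffusion UNet weights you intend to initialize from:

model:
  first_stage_key: edited
  cond_stage_key: edit
  unet_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel
    from_pretrained: /path/to/sd_unet_weights.ckpt  # hypothetical pretrained UNet checkpoint
    in_channels: 8  # 4 noisy-latent channels + 4 image-conditioning channels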
Additional model configurations align with Stable Diffusion (refer to Stable Diffusion).
References#
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. 2022. arXiv:2211.09800.