Multimodal API

class nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.nlp.models.nlp_model.NLPModel

Megatron base class. All NeMo Megatron models inherit from this class.

  • Initialize the model parallel world for NeMo.

  • Turn on all of the NVIDIA optimizations.

  • If cfg.tokenizer is available, it loads the tokenizer and pads the vocab to the correct size for tensor model parallelism.

  • If using the distributed optimizer, configure it to be compatible with O2-level optimizations and/or model parallelism.

  • Perform gradient clipping: grad_clip_pl_default triggers the PyTorch Lightning default implementation, with_distributed_adam triggers the distributed optimizer’s implementation, megatron_amp_O2 triggers gradient clipping on the main grads, and otherwise gradient clipping is performed on the model grads.

__init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer, no_lm_init=True)

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – PyTorch Lightning Trainer instance
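
For illustration, a cfg of this shape can be built with OmegaConf. This is a layout sketch only: the keys inside each sub-config below are placeholder assumptions, since the exact schema depends on the concrete model and dataset.

    from omegaconf import OmegaConf

    # Layout sketch only; sub-config contents are illustrative placeholders.
    cfg = OmegaConf.create({
        "train_ds": {"batch_size": 4},        # hypothetical dataset options
        "validation_ds": {"batch_size": 4},
        "test_ds": {"batch_size": 4},
        "optim": {
            "name": "fused_adam",             # assumed optimizer name
            "lr": 1e-4,
            "sched": {"name": "CosineAnnealing"},  # assumed scheduler block
        },
    })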

class nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm.MegatronLatentDiffusion(*args: Any, **kwargs: Any)

Bases: nemo.collections.nlp.parts.mixins.nlp_adapter_mixins.NLPAdapterModelMixin, nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel

Megatron LatentDiffusion Model.

__init__(cfg: omegaconf.DictConfig, trainer: pytorch_lightning.Trainer)

Base class from which all NeMo models should inherit

Parameters
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – PyTorch Lightning Trainer instance
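
A minimal training sketch, assuming a YAML config in the usual NeMo layout; the file name sd_train.yaml, its model: section, and the plain Lightning Trainer are placeholders (real recipes construct the trainer with NeMo's Megatron-aware strategy).

    import pytorch_lightning as pl
    from omegaconf import OmegaConf
    from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm import (
        MegatronLatentDiffusion,
    )

    cfg = OmegaConf.load("sd_train.yaml")        # placeholder config file
    trainer = pl.Trainer(devices=1, accelerator="gpu", precision=16)  # simplified; see note above

    model = MegatronLatentDiffusion(cfg.model, trainer)  # assumes a model: section in the YAML
    trainer.fit(model)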

setup(stage=None)

PTL hook that is executed after DDP spawns.

We set up datasets here, as Megatron datasets require DDP to instantiate. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(batch)

Notice: training_step used to have the following signature to support pipeline parallelism:

def training_step(self, dataloader_iter, batch_idx):


However, the full-iteration CUDA Graph callback is not currently compatible with this signature, because we need to wrap the dataloader to generate static tensors outside the CUDA Graph. That signature moves next(dataloader) into the CUDA Graph capturing region, so it has been disabled.

Our dataloaders produce a micro-batch, and we then fetch a number of microbatches from the dataloader, depending on the global batch size and model parallel size, to produce a list of microbatches. The batch should be a list of microbatches, and those microbatches should be on CPU. Microbatches are then moved to GPU during the pipeline. The list of microbatches is then piped through the pipeline using the Apex fwd/bwd functions.
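
The number of microbatches drawn per step follows the usual Megatron bookkeeping; a small arithmetic sketch (all sizes below are illustrative):

    # Illustrative values only.
    global_batch_size = 64
    micro_batch_size = 4
    data_parallel_size = 2    # world_size // (tensor_parallel_size * pipeline_parallel_size)

    # Microbatches fetched from the dataloader for one training_step:
    num_microbatches = global_batch_size // (micro_batch_size * data_parallel_size)
    print(num_microbatches)   # 8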

class nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

The full UNet model with attention and timestep embedding.

Parameters
  • in_channels (int) – The number of channels in the input Tensor.

  • model_channels (int) – The base channel count for the model.

  • out_channels (int) – The number of channels in the output Tensor.

  • num_res_blocks (int) – The number of residual blocks per downsample.

  • attention_resolutions (set/list/tuple) – The downsampling rates at which attention is applied. For example, if this includes 4, attention is used at 4x downsampling.

  • dropout (float) – The dropout probability.

  • channel_mult (list/tuple) – A channel multiplier for each level of the UNet.

  • conv_resample (bool) – If True, use learned convolutions for upsampling and downsampling.

  • dims (int) – Determines if the signal is 1D, 2D, or 3D.

  • num_classes (int, optional) – If specified, the model becomes class-conditional with the given number of classes.

  • use_checkpoint (bool) – If True, use gradient checkpointing to reduce memory usage.

  • num_heads (int) – The number of attention heads in each attention layer.

  • num_heads_channels (int, optional) – If specified, overrides num_heads and uses a fixed channel width per attention head.

  • num_heads_upsample (int, optional) – Sets a different number of heads for upsampling. Deprecated.

  • use_scale_shift_norm (bool) – If True, use a FiLM-like conditioning mechanism.

  • resblock_updown (bool) – If True, use residual blocks for up/downsampling.

  • use_new_attention_order (bool) – If True, use a different attention pattern for potentially increased efficiency.
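
A construction sketch with Stable Diffusion-like hyperparameters; the values are illustrative, and the real constructor may accept additional conditioning-related keyword arguments not documented here.

    from nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel import UNetModel

    # Illustrative values; assumes only the documented parameters are required.
    unet = UNetModel(
        in_channels=4,                    # latent channels from the VAE
        model_channels=320,
        out_channels=4,
        num_res_blocks=2,
        attention_resolutions=(4, 2, 1),
        dropout=0.0,
        channel_mult=(1, 2, 4, 4),
        num_heads=8,
        use_checkpoint=False,
    )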

class nemo.collections.multimodal.modules.imagen.diffusionmodules.nets.UNetModel(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

The full UNet model with attention and timestep embedding used for Imagen Base and SR model.

Parameters
  • embed_dim – Dimension of embeddings. Also used to calculate the number of channels in ResBlock.

  • image_size – Input image size. Used to calculate where to inject attention layers in UNet.

  • channels – Input channel number, defaults to 3.

  • text_embed_dim – Dimension of conditioned text embedding. Different text encoders and different model versions have different values; defaults to 512.

  • num_res_blocks – Number of ResBlock in each level of UNet, defaults to 3.

  • channel_mult – Used with embed_dim to calculate the number of channels for each level of UNet, defaults to [1, 2, 3, 4]

  • num_attn_heads – The number of heads in the attention layer, defaults to 4.

  • per_head_channels – The number of channels per attention head, defaults to 64.

  • cond_dim – Dimension of Conditioning projections, defaults to 512.

  • attention_type – Type of attention layer, defaults to ‘fused’.

  • feature_pooling_type – Type of pooling, defaults to ‘attention’.

  • learned_sinu_pos_emb_dim – Dimension of learned time positional embedding. 0 for unlearned timestep embeddings. Defaults to 16

  • attention_resolutions – List of resolutions to inject attention layers. Defaults to [8, 16, 32]

  • dropout – The rate of dropout, defaults to 0.

  • use_null_token – Whether to create a learned null token for attention, defaults to False.

  • init_conv_kernel_size – Initial Conv kernel size, defaults to 3.

  • gradient_checkpointing – Whether to use gradient checkpointing, defaults to False.

  • scale_shift_norm – Whether to use scale shift norm, defaults to False.

  • stable_attention – Whether to use numerically-stable attention calculation, defaults to True.

  • flash_attention – Whether to use flash attention calculation, defaults to False.

  • resblock_updown – Whether to use ResBlock or Downsample/Upsample, defaults to False.

  • resample_with_conv – When resblock_updown=False, whether to use conv in addition to Pooling&ConvTranspose. Defaults to True.

  • low_res_cond – Whether conditioned on low-resolution input, used for SR model. Defaults to False.

  • noise_cond_aug – Whether to add noise conditioned augmentation with low-resolution input. Defaults to False.
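
A construction sketch for an Imagen base-resolution UNet, restricted to the documented parameters with their stated defaults; the embed_dim and image_size values are illustrative assumptions.

    from nemo.collections.multimodal.modules.imagen.diffusionmodules.nets import UNetModel

    # Illustrative configuration; assumes the documented keywords are accepted as-is.
    base_unet = UNetModel(
        embed_dim=256,
        image_size=64,
        channels=3,
        text_embed_dim=512,
        num_res_blocks=3,
        channel_mult=[1, 2, 3, 4],
        num_attn_heads=4,
        attention_resolutions=[8, 16, 32],
        flash_attention=False,
    )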

class nemo.collections.multimodal.modules.imagen.diffusionmodules.nets.EfficientUNetModel(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

The full Efficient UNet model with attention and timestep embedding used for Imagen SR model.

Parameters
  • embed_dim – Dimension of embeddings. Also used to calculate the number of channels in ResBlock.

  • image_size – Input image size. Used to calculate where to inject attention layers in UNet.

  • channels – Input channel number, defaults to 3.

  • text_embed_dim – Dimension of conditioned text embedding. Different text encoders and different model versions have different values; defaults to 512.

  • channel_mult – Used with embed_dim to calculate the number of channels for each level of UNet, defaults to [1, 1, 2, 4, 8].

  • num_attn_heads – The number of heads in the attention layer, defaults to 8.

  • per_head_channels – The number of channels per attention head, defaults to 64.

  • attention_type – Type of attention layer, defaults to ‘fused’.

  • atnn_enabled_at – Whether to enable attention at each level, defaults to [0, 0, 0, 0, 1].

  • feature_pooling_type – Type of pooling, defaults to ‘attention’.

  • stride – Stride in ResBlock, defaults to 2.

  • num_resblocks – The number of residual blocks at each level of the Efficient-UNet. Defaults to [1, 2, 4, 8, 8].

  • learned_sinu_pos_emb_dim – Dimension of learned time positional embedding. 0 for unlearned timestep embeddings. Defaults to 16

  • use_null_token – Whether to create a learned null token for attention, defaults to False.

  • init_conv_kernel_size – Initial Conv kernel size, defaults to 3.

  • gradient_checkpointing – Whether to use gradient checkpointing, defaults to False.

  • scale_shift_norm – Whether to use scale shift norm, defaults to False.

  • stable_attention – Whether to use numerically-stable attention calculation, defaults to True.

  • flash_attention – Whether to use flash attention calculation, defaults to False.

  • skip_connection_scaling – Whether to use 1/sqrt(2) scaling for ResBlock skip connection, defaults to False.

  • noise_cond_aug – Whether to add noise conditioned augmentation with low-resolution input. Defaults to False.
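
Similarly, a sketch for the SR-stage Efficient UNet, again using only the documented parameters; the embed_dim and image_size values are illustrative assumptions.

    from nemo.collections.multimodal.modules.imagen.diffusionmodules.nets import EfficientUNetModel

    # Illustrative configuration; assumes the documented keywords are accepted as-is.
    sr_unet = EfficientUNetModel(
        embed_dim=128,
        image_size=256,
        channel_mult=[1, 1, 2, 4, 8],
        num_resblocks=[1, 2, 4, 8, 8],
        atnn_enabled_at=[0, 0, 0, 0, 1],
        noise_cond_aug=True,
    )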

class nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKL(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule

__init__(ddconfig, embed_dim, lossconfig=None, ckpt_path=None, ignore_keys=[], image_key='image', colorize_nlabels=None, monitor=None, from_pretrained: Optional[str] = None)

decode(z)

Decode latent representation back to pixel space.

encode(x)

Encode input image in pixel space to latent representation.
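
A hypothetical round trip through the autoencoder; the ddconfig layout, the batch shape, and the .sample() call on the returned posterior are assumptions based on the usual latent-diffusion VAE interface, not a documented recipe.

    import torch
    from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder import AutoencoderKL

    # ddconfig follows the usual latent-diffusion VAE layout; all values are illustrative.
    ddconfig = {
        "double_z": True, "z_channels": 4, "resolution": 256,
        "in_channels": 3, "out_ch": 3, "ch": 128, "ch_mult": [1, 2, 4, 4],
        "num_res_blocks": 2, "attn_resolutions": [], "dropout": 0.0,
    }
    vae = AutoencoderKL(ddconfig=ddconfig, embed_dim=4)

    images = torch.randn(2, 3, 256, 256)   # pixel-space batch (illustrative shape)
    posterior = vae.encode(images)          # latent posterior (assumed interface)
    latents = posterior.sample()            # draw a latent sample
    recon = vae.decode(latents)             # back to pixel space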

class nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenMegatronCLIPEmbedder(*args: Any, **kwargs: Any)

Bases: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.AbstractEmbModel

__init__(restore_from_path, device='cuda', layer='last', freeze=True, cfg=None, always_return_pooled=False, enable_lora_finetune=False)

forward(text)

Get embeddings from input text
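
A usage sketch; the checkpoint path is a placeholder, and the forward call is assumed to accept a list of prompt strings.

    from nemo.collections.multimodal.modules.stable_diffusion.encoders.modules import FrozenMegatronCLIPEmbedder

    # "megatron_clip.nemo" is a placeholder for a real CLIP checkpoint path.
    embedder = FrozenMegatronCLIPEmbedder(
        restore_from_path="megatron_clip.nemo",
        device="cuda",
        layer="last",
        freeze=True,
    )
    text_emb = embedder(["a photo of an astronaut riding a horse"])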

class nemo.collections.multimodal.modules.imagen.encoder.t5encoder.T5Encoder(*args: Any, **kwargs: Any)

Bases: torch.nn.Module

__init__(max_seq_len=512, encoder_path=None)

Initialize the T5 Encoder.

Parameters
  • max_seq_len – Maximum token length, defaults to 512

  • encoder_path – Optional path to a T5 encoder already stored on disk; defaults to None.

encode(text_batch, device='cuda')

Encode a batch of text to T5 embeddings.
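
A usage sketch; leaving encoder_path as None assumes the T5 weights are fetched rather than loaded from disk, and the exact structure of the returned value is not documented here.

    from nemo.collections.multimodal.modules.imagen.encoder.t5encoder import T5Encoder

    t5 = T5Encoder(max_seq_len=512)                                      # weights fetched, not loaded from disk
    encoded = t5.encode(["a corgi wearing sunglasses"], device="cuda")   # T5 embeddings for the batch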

class nemo.collections.multimodal.data.common.webdataset.WebDatasetCommon(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.dataset.IterableDataset

A common dataset object shared by most NeMo multimodal models.

class nemo.collections.multimodal.data.dreambooth.dreambooth_dataset.DreamBoothDataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.Dataset

A dataset to prepare the instance and class images with the prompts for fine-tuning the model. It pre-processes the images and tokenizes the prompts.

Parameters
  • instance_data_root – required; a directory with image files of the object

  • instance_prompt – captions with special token associated with instance images

  • with_prior_preservation – whether to regularize the model finetuning with the original inference output from the backbone

  • reg_data_root – a directory to save inference images from the backbone

  • reg_prompt – prompt used to generate regularization images

  • size – the size to which images are resized in the training data pipeline

  • center_crop – whether to perform center cropping on input images

  • load_cache_latents – when set to True, images will be converted to cached latents which will be directly loaded for training

  • vae – VAE instance to encode images from pixel space to latent space
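
An instantiation sketch restricted to the documented parameters; the directory paths and prompt wording are placeholders.

    from nemo.collections.multimodal.data.dreambooth.dreambooth_dataset import DreamBoothDataset

    dataset = DreamBoothDataset(
        instance_data_root="/data/instance_images",    # placeholder path
        instance_prompt="a photo of sks dog",          # placeholder prompt with special token
        with_prior_preservation=True,
        reg_data_root="/data/regularization_images",   # placeholder path
        reg_prompt="a photo of a dog",
        size=512,
        center_crop=True,
        load_cache_latents=False,
        vae=None,   # a VAE instance is needed when caching latents
    )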
