
Flux#

Flux is a 12-billion-parameter diffusion model based on transformers, designed to generate high-quality images from text prompts. The NeMo implementation of Flux re-implements all transformer layers with Megatron Core and is enhanced by distributed training with custom FSDP and various parallelisms (coming soon with the latest updates of Megatron-LM).

Import from Hugging Face to NeMo 2.0#

The Hugging Face checkpoint can be found at FLUX.1-dev. To run the Flux training or inference pipeline in NeMo, download the checkpoint locally and provide its path in the model config. It will be converted automatically to match the NeMo structure.
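
For example, the checkpoint can be fetched with the Hugging Face CLI (a minimal sketch; the local directory is a placeholder, and gated repositories such as FLUX.1-dev may require accepting the license and running huggingface-cli login first):

huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir /path/to/FLUX.1-dev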

Training#

For training, provide the checkpoint path to the model config in a preset recipe as follows:

import nemo_run as run
from nemo.collections import llm
# FluxConfig, T5Config, ClipConfig, and AutoEncoderConfig come from NeMo's
# diffusion collection; flux_training is the base recipe defined in flux_training.py.

@run.cli.factory(target=llm.train)
def convergence_test() -> run.Partial:
    recipe = flux_training()
    recipe.model.flux_params.flux_config = run.Config(
        FluxConfig,
        ckpt_path='/path/to/FLUX.1-dev/transformer',
        do_convert_from_hf=True,
        save_converted_model_to='/ckpts'
    )
    recipe.model.flux_params.t5_params = run.Config(
        T5Config, version='/ckpts/text_encoder_2'
    )
    recipe.model.flux_params.clip_params = run.Config(
        ClipConfig, version='/ckpts/text_encoder'
    )
    recipe.model.flux_params.vae_config = run.Config(
        AutoEncoderConfig, ckpt='/ckpts/ae.safetensors', ch_mult=[1, 2, 4, 4], attn_resolutions=[]
    )
    return recipe

The recipe above saves the converted checkpoint to the path specified in save_converted_model_to and names it nemo_flux_transformer.safetensors. If save_converted_model_to is not provided, the conversion is still performed, but the converted checkpoint is not saved.

In later runs, when the path to nemo_flux_transformer.safetensors is provided in flux_config, do_convert_from_hf should be set to False.
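
For example, a later run could point flux_config directly at the converted checkpoint, reusing only the parameters shown above:

recipe.model.flux_params.flux_config = run.Config(
    FluxConfig,
    ckpt_path='/ckpts/nemo_flux_transformer.safetensors',
    do_convert_from_hf=False,
)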

The checkpoints for the other components, such as the Variational Autoencoder (VAE), CLIP, and T5, are set similarly in the recipe, as shown in the example above.

Inference#

For Flux inference, the provided script also accepts Hugging Face checkpoints, similar to the training scripts.

Explanation of Parameters#

  • flux_ckpt: Path to the Flux transformer checkpoint. Accepted formats include the safetensors files in the transformer folder downloaded from the link above (e.g., /FLUX.1-dev/transformer/) or converted safetensors in NeMo format.

  • do_convert_from_hf: To convert the downloaded checkpoint into NeMo format, pass --do_convert_from_hf as a command-line argument.

  • save_converted_model_to: The converted Flux transformer weights are saved locally to /ckpts by default. Users can override this path with --save_converted_model_to.

  • vae_ckpt: Path to the VAE checkpoint, which can be found at the link above at FLUX.1-dev/ae.safetensors.

  • clip_version: Version of the CLIP text encoder used to process input prompts, such as openai/clip-vit-large-patch14, which is downloaded automatically. Alternatively, you can provide the path to a locally stored CLIP checkpoint, for example, FLUX.1-dev/text_encoder.

  • t5_version: Version of the T5 text encoder used to process input prompts, such as google/t5-v1_1-xxl, which is downloaded automatically. Alternatively, you can provide the path to a locally stored T5 checkpoint, for example, FLUX.1-dev/text_encoder_2.

  • width & height: The dimensions of the final output image.

  • inference_steps: Number of denoising steps to run for each prompt.

  • num_images_per_prompt: Number of images generated for each prompt.

  • prompts: Text prompts for image generation. When providing multiple prompts in a single run, separate them with commas.

  • guidance: The guidance scale, which controls the strength of prompt conditioning.

Example Usage of the Script#

torchrun flux_infer.py --flux_ckpt /ckpts/nemo_flux_transformer.safetensors \
    --prompts 'A cat holding a sign that says hello world' \
    --inference_steps 50 --guidance 3.5
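
If you start from the downloaded Hugging Face checkpoint instead, the conversion flags described above can be passed in the same call (a sketch; paths are placeholders):

torchrun flux_infer.py --flux_ckpt /path/to/FLUX.1-dev/transformer \
    --do_convert_from_hf --save_converted_model_to /ckpts \
    --prompts 'A cat holding a sign that says hello world' \
    --inference_steps 50 --guidance 3.5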

Training Recipes#

We provide pre-defined training recipes for Flux, which can be used for performance benchmarking and convergence tests with NeMo-Run. These recipes are hosted in flux_training.py.

By default, the Flux model is randomly initialized, with both the number of Flux transformer layers and Flux single transformer layers set to 1. The default recipe also uses MockDataModule, which generates random tensors as image and text inputs.

To launch training with your own dataset, follow the example below to override the default recipe with your own config:

@run.cli.factory(target=llm.train)
def convergence_test() -> run.Partial:
    recipe = flux_training()  # Predefined base recipe where most default values are set
    # Load checkpoints for the embedder components
    recipe.model.flux_params.t5_params = run.Config(
        T5Config, version='/ckpts/text_encoder_2'
    )
    recipe.model.flux_params.clip_params = run.Config(
        ClipConfig, version='/ckpts/text_encoder'
    )
    recipe.model.flux_params.vae_config = run.Config(
        AutoEncoderConfig, ckpt='/ckpts/ae.safetensors', ch_mult=[1, 2, 4, 4], attn_resolutions=[]
    )
    recipe.model.flux_params.device = 'cuda'
    recipe.trainer.devices = 8  # Number of GPUs per node
    # Change the data module from MockDataModule to DiffusionDataModule
    # and provide the path to a Megatron-Energon compatible dataset
    recipe.data = flux_datamodule('/dataset/fill50k/fill50k_tarfiles/')
    recipe.model.flux_params.flux_config = run.Config(
        FluxConfig,
        num_joint_layers=19,
        num_single_layers=38
    )  # Adjust the number of transformer layers
    return recipe

Here is an example command to run your customized recipe:

torchrun --nproc_per_node 8 flux_training.py --yes --factory convergence_test

Flux ControlNet#

Training#

The Flux ControlNet training script, flux_controlnet_training.py, is implemented with NeMo-Run. The FluxControlNetConfig includes all components needed to initialize a Flux model, plus configurations that determine the architecture of the ControlNet part. For example:

@run.cli.factory(target=llm.train)
def unit_test() -> run.Partial:
    ...
    recipe.model.flux_controlnet_config.num_single_layers = 1
    recipe.model.flux_controlnet_config.num_joint_layers = 1
    # Initialize transformer blocks in ControlNet from the weights of the corresponding Flux blocks
    recipe.model.flux_controlnet_config.load_from_flux_transformer = True
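
Assuming flux_controlnet_training.py registers its factories the same way as flux_training.py (an assumption based on the training example above), the recipe can then be launched as:

torchrun --nproc_per_node 8 flux_controlnet_training.py --yes --factory unit_test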

Inference#

The Flux ControlNet inference script, flux_controlnet_infer.py, operates similarly to the standard Flux inference script. However, beyond the checkpoints needed for Flux, Flux ControlNet requires the following additional inputs.

Additional Parameters#

  • controlnet_ckpt: This parameter accepts ControlNet checkpoints from Hugging Face. When the --do_convert_from_hf flag is used, both the Flux checkpoint and the Flux ControlNet checkpoint are converted and saved for Flux inference.

  • control_image: The path to the control image used as an additional condition in the inference pipeline.

  • num_joint_layers: For ControlNet inference, this parameter determines the number of transformer layers in the ControlNet; the number of transformer layers in the Flux model remains at the FLUX.1-dev default. The same applies to num_single_layers. An example command combining these parameters follows this list.
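
The sketch below combines the additional ControlNet parameters with the standard Flux inference arguments; all paths are placeholders, and the exact flag set should be verified against the script's --help output:

torchrun flux_controlnet_infer.py --flux_ckpt /path/to/FLUX.1-dev/transformer \
    --controlnet_ckpt /path/to/controlnet_checkpoint/ --do_convert_from_hf \
    --control_image /path/to/control_image.png \
    --prompts 'A cat holding a sign that says hello world' \
    --inference_steps 50 --guidance 3.5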

Energon-Compatible Dataset#

For details on preparing datasets with Energon, refer to the data preparation section. For Flux, we adopt the WebDataset format, where image-text pairs are saved in tar files with the same basename but different extensions:

.
├── 001.txt
├── 001.jpg
├── 002.txt
└── 002.jpg
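
As a minimal sketch (assuming Megatron-Energon is installed and the file layout above), pairs can be packed into WebDataset-style tar shards and then indexed with Energon's CLI:

# Pack image-text pairs into a tar shard; members with the same basename form one sample
tar -cf fill50k_tarfiles/shard_000.tar 001.jpg 001.txt 002.jpg 002.txt
# Generate the Energon dataset metadata (interactive)
energon prepare /dataset/fill50k/fill50k_tarfiles/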

The dataset consumed by flux_datamodule should be prepared by Energon as a CrudeDataset. We use a dummy task encoder to load the raw images and text inputs. Basic data augmentations will be added to flux_datamodule soon.

For training Flux ControlNet, .jpg files are used as the original images, while .png files serve as the corresponding control inputs.
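
For example, a single ControlNet training sample would then consist of:

.
├── 001.txt
├── 001.jpg
└── 001.png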