# 🤗 Transformers API Compatibility

NeMo AutoModel is built to work with the 🤗 Hugging Face ecosystem.
In practice, compatibility comes in two layers:

* **API compatibility**: for many workflows, you can keep your existing `transformers` code and swap in NeMo AutoModel “drop-in” wrappers (`NeMoAutoModel*`, `NeMoAutoTokenizer`) with minimal changes.
* **Artifact compatibility**: NeMo AutoModel produces **Hugging Face-compatible checkpoints** (config + tokenizer + safetensors) that can be loaded by Hugging Face Transformers and downstream tools (vLLM, SGLang, etc.).

This page summarizes what "HF compatibility" means in NeMo AutoModel, calls out differences you should be aware of, and provides side-by-side examples.

## Transformers Version Compatibility

### Transformers v5 (Current Default)

NeMo AutoModel currently pins Hugging Face Transformers to the **v5** major line (see `pyproject.toml`, currently `transformers==5.5.0`).

This means:

* NeMo AutoModel is primarily tested and released against **Transformers v5.x**.
* New model releases on the Hugging Face Hub that require a newer Transformers version may also require upgrading NeMo AutoModel, just as they would require upgrading `transformers` in a plain Hugging Face setup.
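
Before debugging a model-loading failure, it can help to confirm which Transformers major line is actually installed. The following is a minimal sketch using only the standard library; the helper names `major_version` and `installed_transformers_major` are hypothetical and not part of NeMo AutoModel.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '5.5.0'."""
    return int(version_string.split(".")[0])


def installed_transformers_major() -> Optional[int]:
    """Return the installed Transformers major version, or None if it is not installed."""
    try:
        return major_version(version("transformers"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = installed_transformers_major()
    if major is None:
        print("transformers is not installed in this environment")
    elif major < 5:
        print(f"Transformers v{major} detected; NeMo AutoModel targets v5")
    else:
        print(f"Transformers v{major} detected")
```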

### Transformers v4 Interoperability

Some downstream environments may still run Transformers **v4**, while NeMo AutoModel development and tests now target Transformers **v5**.

NeMo AutoModel keeps v4 interoperability where practical:

* **Compatibility shims**: NeMo AutoModel includes small compatibility patches to smooth over known API differences across Transformers releases (for example, cache utility method names). The built-in recipes apply these patches automatically.
* **Backports where needed**: for some model families, NeMo AutoModel may vendor/backport Hugging Face code so users can run models whose upstream integration has moved between major Transformers releases.
* **Stable artifact format**: NeMo AutoModel checkpoints are written in Hugging Face-compatible `save_pretrained` layouts (config + tokenizer + safetensors). These artifacts are intended for standard HF loading APIs and non-Transformers tools that consume HF-style model repos.

If you need to consume NeMo AutoModel-produced consolidated checkpoints in a Transformers v4 environment, validate that specific model family and downstream tool path. For details on the checkpoint layouts, see [checkpointing](/development/checkpointing).
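
A cheap first check before validating in another environment is to confirm that the directory contains the three artifact kinds named above. This sketch assumes that the config is stored as `config.json`, that tokenizer files start with `tokenizer`, and that weights end in `.safetensors`; the function name `missing_hf_artifacts` is hypothetical. It does not replace actually loading the model in the target environment.

```python
from pathlib import Path


def missing_hf_artifacts(model_dir):
    """Return a list describing which expected artifact kinds are absent.

    Assumed naming (not taken from the NeMo AutoModel docs): config.json,
    tokenizer* files, and *.safetensors weight files.
    """
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any(root.glob("tokenizer*")):
        missing.append("tokenizer files")
    if not any(root.glob("*.safetensors")):
        missing.append("safetensors weights")
    return missing
```

An empty list means all three kinds are present; anything else names what to look for before attempting to load.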

## Drop-In Compatibility and Key Differences

### Drop-In (Same Mental Model as Transformers)

* **Load by model ID or local path**: `from_pretrained(...)`
* **Standard HF config objects**: `AutoConfig` / `config.json`
* **Tokenizers**: standard `PreTrainedTokenizerBase` behavior, including `__call__` to create tensors and `decode`/`batch_decode`
* **Generation**: `model.generate(...)` and the usual generation kwargs

### Differences (Where NeMo AutoModel Adds Value or Has Constraints)

* **Performance features**: NeMo AutoModel can automatically apply optional kernel patches and optimizations (for example, SDPA selection, Liger kernels, and DeepEP) while keeping the public model API the same.
* **Distributed training stack**: NeMo AutoModel's recipes/CLI are designed for multi-GPU/multi-node fine-tuning with PyTorch-native distributed features (FSDP2, pipeline parallelism, etc.).
* **CUDA expectation**: NeMo AutoModel's `NeMoAutoModel*` wrappers are built and optimized for NVIDIA GPU workflows; CPU-only workflows are not a primary target.

`NeMoAutoModelForCausalLM.from_pretrained(...)` currently assumes CUDA is available (it uses `torch.cuda.current_device()` internally). If you need CPU-only inference, use Hugging Face `transformers` directly.
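
One way to handle this in code that must run on both GPU and CPU-only machines is to pick the loader class at runtime. The sketch below is an illustration under that assumption, not a NeMo AutoModel feature; `choose_backend`, `cuda_is_available`, and `load_causal_lm` are hypothetical helper names.

```python
import importlib


def choose_backend(cuda_available: bool):
    """Return (module name, class name) for the causal LM loader to use."""
    if cuda_available:
        return ("nemo_automodel", "NeMoAutoModelForCausalLM")
    # CPU-only: fall back to plain Hugging Face Transformers.
    return ("transformers", "AutoModelForCausalLM")


def cuda_is_available() -> bool:
    """Report CUDA availability, treating a missing torch install as False."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def load_causal_lm(model_id: str, **kwargs):
    """Load a causal LM with whichever backend suits the current machine."""
    module_name, class_name = choose_backend(cuda_is_available())
    loader = getattr(importlib.import_module(module_name), class_name)
    return loader.from_pretrained(model_id, **kwargs)
```

Calling `load_causal_lm("gpt2")` would then use the NeMo AutoModel wrapper on a CUDA machine and Hugging Face `transformers` elsewhere.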

## API Mapping (Transformers and NeMo AutoModel)

### API Name Mapping

<table>
  <thead>
    <tr>
      <th>
        🤗 Hugging Face (

        <code>transformers</code>

        )
      </th>

      <th>
        NeMo AutoModel (

        <code>nemo_automodel</code>

        )
      </th>

      <th>
        Status
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <code>transformers.AutoModelForCausalLM</code>
      </td>

      <td>
        <code>nemo_automodel.NeMoAutoModelForCausalLM</code>
      </td>

      <td>
        ✅
      </td>
    </tr>

    <tr>
      <td>
        <code>transformers.AutoModelForImageTextToText</code>
      </td>

      <td>
        <code>nemo_automodel.NeMoAutoModelForImageTextToText</code>
      </td>

      <td>
        ✅
      </td>
    </tr>

    <tr>
      <td>
        <code>transformers.AutoModelForSequenceClassification</code>
      </td>

      <td>
        <code>nemo_automodel.NeMoAutoModelForSequenceClassification</code>
      </td>

      <td>
        ✅
      </td>
    </tr>

    <tr>
      <td>
        <code>transformers.AutoModelForTextToWaveform</code>
      </td>

      <td>
        <code>nemo_automodel.NeMoAutoModelForTextToWaveform</code>
      </td>

      <td>
        ✅
      </td>
    </tr>

    <tr>
      <td>
        <code>transformers.AutoTokenizer.from_pretrained(...)</code>
      </td>

      <td>
        <code>nemo_automodel.NeMoAutoTokenizer.from_pretrained(...)</code>
      </td>

      <td>
        ✅
      </td>
    </tr>

    <tr>
      <td>
        <code>model.generate(...)</code>
      </td>

      <td>
        <code>model.generate(...)</code>
      </td>

      <td>
        🚧
      </td>
    </tr>

    <tr>
      <td>
        <code>model.save_pretrained(path)</code>
      </td>

      <td>
        <code>model.save_pretrained(path, checkpointer=...)</code>
      </td>

      <td>
        🚧
      </td>
    </tr>
  </tbody>
</table>

## Side-by-Side Examples

### Load a Model and Tokenizer (Transformers vs. NeMo AutoModel)

<table>
  <thead>
    <tr>
      <th>
        🤗 Hugging Face (

        <code>transformers</code>

        )
      </th>

      <th>
        NeMo AutoModel (

        <code>nemo_automodel</code>

        )
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <pre>
          <code>
            {`import torch
            from transformers import AutoModelForCausalLM, AutoTokenizer

            model_id = "gpt2"

            tokenizer = AutoTokenizer.from_pretrained(model_id)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
            )`}
          </code>
        </pre>
      </td>

      <td>
        <pre>
          <code>
            {`import torch
            from nemo_automodel import NeMoAutoModelForCausalLM, NeMoAutoTokenizer

            model_id = "gpt2"

            tokenizer = NeMoAutoTokenizer.from_pretrained(model_id)
            model = NeMoAutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
            )`}
          </code>
        </pre>
      </td>
    </tr>
  </tbody>
</table>

### Text Generation

This snippet assumes you already have a `model` and `tokenizer` (see the loading snippet above).

<table>
  <thead>
    <tr>
      <th>
        🤗 Hugging Face (

        <code>transformers</code>

        )
      </th>

      <th>
        NeMo AutoModel (

        <code>nemo_automodel</code>

        )
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <pre>
          <code>
            {`import torch

            prompt = "Write a haiku about GPU kernels."
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            with torch.inference_mode():
                out = model.generate(**inputs, max_new_tokens=64)

            print(tokenizer.decode(out[0], skip_special_tokens=True))`}
          </code>
        </pre>
      </td>

      <td>
        <pre>
          <code>
            {`import torch

            prompt = "Write a haiku about GPU kernels."
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            with torch.inference_mode():
                out = model.generate(**inputs, max_new_tokens=64)

            print(tokenizer.decode(out[0], skip_special_tokens=True))`}
          </code>
        </pre>
      </td>
    </tr>
  </tbody>
</table>

### Tokenizers (Transformers vs. NeMo AutoModel)

NeMo AutoModel provides `NeMoAutoTokenizer` as a Transformers-like auto-tokenizer with a small registry for specialized backends (and a safe fallback when no specialization is needed).

<table>
  <thead>
    <tr>
      <th>
        🤗 Hugging Face (

        <code>transformers</code>

        )
      </th>

      <th>
        NeMo AutoModel (

        <code>nemo_automodel</code>

        )
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <pre>
          <code>
            {`from transformers import AutoTokenizer

            tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")`}
          </code>
        </pre>
      </td>

      <td>
        <pre>
          <code>
            {`from nemo_automodel import NeMoAutoTokenizer

            tok = NeMoAutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")`}
          </code>
        </pre>
      </td>
    </tr>
  </tbody>
</table>

## Checkpoints: Save in NeMo AutoModel, Load Everywhere

NeMo AutoModel training recipes write checkpoints as sharded safetensors by default. Each checkpoint also gets a generated helper that can export Hugging Face-compatible consolidated safetensors after training. For more detail:

* See [checkpointing](/development/checkpointing) for checkpoint formats and example directory layouts.
* See [model coverage](/model-coverage/overview) for notes on how model support depends on the pinned Transformers version.

If your goal is to **train/fine-tune in NeMo AutoModel → deploy in the HF ecosystem**, the recommended workflow is to keep `model_save_format: safetensors` and choose one of the following export options:

* `save_consolidated: final` exports the final checkpoint in consolidated form.
* `save_consolidated: false` skips inline export; run `bash <checkpoint>/model/consolidate.sh` after training instead.

Then load `model/consolidated/` with the standard HF APIs or downstream inference engines. Set `save_consolidated: every` (or legacy `true`) only if you want inline HF export at every checkpoint save.
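
The loading step might look like the following minimal sketch. It assumes a causal language model and that the tokenizer was saved alongside the consolidated weights; `consolidated_dir` and `load_consolidated` are hypothetical helper names, and the checkpoint path in the usage note is a placeholder.

```python
from pathlib import Path


def consolidated_dir(checkpoint_dir):
    """Return the model/consolidated/ directory inside a checkpoint directory."""
    return Path(checkpoint_dir) / "model" / "consolidated"


def load_consolidated(checkpoint_dir):
    """Load the consolidated export with standard Hugging Face APIs."""
    # Imported lazily so the path helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = consolidated_dir(checkpoint_dir)
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path)
    return model, tokenizer
```

For example, `model, tokenizer = load_consolidated("path/to/checkpoint")` would read from `path/to/checkpoint/model/consolidated`.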