Llama (Bidirectional) for Reranking
NeMo AutoModel provides a bidirectional variant of Meta’s Llama for reranking tasks. Unlike the standard causal (left-to-right) Llama used for text generation, this variant uses bidirectional attention, allowing the query and document to interact across the full sequence before a classification head produces a relevance score.
For the bi-encoder variant, see Llama (Bidirectional) for Embedding.
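The difference between causal and bidirectional attention can be illustrated with the attention masks alone. The following is a minimal pure-Python sketch, not NeMo AutoModel code: the function names and the five-token layout are hypothetical and chosen only for illustration.

```python
def causal_mask(n):
    """Position i may attend only to positions j <= i (left-to-right)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


# Hypothetical token layout for one query-document pair.
tokens = ["q1", "q2", "d1", "d2", "d3"]
n = len(tokens)

causal = causal_mask(n)
bidir = bidirectional_mask(n)

# Under the causal mask the first query token cannot see any document token;
# under the bidirectional mask it can see all of them.
first_doc = tokens.index("d1")
query_sees_doc_causal = any(causal[0][first_doc:])
query_sees_doc_bidir = all(bidir[0][first_doc:])

print("causal:", query_sees_doc_causal)
print("bidirectional:", query_sees_doc_bidir)
```

With the causal mask, query tokens that precede the document never receive information from it, which is why a full (bidirectional) mask suits scoring a query-document pair as a whole.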
Available Models
Any Llama checkpoint can be loaded as a bidirectional reranking backbone. The following configurations have been tested:
- Llama 3.2 1B — fast iteration, fits on a single GPU
- Llama 3.1 8B — higher-quality reranking for production use
Reranking Models
The cross-encoder path is used for pairwise relevance scoring and reranking.
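As a rough illustration of pairwise scoring followed by reranking, the sketch below scores each (query, document) pair independently and sorts the documents by score. It does not call NeMo AutoModel: `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy word-overlap function standing in for a real cross-encoder's relevance score.

```python
from typing import Callable, List, Optional, Tuple


def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return documents, best first."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a model: fraction of query words found in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0


docs = [
    "Causal models generate text left to right",
    "Bidirectional attention helps reranking quality",
    "Attention is a weighting mechanism",
]
ranked = rerank("bidirectional attention reranking", docs, overlap_score)
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```

In practice `score_fn` would wrap a forward pass of the reranking model over the concatenated query and document; the sorting step stays the same.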
Example HF Models
Example Recipes
Try with NeMo AutoModel
1. Install NeMo AutoModel (see the Installation Guide for details):
2. Clone the repo to get the example recipes:
3. Run the recipe from inside the repo:
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory (where the recipes are):
3. Run the recipe:
See the Installation Guide.
Hugging Face Model Cards
NVIDIA trained and released the Llama Nemotron Reranking 1B model, optimized to produce a relevance logit score indicating how well a document matches a given query. The model was fine-tuned with a bidirectional attention mechanism for multilingual and cross-lingual question–answer retrieval, with support for long documents (up to 8,192 tokens).
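A relevance logit is an unbounded real number. If a value in the range 0 to 1 is more convenient downstream, one common option is to pass the logit through a sigmoid; whether that value can be read as a calibrated probability for this model is an assumption to verify against the model card. The sketch below uses made-up logits and shows that the sigmoid leaves the ranking order unchanged.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Made-up logits for three documents; not output from any real model.
logits = {"doc_a": 2.3, "doc_b": -1.1, "doc_c": 0.4}

squashed = {name: sigmoid(value) for name, value in logits.items()}

# The sigmoid is monotonic, so sorting by logit or by squashed score
# yields the same order.
order_by_logit = sorted(logits, key=logits.get, reverse=True)
order_by_sigmoid = sorted(squashed, key=squashed.get, reverse=True)

print(order_by_logit)
print(order_by_sigmoid)
```

Because the order is identical either way, the sigmoid is only needed when a bounded score is required, for example to apply a fixed cutoff.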