Template Processing#

OpenFold2 NIM supports incorporating structural templates into protein structure prediction using mmCIF format strings. This document provides technical details about how template inputs are processed to generate template features.

Note

Starting with version 2.0.0, HHR-based template processing has been removed. If you want to migrate from previous versions, refer to the Migration Guide for detailed instructions.

Overview#

Template processing in OpenFold2 follows two main approaches:

  • No template input: Template atomic coordinates are not used for structure prediction

  • Direct template processing: Uses mmCIF format strings provided directly in the request

This document focuses on the technical details of direct template processing, which provides a streamlined and flexible approach to incorporating structural templates.

Template Processing#

This section explains the fundamental concepts of template processing in OpenFold2.

Required Fields#

  • use_templates: true (required to enable template processing)

  • explicit_templates: mmCIF format strings containing structural template data (supports both mmcif and mmcif.gz formats)

Note

The API data model includes a rank field for template objects. For OpenFold2 monomer use-cases, this field should remain at its default value of -1.

Processing Pipeline#

  1. Parse mmCIF strings: Extract structural data directly from the provided mmCIF content

  2. Chain selection: Automatically select chains with _atom_site.label_asym_id=A from the template structure for featurization

  3. Sequence alignment: Align the query sequence to the template chain sequence using Kalign to establish residue correspondence

  4. Sequence identity validation: Ensure at least 90% sequence identity between query and template sequences (based on the shorter sequence length)

  5. Residue mapping: Create the query-to-template residue mapping from the alignment results

  6. Feature extraction: Generate template features using structural information and alignment mapping
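The residue mapping in step 5 can be sketched as follows. This is a minimal illustration, not the NIM's implementation: it assumes the aligner output is available as two gapped strings of equal length with `-` as the gap character, and the function name `map_query_to_template` is hypothetical.

```python
def map_query_to_template(aligned_query: str, aligned_template: str) -> dict[int, int]:
    """Map 0-based query residue indices to 0-based template residue indices.

    Only columns where both sequences have a residue (no gap) are mapped.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")

    mapping: dict[int, int] = {}
    q_idx = 0  # position in the ungapped query
    t_idx = 0  # position in the ungapped template
    for q_char, t_char in zip(aligned_query, aligned_template):
        if q_char != "-" and t_char != "-":
            mapping[q_idx] = t_idx
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1
    return mapping


# Hypothetical alignment: the template has an extra residue (L) and lacks one (Y).
mapping = map_query_to_template("MKT-AYI", "MKTLA-I")
print(mapping)  # {0: 0, 1: 1, 2: 2, 3: 4, 5: 5}
```

Query residues that align to a gap (index 4 in this example) have no entry in the mapping, so they have no template coordinates to draw on.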

Template Feature Composition#

The resulting template features combine structural and alignment information:

  • Structural information: From mmCIF content (atomic coordinates, residue types)

  • Sequence alignment: Uses Kalign to align query sequence to template chain sequence

  • Residue mapping: Query-to-template correspondence based on alignment results

  • Fixed confidence: Template confidence set to maximum value

  • Streamlined processing: Direct processing without database dependencies

Use Cases#

  • Custom template structures not available in curated databases

  • User-provided experimental or computational structures

  • Rapid prototyping without database infrastructure dependencies

  • When structural geometry is more important than alignment statistics

Example Request Structure#

Uncompressed mmCIF format (shown as a Python dictionary):

{
    "sequence": "GGSKENEISHHAKEIERLQKEIERHKQ...",
    "use_templates": True,
    "explicit_templates": [
        {
            "name": "my_custom_template",
            "format": "mmcif",
            "structure": "MMCIF_STRING_CONTENT_HERE",
            "rank": -1,  # Default value for OpenFold2 monomer use-case
        }
    ]
}

Compressed mmCIF format (shown as a Python dictionary):

{
    "sequence": "GGSKENEISHHAKEIERLQKEIERHKQ...",
    "use_templates": True,
    "explicit_templates": [
        {
            "name": "my_compressed_template",
            "format": "mmcif.gz",
            "structure": b"GZIP_COMPRESSED_MMCIF_BYTES_HERE",
            "rank": -1,  # Default value for OpenFold2 monomer use-case
        }
    ]
}
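The two payload shapes above could be assembled with small helpers like the ones below. This is a sketch under assumptions: the helper names `build_template` and `build_request` and the placeholder sequence and mmCIF text are illustrative, and the way gzip bytes are encoded for transport (for example, by the HTTP client) is not specified here.

```python
import gzip


def build_template(name: str, mmcif_text: str, compress: bool = False) -> dict:
    """Build one explicit_templates entry from mmCIF text (hypothetical helper)."""
    if compress:
        structure = gzip.compress(mmcif_text.encode("utf-8"))
        fmt = "mmcif.gz"
    else:
        structure = mmcif_text
        fmt = "mmcif"
    # rank stays at its default of -1 for the OpenFold2 monomer use case.
    return {"name": name, "format": fmt, "structure": structure, "rank": -1}


def build_request(sequence: str, templates: list[dict]) -> dict:
    """Build the request body with template processing enabled."""
    return {
        "sequence": sequence,
        "use_templates": True,
        "explicit_templates": templates,
    }


# Placeholder inputs; a real request needs a full sequence and a full mmCIF file.
mmcif_text = "data_demo\n#\n"
request = build_request(
    "GGSKENEISHHAKEIERLQKEIERHKQ",
    [
        build_template("my_custom_template", mmcif_text),
        build_template("my_compressed_template", mmcif_text, compress=True),
    ],
)
print([t["format"] for t in request["explicit_templates"]])  # ['mmcif', 'mmcif.gz']
```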

Key Technical Features#

Aspect                  Implementation
----------------------  -----------------------------------------------------------
Data Sources            mmCIF strings (uncompressed) and mmCIF.gz (gzip-compressed)
Alignment Method        Kalign sequence alignment
Confidence Scoring      Fixed maximum confidence
Dependencies            Kalign binary for alignment
Error Handling          mmCIF parsing and alignment validation
Processing Complexity   Direct processing with alignment step

Impact on Structure Prediction#

This section highlights the advantages of the current template processing approach.

Template Processing Advantages#

  • Flexibility: Accepts any structural template, including novel or experimental structures

  • Speed: Direct processing without database lookup overhead

  • Custom structures: Incorporates user-generated models, NMR structures, or computational predictions

  • Simplicity: No external dependencies beyond mmCIF parsing and the Kalign binary used for alignment

  • Reduced footprint: No large template databases required

  • Faster startup: Eliminates database loading time

Performance Considerations#

  • Direct processing provides maximum flexibility and speed with simplified confidence weighting

  • Fixed confidence ensures consistent template utilization across different structure types

  • Direct processing without a database search step reduces computational overhead and potential failure points

Best Practices#

This section describes the best practices for template processing in OpenFold2.

When to Use Templates#

  • When you have high-quality structural templates related to your target sequence

  • For proteins with known homologs or structural domains

  • When incorporating experimental structures or high-confidence computational models

  • For improving prediction accuracy in regions with structural similarity

Template Selection Guidelines#

  • Use templates with at least 90% sequence identity to your target; templates below this threshold are rejected

  • Ensure mmCIF files contain complete atomic coordinate information

  • Use templates with high experimental resolution, when available

  • Consider using multiple complementary templates for better coverage

  • Use mmcif.gz format for large template files to reduce request payload size

Sequence Identity Requirements#

  • Minimum threshold: 90% sequence identity between query and template sequences

  • Calculation method: Based on the shorter of the two sequences (query or template)

  • Validation timing: Performed during the alignment step before feature extraction

  • Failure handling: Templates below 90% identity are rejected with clear error messages
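The identity check described above can be illustrated with a short sketch. It assumes one plausible reading of the rule: identity is the number of aligned columns with identical residues divided by the length of the shorter ungapped sequence. The function names and the example sequences are hypothetical, and the service's exact computation may differ.

```python
MIN_IDENTITY = 0.90  # threshold stated in this document


def sequence_identity(aligned_query: str, aligned_template: str) -> float:
    """Fraction of identical aligned residues, relative to the shorter sequence."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")

    matches = sum(
        1
        for q, t in zip(aligned_query, aligned_template)
        if q == t and q != "-"
    )
    # Ungapped lengths of the two sequences.
    query_len = len(aligned_query.replace("-", ""))
    template_len = len(aligned_template.replace("-", ""))
    shorter = min(query_len, template_len)
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    return matches / shorter


def passes_identity_threshold(aligned_query: str, aligned_template: str) -> bool:
    """Return True when the template meets the minimum identity requirement."""
    return sequence_identity(aligned_query, aligned_template) >= MIN_IDENTITY


# One mismatch in ten residues: 9 / 10 = 0.9, which meets the threshold.
print(passes_identity_threshold("MKTAYIAKQR", "MKTAYIGKQR"))
# Two mismatches in ten residues: 8 / 10 = 0.8, which is rejected.
print(passes_identity_threshold("MKTAYIAKQR", "MKTGYIGKQR"))
```

Because the denominator is the shorter sequence, a short query that matches a region of a longer template exactly still scores 100% under this reading.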

Technical Implementation Notes#

Template processing in OpenFold2 produces standardized template feature data structures that are fully compatible with the OpenFold2 model. Key implementation details:

  • Confidence weighting: Uses fixed maximum confidence for consistent template utilization

  • Alignment method: Kalign-based sequence alignment establishes query-to-template residue correspondence

  • Error handling: Robust mmCIF parsing and sequence identity validation with clear error messages for debugging

  • Chain selection: Automatically selects chains with label_asym_id=A for featurization

  • Memory efficiency: Direct processing without intermediate database storage
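To check in advance which atoms a template contributes under the chain selection rule, a simplified filter over the `_atom_site` loop can help. This sketch is an assumption-laden illustration, not the parser used by the service: it handles only whitespace-separated `_atom_site` rows in loop form, ignores quoted values containing spaces, and the function name `select_chain_atoms` and the toy mmCIF text are hypothetical.

```python
def select_chain_atoms(mmcif_text: str, chain_id: str = "A") -> list[dict[str, str]]:
    """Return _atom_site rows whose label_asym_id equals chain_id."""
    prefix = "_atom_site."
    columns: list[str] = []
    rows: list[dict[str, str]] = []
    in_atom_site = False

    for raw in mmcif_text.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            # Column declaration, e.g. "_atom_site.label_asym_id".
            columns.append(line.split()[0][len(prefix):])
            in_atom_site = True
            continue
        if in_atom_site:
            # A blank line, comment, or new category ends the atom table.
            if not line or line.startswith(("#", "loop_", "_", "data_")):
                break
            values = line.split()
            if len(values) == len(columns):
                row = dict(zip(columns, values))
                if row.get("label_asym_id") == chain_id:
                    rows.append(row)
    return rows


# Toy mmCIF fragment with two chains; real files have many more columns.
example_mmcif = """data_demo
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM CA GLY A 1 0.0 0.0 0.0
ATOM CA SER A 2 3.8 0.0 0.0
ATOM CA LYS B 1 9.9 9.9 9.9
#
"""

atoms = select_chain_atoms(example_mmcif)
print([a["label_comp_id"] for a in atoms])  # ['GLY', 'SER']
```

If this returns an empty list for your file, the template has no chain with label_asym_id=A, and the structure may need its chains relabeled before submission.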

If you want to migrate from HHR-based workflows, refer to the Migration Guide for detailed transition instructions and code examples.