Template Processing#
OpenFold2 NIM supports incorporating structural templates into protein structure prediction using mmCIF format strings. This document provides technical details about how template inputs are processed to generate template features.
Note
Starting with version 2.0.0, HHR-based template processing has been removed. If you want to migrate from previous versions, refer to the Migration Guide for detailed instructions.
Overview#
Template processing in OpenFold2 follows two main approaches:
No template input: Template atomic coordinates are not used for structure prediction
Direct template processing: Uses mmCIF format strings provided directly in the request
This document focuses on the technical details of direct template processing, which provides a streamlined and flexible approach to incorporating structural templates.
Template Processing#
This section explains the fundamental concepts of template processing in OpenFold2.
Required Fields#
use_templates: true (required to enable template processing)
explicit_templates: mmCIF format strings containing structural template data (supports both mmcif and mmcif.gz formats)
Note
The API data model includes a rank field for template objects. For OpenFold2 monomer use-cases, this field should remain at its default value of -1.
Processing Pipeline#
Parse mmCIF strings: Extract structural data directly from the provided mmCIF content
Chain selection: Automatically select chains with _atom_site.label_asym_id=A from the template structure for featurization
Sequence alignment: Align the query sequence to the template chain sequence using Kalign to establish residue correspondence
Sequence identity validation: Ensure at least 90% sequence identity between query and template sequences (based on the shorter sequence length)
Mapping validation: Create query-to-template residue mapping based on the alignment results
Feature extraction: Generate template features using structural information and alignment mapping
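The residue-mapping step above can be illustrated with a small Python sketch. It assumes the alignment is already available as two equal-length gapped strings (for example, as produced by an aligner such as Kalign) and does not call Kalign itself. The function name residue_mapping and the 0-based indexing are illustrative choices, not the NIM's internal implementation.

```python
def residue_mapping(aligned_query: str, aligned_template: str, gap: str = "-") -> dict:
    """Map 0-based query residue indices to 0-based template residue indices.

    Only columns where both sequences have a residue (no gap) produce an entry.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    mapping = {}
    q_idx = 0  # position in the ungapped query
    t_idx = 0  # position in the ungapped template
    for q_char, t_char in zip(aligned_query, aligned_template):
        if q_char != gap and t_char != gap:
            mapping[q_idx] = t_idx
        if q_char != gap:
            q_idx += 1
        if t_char != gap:
            t_idx += 1
    return mapping


# Hypothetical alignment: the query has one extra leading residue,
# and the template has one inserted residue.
example = residue_mapping("GGSK-EN", "-GSKAEN")
```

Query residues that align to a gap in the template have no entry in the mapping, so no template coordinates would be available for them.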
Template Feature Composition#
The resulting template features combine structural and alignment information:
Structural information: From mmCIF content (atomic coordinates, residue types)
Sequence alignment: Uses Kalign to align query sequence to template chain sequence
Residue mapping: Query-to-template correspondence based on alignment results
Fixed confidence: Template confidence set to maximum value
Streamlined processing: Direct processing without database dependencies
Use Cases#
Custom template structures not available in curated databases
User-provided experimental or computational structures
Rapid prototyping without database infrastructure dependencies
When structural geometry is more important than alignment statistics
Example Request Structure#
Uncompressed mmCIF format:
{
"sequence": "GGSKENEISHHAKEIERLQKEIERHKQ...",
"use_templates": true,
"explicit_templates": [
{
"name": "my_custom_template",
"format": "mmcif",
"structure": "MMCIF_STRING_CONTENT_HERE",
"rank": -1 # Default value for OpenFold2 monomer use-case
}
]
}
Compressed mmCIF format:
{
"sequence": "GGSKENEISHHAKEIERLQKEIERHKQ...",
"use_templates": true,
"explicit_templates": [
{
"name": "my_compressed_template",
"format": "mmcif.gz",
"structure": b"GZIP_COMPRESSED_MMCIF_BYTES_HERE",
"rank": -1 # Default value for OpenFold2 monomer use-case
}
]
}
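The uncompressed request shown above could be assembled in Python roughly as follows. This is a sketch under stated assumptions: the helper name build_template_request, the placeholder sequence, and the placeholder mmCIF text are illustrative and not part of the API. The endpoint that receives the body is not covered here.

```python
import json


def build_template_request(sequence: str, mmcif_text: str,
                           name: str = "my_custom_template") -> dict:
    """Build a request body with one uncompressed mmCIF template."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    if not mmcif_text.strip():
        raise ValueError("mmcif_text must not be empty")
    return {
        "sequence": sequence,
        "use_templates": True,  # required to enable template processing
        "explicit_templates": [
            {
                "name": name,
                "format": "mmcif",
                "structure": mmcif_text,
                "rank": -1,  # default value for the monomer use-case
            }
        ],
    }


# Placeholder inputs for illustration only.
payload = build_template_request("ACDEFGHIKL", "data_example\n#\n")
body = json.dumps(payload)  # Python True is serialized as JSON true
```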
Key Technical Features#
Aspect | Implementation |
---|---|
Data Sources | mmCIF strings (uncompressed) and mmCIF.gz (gzip-compressed) |
Alignment Method | Kalign sequence alignment |
Confidence Scoring | Fixed maximum confidence |
Dependencies | Kalign binary for alignment |
Error Handling | mmCIF parsing and alignment validation |
Processing Complexity | Direct processing with alignment step |
Impact on Structure Prediction#
This section highlights the advantages of the current template processing approach.
Template Processing Advantages#
Flexibility: Accepts any structural template, including novel or experimental structures
Speed: Direct processing without database lookup overhead
Custom structures: Incorporates user-generated models, NMR structures, or computational predictions
Simplicity: No external dependencies beyond mmCIF parsing and the Kalign binary used for alignment
Reduced footprint: No large template databases required
Faster startup: Eliminates database loading time
Performance Considerations#
Direct processing avoids database lookups, and every accepted template receives the same fixed confidence rather than a per-template score
Fixed confidence ensures consistent template utilization across different structure types
Processing each template in a single pass, with no separate database search stage, reduces computational overhead and potential failure points
Best Practices#
This section describes the best practices for template processing in OpenFold2.
When to Use Templates#
When you have high-quality structural templates related to your target sequence
For proteins with known homologs or structural domains
When incorporating experimental structures or high-confidence computational models
For improving prediction accuracy in regions with structural similarity
Template Selection Guidelines#
Use templates with good sequence similarity to your target (≥90% sequence identity required)
Ensure mmCIF files contain complete atomic coordinate information
Use templates with high experimental resolution, when available
Consider using multiple complementary templates for better coverage
Use mmcif.gz format for large template files to reduce request payload size
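As a rough illustration of the payload-size guideline, the following Python sketch gzip-compresses mmCIF text using the standard library and checks that it round-trips. The helper names and the synthetic sample text are assumptions made for this example. How the compressed bytes are encoded inside the request body is not specified in this document, so that step is left out.

```python
import gzip


def compress_mmcif(mmcif_text: str) -> bytes:
    """Gzip-compress mmCIF text (UTF-8 encoded)."""
    return gzip.compress(mmcif_text.encode("utf-8"))


def decompress_mmcif(blob: bytes) -> str:
    """Inverse of compress_mmcif, useful for verifying a payload locally."""
    return gzip.decompress(blob).decode("utf-8")


# Synthetic, highly repetitive text standing in for a large mmCIF file.
sample = "data_example\n" + "ATOM 1 N GLY A 1 0.000 0.000 0.000\n" * 200
blob = compress_mmcif(sample)

original_size = len(sample.encode("utf-8"))
compressed_size = len(blob)
print(f"{original_size} bytes -> {compressed_size} bytes")
```

Coordinate records are repetitive, so real mmCIF files usually compress well, although the exact ratio depends on the file.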
Sequence Identity Requirements#
Minimum threshold: 90% sequence identity between query and template sequences
Calculation method: Based on the shorter of the two sequences (query or template)
Validation timing: Performed during the alignment step before feature extraction
Failure handling: Templates below 90% identity are rejected with clear error messages
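The identity rule above can be sketched in Python as follows. This is one plausible reading of "based on the shorter of the two sequences": the number of identical aligned residues divided by the ungapped length of the shorter sequence. The function names and the exact formula are assumptions, and the service's own calculation may differ in detail.

```python
MIN_IDENTITY = 0.90  # threshold stated in this document


def sequence_identity(aligned_query: str, aligned_template: str, gap: str = "-") -> float:
    """Fraction of identical aligned residues, relative to the shorter sequence."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    # Count columns where both sequences carry the same residue (gaps never match).
    matches = sum(
        1 for q, t in zip(aligned_query, aligned_template) if q == t and q != gap
    )
    query_len = len(aligned_query.replace(gap, ""))
    template_len = len(aligned_template.replace(gap, ""))
    shorter = min(query_len, template_len)
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return matches / shorter


def passes_identity_threshold(aligned_query: str, aligned_template: str) -> bool:
    """True when the pair meets the 90% minimum identity."""
    return sequence_identity(aligned_query, aligned_template) >= MIN_IDENTITY


# Hypothetical alignments: one substitution out of ten residues, then three.
accepted = passes_identity_threshold("ACDEFGHIKL", "ACDEFGHIKV")
rejected = passes_identity_threshold("ACDEFGHIKL", "ACDEFGHAAA")
```

Dividing by the shorter length means a short query that fully matches part of a longer template can still pass the check.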
Technical Implementation Notes#
Template processing in OpenFold2 produces standardized template feature data structures that are fully compatible with the OpenFold2 model. Key implementation details:
Confidence weighting: Uses fixed maximum confidence for consistent template utilization
Alignment method: Kalign-based sequence alignment establishes query-to-template residue correspondence
Error handling: Robust mmCIF parsing and sequence identity validation with clear error messages for debugging
Chain selection: Automatically selects chains with label_asym_id=A for featurization
Memory efficiency: Direct processing without intermediate database storage
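To make the chain-selection rule concrete, here is a deliberately simplified Python sketch that keeps only the _atom_site rows whose label_asym_id is A. It assumes one atom record per line and no quoted values containing spaces, which does not hold for all mmCIF files, so a dedicated mmCIF parser is the better choice in practice. The function name select_chain_atoms is hypothetical and this is not the parser used by the NIM.

```python
def select_chain_atoms(mmcif_text: str, chain_id: str = "A") -> list:
    """Return _atom_site rows (as dicts) whose label_asym_id equals chain_id."""
    columns = []
    selected = []
    state = "idle"  # idle -> header (reading column names) -> rows
    for raw in mmcif_text.splitlines():
        line = raw.strip()
        if state == "idle":
            if line == "loop_":
                state, columns = "header", []
        elif state == "header":
            if line.startswith("_atom_site."):
                columns.append(line.split(".", 1)[1])
            elif columns and line and not line.startswith(("_", "#")):
                state = "rows"  # first data line of the _atom_site loop
            else:
                state = "idle"  # this loop belongs to another category
        if state == "rows":
            if not line or line.startswith(("#", "_", "loop_", "data_")):
                break  # end of the _atom_site loop
            fields = line.split()
            if len(fields) == len(columns):
                row = dict(zip(columns, fields))
                if row.get("label_asym_id") == chain_id:
                    selected.append(row)
    return selected


# Tiny hand-written fragment for illustration only.
SAMPLE = """data_example
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N GLY A 1 0.0 0.0 0.0
ATOM 2 CA GLY A 1 1.5 0.0 0.0
ATOM 3 N SER B 1 9.0 9.0 9.0
#
"""

chain_a = select_chain_atoms(SAMPLE)
```

If a template structure stores the chain of interest under a different label_asym_id, that chain would not be selected by this rule, so it may be worth checking the labels before submitting a template.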
If you want to migrate from HHR-based workflows, refer to the Migration Guide for detailed transition instructions and code examples.