API Reference#

This documentation contains the API reference for the MSA Search NIM.

OpenAPI Specification#

You can download or view the OpenAPI specification when the NIM is running:

curl http://localhost:8000/openapi.json

You can also navigate to the interactive API documentation at http://localhost:8000/docs in your browser.
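
The same specification can also be read from a script. The following is a minimal sketch using only the Python standard library, assuming the NIM listens on http://localhost:8000 as in the curl command above. The helper names fetch_openapi_spec and list_operations are illustrative and not part of the NIM.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # same host and port as the curl example


def fetch_openapi_spec(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Download the OpenAPI document from a running NIM and decode the JSON."""
    with urlopen(f"{base_url}/openapi.json", timeout=timeout) as response:
        return json.load(response)


def list_operations(spec: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found under the spec's "paths" object."""
    http_methods = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}
    operations = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key in http_methods:
                operations.append((key.upper(), path))
    return sorted(operations)


# Usage against a running NIM (requires network access to the container):
#   for method, path in list_operations(fetch_openapi_spec()):
#       print(method, path)
```

list_operations works on any decoded OpenAPI document, so it can also be applied to a specification saved to disk.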

Multiple Sequence Alignment Search#

Endpoint path: /biology/colabfold/msa-search/predict

Request type: POST

When GPU Server is enabled (the default in version 2.0.0 and later), each request’s max_msa_sequences must exactly equal the container’s NIM_GLOBAL_MAX_MSA_DEPTH (default 500). If they differ, the request fails. When GPU Server is disabled (NIM_DISABLE_GPU_SERVER=True), you may use any max_msa_sequences value in the documented range.
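
A client can apply the same rule before sending a request. The sketch below is a hypothetical client-side check, not code from the NIM. The caller is assumed to know the container's NIM_GLOBAL_MAX_MSA_DEPTH and whether GPU Server is enabled, and the function name is illustrative.

```python
def check_max_msa_sequences(
    requested: int,
    global_max_depth: int = 500,      # NIM_GLOBAL_MAX_MSA_DEPTH (default 500)
    gpu_server_enabled: bool = True,  # default in version 2.0.0 and later
) -> None:
    """Raise ValueError if `requested` would violate the rule described above."""
    if gpu_server_enabled:
        # With GPU Server enabled, the value must match the container setting exactly.
        if requested != global_max_depth:
            raise ValueError(
                f"max_msa_sequences must equal {global_max_depth} "
                f"when GPU Server is enabled (got {requested})"
            )
    elif not 1 <= requested <= global_max_depth:
        # With GPU Server disabled, any value in the documented range is allowed.
        raise ValueError(
            f"max_msa_sequences must be between 1 and {global_max_depth} "
            f"(got {requested})"
        )
```

Calling check_max_msa_sequences(500) passes with the defaults, while check_max_msa_sequences(100) raises unless gpu_server_enabled=False is passed.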

Input Parameters#

sequence (string, required): A sequence to search against the MSA databases. Must be a valid protein sequence composed of the 20 standard amino acids plus X for unknown residues (ARNDCQEGHILKMFPSTWYVX). Length: 1-4096 characters.

Example:
```
"SGSMKTAISLPDETFDRVSRRASELGMSRSEFFTKAAQR"
```
databases (list[string], optional): Database names to search against. All databases are searched by default. Database names are case-insensitive; the response preserves the case you specify. Default: ["all"].

Important: For the "colabfold" search type, the first database in the list is used for profile generation. When you specify ["all"], uniref30 is automatically placed first.

Examples: ["all"], ["uniref30_2302"], ["uniref30_2302", "pdb70_220313"]
search_type (string, optional): Which type of MSA Search to run for alignment production. Default: "colabfold".

Options:
- "colabfold": Cascaded search with higher sensitivity. The first database is used for profile generation.
- "alphafold2": Single-pass iterative search
Examples: "colabfold", "alphafold2"
e_value (float, optional): The e-value threshold for filtering hits when building the Multiple Sequence Alignment. Sequences with an e-value greater than this are not included in the MSA. Range: 0.0-1.0. Default: 0.0001.
iterations (int, optional): The number of MSA iterations to perform, where more iterations find more distant homologs. Default: 1. Note: For cascaded search (search_type="colabfold"), the number of iterations is fixed to 3 and this parameter is ignored.
max_msa_sequences (int, optional): Maximum number of sequences returned per individual database (N). Each database’s result is trimmed to at most N sequences. The merged colabfold entry is not trimmed: it concatenates the untrimmed results from all D databases, so it can contain up to D × U sequences, where U is the untrimmed per-database result size and can exceed N. The cascaded pipeline first accepts up to max_accept targets (default 100), then computes up to alt_ali alternative alignments per target (default 10), giving U ≤ max_accept × (1 + alt_ali) = 1100 with defaults. These limits are configurable using NIM_MMSEQS_PROFILE_ALIGN_MAX_ACCEPT / NIM_MMSEQS_FOLLOWUP_ALIGN_MAX_ACCEPT and NIM_MMSEQS_PROFILE_ALIGN_ALT_ALI / NIM_MMSEQS_FOLLOWUP_ALIGN_ALT_ALI. Range: 1 to NIM_GLOBAL_MAX_MSA_DEPTH (default 500); when GPU Server is enabled, you must set this field to exactly NIM_GLOBAL_MAX_MSA_DEPTH, as described at the start of this section.
output_alignment_formats (list[string], optional): The output format of the MSA. Supported formats: "a3m", "fasta". Default: ["a3m"].

Examples: ["a3m"], ["a3m", "fasta"]
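
The parameters above map directly onto a JSON request body. The following sketch builds and validates such a body, then posts it using only the Python standard library. The URL assumes the NIM is reachable at localhost:8000, the function names are illustrative, and the uppercase-only residue check is an assumption of this sketch rather than documented server behavior.

```python
import json
from urllib.request import Request, urlopen

MSA_SEARCH_URL = "http://localhost:8000/biology/colabfold/msa-search/predict"
ALLOWED_RESIDUES = set("ARNDCQEGHILKMFPSTWYVX")


def build_msa_request(
    sequence: str,
    databases=("all",),
    search_type: str = "colabfold",
    e_value: float = 0.0001,
    max_msa_sequences: int = 500,
    output_alignment_formats=("a3m",),
) -> dict:
    """Validate the documented constraints and return the JSON request body."""
    if not 1 <= len(sequence) <= 4096:
        raise ValueError("sequence length must be between 1 and 4096")
    invalid = set(sequence) - ALLOWED_RESIDUES
    if invalid:
        raise ValueError(f"unsupported residues: {sorted(invalid)}")
    if search_type not in ("colabfold", "alphafold2"):
        raise ValueError(f"unknown search_type: {search_type!r}")
    if not 0.0 <= e_value <= 1.0:
        raise ValueError("e_value must be between 0.0 and 1.0")
    # `iterations` is left out: it is ignored when search_type is "colabfold".
    return {
        "sequence": sequence,
        "databases": list(databases),
        "search_type": search_type,
        "e_value": e_value,
        "max_msa_sequences": max_msa_sequences,
        "output_alignment_formats": list(output_alignment_formats),
    }


def post_json(url: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage against a running NIM:
#   body = build_msa_request("SGSMKTAISLPDETFDRVSRRASELGMSRSEFFTKAAQR")
#   result = post_json(MSA_SEARCH_URL, body)
```

The default max_msa_sequences of 500 matches the default NIM_GLOBAL_MAX_MSA_DEPTH, which is the value required when GPU Server is enabled.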

Outputs#

alignments (Dictionary[string → Dictionary[string → AlignmentFileRecord]]): The MSA alignments organized by database and format. For example, alignments["uniref30_2302"]["a3m"] contains the A3M alignment for the uniref30 database. For the colabfold search type, when multiple databases are searched, an additional colabfold key contains the merged alignment. When only a single database is searched, no colabfold key is present.

The merged colabfold alignment concatenates results from all databases, so the query sequence appears once per source database. Unlike per-database entries, which are trimmed to max_msa_sequences (N), the merged entry concatenates the untrimmed results from all D databases, so it can contain up to D × U sequences, where U is the untrimmed per-database result size and can exceed N. The cascaded pipeline first accepts up to max_accept targets (default 100), then computes up to alt_ali alternative alignments per target (default 10), giving U ≤ max_accept × (1 + alt_ali) = 1100 with defaults. These limits are set by NIM_MMSEQS_PROFILE_ALIGN_MAX_ACCEPT / NIM_MMSEQS_FOLLOWUP_ALIGN_MAX_ACCEPT and NIM_MMSEQS_PROFILE_ALIGN_ALT_ALI / NIM_MMSEQS_FOLLOWUP_ALIGN_ALT_ALI. The merged result is not globally sorted by e-value: sequences from each database are sorted within their block, but blocks are concatenated in database order.

The merged colabfold key is provided for compatibility and may be removed in a future release. Avoid using the colabfold key if possible.
metrics (dictionary, optional): Contains information about the response useful for debugging and measuring performance. May be empty or null.
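
The size bound given for the merged colabfold entry under alignments can be worked through numerically. The sketch below only restates the arithmetic D × max_accept × (1 + alt_ali). It assumes the profile and follow-up alignment steps use the same max_accept and alt_ali values, and the function name is illustrative.

```python
def merged_entry_upper_bound(
    num_databases: int,
    max_accept: int = 100,  # default number of accepted targets
    alt_ali: int = 10,      # default alternative alignments per target
) -> int:
    """Upper bound on rows in the merged colabfold entry: D * max_accept * (1 + alt_ali)."""
    if num_databases < 2:
        raise ValueError("the merged entry only exists when two or more databases are searched")
    per_database = max_accept * (1 + alt_ali)  # U <= 1100 with the defaults
    return num_databases * per_database
```

With the defaults, two databases give an upper bound of 2 × 1100 = 2200 rows, compared with at most 500 rows in each per-database entry when max_msa_sequences is 500.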

Example response:

{
  "alignments": {
    "uniref30_2302": {
      "a3m": {
        "alignment": ">query\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH\n>UniRef100_UPI0002FEA2E8\nMVLSPADKTNVKAAW...\n",
        "format": "a3m"
      }
    },
    "colabfold": {
      "a3m": {
        "alignment": ">query\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH\n>...\n",
        "format": "a3m"
      }
    }
  },
  "metrics": {}
}

When a single database is selected, the merged colabfold key is omitted; when multiple databases are used, the colabfold entry may appear as described under alignments above.
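
A response shaped like the example above can be summarized per database while skipping the merged colabfold key. This is a minimal sketch that operates on the decoded JSON response; the function names are illustrative, and the count includes the query sequence because it is the first record of each alignment.

```python
def count_sequences(alignment_text: str) -> int:
    """Count records in an A3M or FASTA string by counting header lines."""
    return sum(1 for line in alignment_text.splitlines() if line.startswith(">"))


def per_database_depths(response: dict, fmt: str = "a3m", skip_merged: bool = True) -> dict:
    """Map each database name to the number of sequences in its alignment."""
    depths = {}
    for db_name, by_format in response.get("alignments", {}).items():
        if skip_merged and db_name == "colabfold":
            continue  # merged entry: untrimmed and kept only for compatibility
        record = by_format.get(fmt)
        if record is not None:
            depths[db_name] = count_sequences(record["alignment"])
    return depths
```

Because per-database entries are trimmed to max_msa_sequences, each reported depth should not exceed the value sent in the request.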

Paired Multiple Sequence Alignment Search#

Endpoint path: /biology/colabfold/msa-search/paired/predict

Request type: POST

Paired MSA search finds homologous sequences for each chain of a protein complex and pairs them by species, preserving co-evolutionary signals across chains. This is essential for accurate structure prediction of protein complexes.

Input Parameters#

sequences (list[string] or dict[string, string], required): Protein sequences, one per chain. Must contain at least 2 sequences. Each sequence must be composed of the 20 standard amino acids plus X for unknown residues (ARNDCQEGHILKMFPSTWYVX). Can be provided as a list (chain IDs assigned automatically as identifiers such as “A” and “B”) or as a dictionary keyed by chain ID.

Examples:
```
["VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH", "MHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPYTQRFFESFGDLST"]
```
```
{"A": "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH", "B": "MHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPYTQRFFESFGDLST"}
```
databases (list[string], optional): Databases to search against. Only databases with taxonomy information can be used for paired search. Database names are case-insensitive; the response preserves the case you specify. Default: ["all"].

Examples: ["all"], ["uniref30_2302"]
e_value (float, optional): The e-value threshold for filtering hits when building the Multiple Sequence Alignment. Sequences with an e-value greater than this are not included in the MSA. Range: 0.0-1.0. Default: 0.0001.
max_msa_sequences (int, optional): Maximum number of sequences per individual database per chain (N). Each database’s result is trimmed to at most N sequences per chain. This limit applies only when unpack=True. When unpack=False, the row count can exceed N because the cascaded search pipeline (expand + align) can produce up to D × U sequences, where D is the number of databases and U ≤ max_accept × (1 + alt_ali) = 1100 with defaults (refer to max_msa_sequences under Multiple Sequence Alignment Search for details). Range: 1 to NIM_GLOBAL_MAX_MSA_DEPTH (default 500); when GPU Server is enabled, you must set this field to exactly NIM_GLOBAL_MAX_MSA_DEPTH. For more information, refer to Multiple Sequence Alignment Search.
pairing_strategy (string, optional): Pairing strategy for cross-chain sequence matching. Default: "greedy".

Options:
- "greedy": Maximizes rows by including species with partial chain coverage, pairing all chains that have hits and leaving gaps in others.
- "complete": Only includes species where all chains have hits, producing fewer rows but with full coverage.
For 2-chain searches, both strategies produce identical results; differences only arise for three or more chains.
unpack (boolean, optional): Controls output format. Default: True.
- True: Returns one alignment per chain, keyed by chain ID (for example, “A”, “B”), each strictly limited to max_msa_sequences rows.
- False: Returns raw MMseqs2 output as a single “all_chains” alignment with chains concatenated using null-byte separators. Row count may exceed NIM_GLOBAL_MAX_MSA_DEPTH due to MMseqs2 internal pipeline behavior (prefilter limits are per-step, not strict global caps).
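
The difference between the two pairing_strategy options can be illustrated with a toy model. The sketch below is not the service's implementation: it assumes that "greedy" keeps a species when at least two chains have a hit for it (an assumption inferred from the statement that both strategies agree for two chains), while "complete" requires a hit in every chain. The function name and the species labels are hypothetical.

```python
def paired_species(hits_by_chain: dict, strategy: str = "greedy") -> list:
    """Return the species that would contribute a paired row in this toy model.

    hits_by_chain maps a chain ID to the set of species with a hit for that chain.
    """
    if strategy not in ("greedy", "complete"):
        raise ValueError(f"unknown pairing_strategy: {strategy!r}")
    chains = list(hits_by_chain)
    all_species = set().union(*hits_by_chain.values())
    # "complete": every chain must have a hit.
    # "greedy" (assumed): a hit in any two chains is enough; other chains get gaps.
    required = len(chains) if strategy == "complete" else 2
    return sorted(
        species
        for species in all_species
        if sum(species in hits_by_chain[chain] for chain in chains) >= required
    )


hits = {
    "A": {"human", "mouse", "yeast"},
    "B": {"human", "mouse"},
    "C": {"human", "yeast", "fly"},
}
greedy_rows = paired_species(hits, "greedy")      # species covered by two or more chains
complete_rows = paired_species(hits, "complete")  # species covered by all three chains
```

In this toy input, the greedy model keeps human, mouse, and yeast, while the complete model keeps only human. With two chains, both thresholds equal 2, which is consistent with the two strategies producing identical results.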

Outputs#

Paired MSA search returns alignment payloads in A3M format only.

alignments_by_chain (Dictionary[string → Dictionary[string → Dictionary[string → AlignmentFileRecord]]]): Paired MSA alignments organized by chain ID. When unpack=True, structure is {chain_id: {database_name: {format: AlignmentFileRecord}}}. When unpack=False, contains a single “all_chains” key with the concatenated paired alignment.

Example response (with unpack=True):

{
  "alignments_by_chain": {
    "A": {
      "uniref30_2302": {
        "a3m": {
          "alignment": ">A|-|A\nVLSPADKTNVKAAWGKV...\n>UniRef100_UPI00148F070C...\nVLSPAD...",
          "format": "a3m"
        }
      }
    },
    "B": {
      "uniref30_2302": {
        "a3m": {
          "alignment": ">B|-|B\nMHLTPEEKSAVTALWGKV...\n>UniRef100_UPI0008DEA318...\nMHLTPE...",
          "format": "a3m"
        }
      }
    }
  },
  "metrics": {}
}

To access the A3M-formatted alignment for chain A from the uniref30_2302 database:

alignments_by_chain["A"]["uniref30_2302"]["a3m"]["alignment"]
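
The same lookup can be applied to every chain at once. The following sketch works on a decoded response obtained with unpack=True and leaves the unpack=False layout alone; the function name is illustrative. Because the response preserves the case of the database names sent in the request, pass the database name exactly as it was requested.

```python
def chain_alignments(response: dict, database: str, fmt: str = "a3m") -> dict:
    """Map each chain ID to its alignment text for one database (unpack=True layout)."""
    result = {}
    for chain_id, by_database in response["alignments_by_chain"].items():
        record = by_database.get(database, {}).get(fmt)
        if record is not None:
            result[chain_id] = record["alignment"]
    return result


def first_header(alignment_text: str) -> str:
    """Return the first header line of an A3M string without the leading '>'."""
    for line in alignment_text.splitlines():
        if line.startswith(">"):
            return line[1:]
    return ""
```

For the example response above, chain_alignments(response, "uniref30_2302") returns entries for chains "A" and "B", and first_header on the chain A alignment returns "A|-|A".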

metrics (dictionary, optional): Contains information about the response useful for debugging and measuring performance. May be empty or null.

Structural Template Search#

Endpoint path: /biology/colabfold/msa-search/structure-templates/predict

Request type: POST

Structural template search finds homologous protein structures by searching PDB-based databases and retrieves the corresponding mmCIF structure files. This endpoint combines MSA generation with template discovery in a single request, providing all inputs needed for template-based structure prediction.

Input Parameters#

sequence (string, required): A protein sequence to search against the databases. Must be composed of the 20 standard amino acids plus X for unknown residues (ARNDCQEGHILKMFPSTWYVX). Length: 1-4096 characters.

Example:
```
"SGSMKTAISLPDETFDRVSRRASELGMSRSEFFTKAAQR"
```
structural_template_databases (list[string], optional): List of databases to search for structural templates. Database names are case-insensitive; the response preserves the case you specify. Default: value of NIM_MSA_API_DEFAULT_STRUCTURAL_TEMPLATE_DBS environment variable (typically ["pdb70_220313"]).

Examples: ["pdb70_220313"], ["pdb70_220313", "pdb100_230517"]
msa_databases (list[string], optional): Database names to search for MSA generation. The first database is used for profile generation, which determines template search results. Database names are case-insensitive; the response preserves the case you specify. Default: ["all"].

Examples: ["all"], ["uniref30_2302"]
e_value (float, optional): The e-value threshold for filtering hits. Range: 0.0-1.0. Default: 0.0001.
max_structures (int, optional): Maximum number of PDB structures to return from template search. Default: 20.
max_msa_sequences (int, optional): Maximum number of sequences returned per individual database (N). Each database’s result is trimmed to at most N sequences. The merged colabfold entry is not trimmed: it concatenates the untrimmed results from all D databases, so it can contain up to D × U sequences, where U is the untrimmed per-database result size and can exceed N. The cascaded pipeline first accepts up to max_accept targets (default 100), then computes up to alt_ali alternative alignments per target (default 10), giving U ≤ max_accept × (1 + alt_ali) = 1100 with defaults. These limits are configurable using NIM_MMSEQS_PROFILE_ALIGN_MAX_ACCEPT / NIM_MMSEQS_FOLLOWUP_ALIGN_MAX_ACCEPT and NIM_MMSEQS_PROFILE_ALIGN_ALT_ALI / NIM_MMSEQS_FOLLOWUP_ALIGN_ALT_ALI. Range: 1 to NIM_GLOBAL_MAX_MSA_DEPTH (default 500); when GPU Server is enabled, you must set this field to exactly NIM_GLOBAL_MAX_MSA_DEPTH. For more information, refer to Multiple Sequence Alignment Search.
output_alignment_formats (list[string], optional): Output formats for the MSA fields in alignments (same as standard MSA search). Supported values: "a3m", "fasta". Default: ["a3m"]. Does not affect template hit tables (search_hits, M8) or structure payloads (structures, mmCIF).

Outputs#

The response includes the same fields as the standard MSA search endpoint, plus template-specific outputs.

MSA fields in alignments follow the same output_alignment_formats behavior as Multiple Sequence Alignment Search (default A3M; optional FASTA). Template hit strings in search_hits are always M8 (BLAST tabular); structures in structures are always mmCIF. Template search runs the ColabFold cascaded search approach internally, and the profile from the first MSA database (for example, uniref30_2302) is used to find structural templates.

alignments: MSA alignments organized by database and format (same as standard MSA search). Refer to the Outputs of Multiple Sequence Alignment Search for details on the merged colabfold entry.
search_hits (Dictionary[string → Dictionary[string → SearchHitRecord]]): Structural template hits organized by database. Each entry contains template hits in M8 (BLAST tabular) format.

Each SearchHitRecord contains:
- hits (string): Template hits with columns: query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits, and cigar. The output format can be customized using the NIM_MMSEQS_TEMPLATE_CONVERTALIS_FORMAT environment variable.
- format (string): Always "m8".
structures (Dictionary[string → StructuralTemplate]): Retrieved PDB structures for template hits, organized by PDB ID.

Each StructuralTemplate contains:
- structure (string): The mmCIF file content
- format (string): Always "mmcif"
metrics (dictionary, optional): Performance and debugging information

Example response:

{
  "alignments": {
    "uniref30_2302": {
      "a3m": {
        "alignment": ">query\nMVPSAGQLALF...",
        "format": "a3m"
      }
    },
    "colabfold": {
      "a3m": {
        "alignment": ">query\nMVPSAGQLALF...",
        "format": "a3m"
      }
    }
  },
  "search_hits": {
    "pdb70_220313": {
      "m8": {
        "hits": "query\t1abc_A\t85.0\t150\t...",
        "format": "m8"
      }
    }
  },
  "structures": {
    "1abc": {
      "structure": "data_1ABC\n_entry.id 1ABC\n...",
      "format": "mmcif"
    }
  },
  "metrics": {}
}
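
The hits string in each SearchHitRecord can be split into typed rows. The sketch below assumes the default column layout listed under SearchHitRecord, tab-separated fields, and no header row. If NIM_MMSEQS_TEMPLATE_CONVERTALIS_FORMAT is customized, the column tuple must be changed to match. The function name and the sample row are illustrative.

```python
M8_COLUMNS = (
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits", "cigar",
)
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_COLUMNS = {"fident", "evalue", "bits"}


def parse_m8(hits: str, columns=M8_COLUMNS) -> list:
    """Parse tab-separated M8 text into a list of dictionaries, one per hit."""
    rows = []
    for line in hits.splitlines():
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(fields)}")
        row = dict(zip(columns, fields))
        for name in row:
            if name in INT_COLUMNS:
                row[name] = int(row[name])
            elif name in FLOAT_COLUMNS:
                row[name] = float(row[name])
        rows.append(row)
    return rows


# Hypothetical hit line with all 13 default columns filled in.
sample = "query\t1abc_A\t85.0\t150\t20\t2\t1\t150\t3\t152\t1.2e-50\t250\t150M\n"
parsed = parse_m8(sample)
```

Each parsed row keeps query, target, and cigar as strings and converts the remaining columns to numbers, which makes it straightforward to filter hits by evalue or alnlen before loading the matching entries from structures.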

Get Database Configuration#

Endpoint path: /biology/colabfold/msa-search/config/msa-database-configs

Request type: GET

Input Parameters#

None.

Outputs#

Returns a list of database configuration objects, each containing:

name (string): The database identifier used in API requests (for example, uniref30_2302).
display_name (string): A human-readable display name suitable for UI presentation (for example, UniProt Reference Clusters (30% identity) 2023-02). Falls back to name for custom databases without a configured display name. Customizable using the NIM_MSA_DB_DISPLAY_NAMES environment variable or by mounting a custom /opt/nim/msa/config.py.
relative_path (string, deprecated): Always N/A. Refer to the /v1/metadata endpoint.
index_relative_path (string, deprecated): Always N/A. Refer to the /v1/metadata endpoint.
ngc_model (string, deprecated): Always N/A. Refer to the /v1/metadata endpoint.
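
A common use of this endpoint is to build a lookup from database identifiers to display names, for example to populate a selection menu. The sketch below works on the decoded list of configuration objects; the function name and the custom database entry in the usage comment are hypothetical.

```python
def display_names(configs: list) -> dict:
    """Map each database `name` to its `display_name`, falling back to `name`."""
    return {
        config["name"]: config.get("display_name") or config["name"]
        for config in configs
    }


# Usage with a decoded response from the endpoint:
#   names = display_names(configs)
#   request_databases = sorted(names)  # identifiers accepted in `databases`
```

The deprecated fields are ignored because they always contain N/A.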

Get MMseqs2 Version#

Endpoint path: /biology/colabfold/msa-search/mmseqs2/version

Request type: GET

Input Parameters#

None.

Outputs#

mmseqs2_version (string): The version string of the MMseqs2 installation, typically a git commit hash. Use this value when building custom database indices; indices must be created with the same MMseqs2 version as the NIM for compatibility.

Health Endpoints#

Readiness Check#

Endpoint path: /v1/health/ready

Request type: GET

Description: Checks if the service is ready to handle requests.

Outputs#

Status code 200: Service is ready
Status code 503: Service is not ready

Response includes a JSON object with:

message (string): Status message
object (string): Always “health.response”
status (string, optional): Status string for backwards compatibility

Liveness Check#

Endpoint path: /v1/health/live

Request type: GET

Description: Checks if the service is live (running).

Outputs#

Status code 200: Service is live
Status code 503: Service is not live

Response format is the same as the readiness check.
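
The readiness endpoint is typically polled after starting the container, before sending search requests. The following is a minimal polling sketch using the Python standard library. The URL assumes the NIM is reachable at localhost:8000, the attempt count and delay are arbitrary illustrative values, and the function names are not part of the NIM.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen

READY_URL = "http://localhost:8000/v1/health/ready"


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if the server cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # for example 503 while the service is not ready
    except OSError:
        return 0  # connection refused, timeout, DNS failure, and similar


def wait_until_ready(
    url: str = READY_URL,
    attempts: int = 60,
    delay: float = 5.0,
    probe_fn=probe,
    sleep_fn=time.sleep,
) -> bool:
    """Poll until the endpoint returns 200; return False if attempts run out."""
    for attempt in range(attempts):
        if probe_fn(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep_fn(delay)
    return False
```

The probe_fn and sleep_fn parameters exist so the loop can be exercised without a running server. The same function can poll the liveness endpoint by passing its URL.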

Returns metrics in Prometheus format.