Task-Specific Profiles and Custom Databases#

You can select which databases the MSA Search NIM loads in two ways: with a task-specific profile or with custom databases.

Task-Specific Profiles#

The MSA Search NIM includes task-specific profiles that download only the databases required for specific use cases. This reduces storage requirements and startup time.

Listing Available Profiles#

To list available profiles:

docker run -it --entrypoint list-model-profiles nvcr.io/nim/colabfold/msa-search:2

Example output:

MODEL PROFILES
- Compatible with system:
    - 029cf32ae4c55071a6164ae4d0db24d2de89a4c8629449c3da8f4dd737df2662 - databases:uniref30
    - a9e1283808eab445d31ae9a713eadba42fc015a100c57a1d4f4ecefc47f7cfc7 - databases:all
    - ad5086cc67393792e71fa57444f13eaff8425658e8fb5feea07070ca3b2d34bb - databases:pdb70
    - d162e718a3ae5835c73cf3bff52166b3e095e29c7fd30605debf8280159beab2 - databases:uniref30,pdb70,pdb
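Because each profile line has a regular shape (a hash followed by a tag list), the hash for a given tag can be looked up in a script rather than copied by hand. The following is a minimal sketch that assumes the output format matches the example above; `profile_hash_for_tag` is a hypothetical helper name, not part of the NIM.

```shell
# Hypothetical helper: read list-model-profiles output on stdin and print the
# hash of the profile whose tag list is exactly "$1".
# Assumes each profile line looks like: "- <hash> - <tags>".
profile_hash_for_tag() {
  awk -v tag="$1" '$1 == "-" && $3 == "-" && $4 == tag { print $2 }'
}

# Example (not run here; -it is omitted so the piped output has no TTY line endings):
#   docker run --rm --entrypoint list-model-profiles \
#     nvcr.io/nim/colabfold/msa-search:2 | profile_hash_for_tag databases:uniref30
```

The match is exact, so `databases:uniref30` does not also select `databases:uniref30,pdb70,pdb`.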

Selecting a Profile#

To select a specific profile, use the NIM_MODEL_PROFILE environment variable with the profile hash:

export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=029cf32ae4c55071a6164ae4d0db24d2de89a4c8629449c3da8f4dd737df2662 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/colabfold/msa-search:2

To verify the selected profile after the service starts:

curl localhost:8000/v1/metadata | jq

Available Profiles#

| Profile Tags | Databases | Best For | Storage |
| --- | --- | --- | --- |
| databases:pdb70 | PDB70 | Minimal profile with PDB sequences clustered at 70% identity. Useful for quick testing. | ~100 MB |
| databases:uniref30 | UniRef30 | Paired MSA search for protein complexes. UniRef30 is the only database used for species-based sequence pairing. | ~500 GB |
| databases:uniref30,pdb70,pdb | UniRef30, PDB70, PDB structures | Structural template search. Includes databases for profile building (UniRef30), template search (PDB70), and mmCIF structure files (PDB). | ~700 GB |
| databases:all | UniRef30, ColabFold envdb, PDB70, PDB100, PDB structures | Full MSA search with all available databases. Provides maximum sensitivity and coverage. | ~1.2 TB |
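Since the larger profiles need hundreds of gigabytes, it can help to check free space in the cache location before starting a download. The sketch below is one possible pre-flight check; `has_free_space` is a hypothetical helper, and the sizes passed to it are the approximate figures from the table above, not exact requirements.

```shell
# Hypothetical helper: succeed when the filesystem holding directory "$1"
# has at least "$2" GB available (1 GB counted as 1024 * 1024 KB).
has_free_space() {
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  need_kb=$(( $2 * 1024 * 1024 ))
  [ "$avail_kb" -ge "$need_kb" ]
}

# Example (not run here), using the ~500 GB estimate for databases:uniref30:
#   has_free_space "$LOCAL_NIM_CACHE" 500 || echo "not enough free space" >&2
```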

Note

Profile hashes may change between NIM releases. Always use list-model-profiles to get the current hashes for your version.

Custom Databases#

You can use custom or manually downloaded MMseqs2 databases with the MSA Search NIM by mounting them into the container and setting the NIM_MODEL_NAME environment variable.

Note

Custom databases completely replace the default databases. When you set NIM_MODEL_NAME, the NIM uses only the databases found in that directory; it does not combine custom databases with those from the default model profile. If you need multiple databases, download them individually from NGC and mount them all under the same NIM_MODEL_NAME directory.

Prerequisites#

Before proceeding, ensure the following prerequisites are met:

  • Your databases are in MMseqs2 format, created using mmseqs createdb or compatible tools.

  • For GPU Server support (enabled by default), databases must be indexed with mmseqs makepaddedseqdb. Databases downloaded from NGC are already indexed and ready for GPU Server use.

Specifying Custom Databases#

Mount your custom database directory into the Docker container, and set the NIM_MODEL_NAME environment variable to the path of that mounted directory inside the container.

Example: Loading a Manually Downloaded Database#

You can test this process by manually downloading one of the MSA Search databases and instructing NIM to load it as a custom database. This example uses PDB70, the smallest database, suitable for quick testing.

  1. Set the required variables and paths:

DATABASE=pdb70_220313-m18v1
DATABASE_PATH="$PWD/msa-search_v$DATABASE/pdb70_220313"

  2. Download the database:

ngc registry model download-version nim/colabfold/msa-search:$DATABASE

  3. Run the NIM with the appropriate mount points and environment variables:

docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  -e NIM_MODEL_NAME=/databases \
  -v "$DATABASE_PATH":/databases/pdb70_220313:ro \
  -p 8000:8000 \
  nvcr.io/nim/colabfold/msa-search:2

  4. Test the service with a simple query:

curl http://localhost:8000/biology/colabfold/msa-search/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"}'

Note

If your custom databases are not indexed for GPU Server, see Indexing Databases for GPU Server for instructions on how to create the required index.

Tip

To mount multiple databases, add one -v flag per database directory, with every target path under the NIM_MODEL_NAME directory: -v "$DB1_PATH":/databases/db1:ro -v "$DB2_PATH":/databases/db2:ro
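When many databases live under one host directory, the -v flags can be generated instead of typed out. The following sketch assumes every immediate subdirectory of the given host directory is a database to mount; `run_with_all_dbs` and the `DRY_RUN` variable are hypothetical names introduced for illustration, and the docker options simply mirror the example above.

```shell
# Hypothetical helper: mount every subdirectory of "$1" read-only under
# /databases and start the NIM. With DRY_RUN=1 the command is printed instead.
run_with_all_dbs() {
  host_root=$1
  set -- docker run -it --rm --runtime=nvidia --gpus all \
    -e NIM_MODEL_NAME=/databases
  for dir in "$host_root"/*/; do
    [ -d "$dir" ] || continue          # skip the unexpanded glob when empty
    dir=${dir%/}
    set -- "$@" -v "$dir:/databases/$(basename "$dir"):ro"
  done
  set -- "$@" -p 8000:8000 nvcr.io/nim/colabfold/msa-search:2
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Example (not run here):
#   DRY_RUN=1 run_with_all_dbs "$HOME/msa-databases"
```

Printing the command first makes it easy to confirm the mount targets before the container starts.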

Available Databases#

The following databases can be downloaded individually from NGC:

| Database | NGC Model Version | Description |
| --- | --- | --- |
| uniref30_2302 | uniref30_2302-m18v1 | UniRef30 clustered sequences for profile building |
| colabfold_envdb_202108 | colabfold_envdb_202108-m18v1 | ColabFold environmental database |
| pdb70_220313 | pdb70_220313-m18v1 | PDB sequences clustered at 70% identity for template search |
| pdb100_230517 | pdb100_230517-m18v1 | PDB sequences clustered at 100% identity |
| pdb_20251028 | pdb_20251028_zip-m18v1 | mmCIF structure files for template search |
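To fetch several of these databases in one go, the download command from the PDB70 example can be wrapped in a loop over NGC model versions. This is a minimal sketch; `download_databases` is a hypothetical helper, and it assumes the ngc CLI is installed and already configured with your API key.

```shell
# Hypothetical helper: download each NGC model version given as an argument,
# using the same command as the PDB70 example.
download_databases() {
  for version in "$@"; do
    ngc registry model download-version "nim/colabfold/msa-search:$version" || return 1
  done
}

# Example (not run here):
#   download_databases uniref30_2302-m18v1 pdb70_220313-m18v1
```

The loop stops at the first failed download so that a partial set of databases is not mounted by mistake.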

Database Directory Structure#

Custom databases should be organized with each database in its own subdirectory under NIM_MODEL_NAME. The NIM automatically discovers databases by scanning for MMseqs2 database files. For example:

/databases/
├── pdb70_220313/
│   ├── pdb70_220313.dbtype
│   ├── pdb70_220313.idx
│   └── ...
└── other_database/
    └── ...

Database Discovery#

The NIM automatically discovers databases at startup using glob patterns:

Sequence databases: The NIM scans for **/*.idx files (MMseqs2 index files). Each .idx file found registers a searchable database. The database name is derived from the parent directory name (for example, pdb70_220313/pdb70_220313.idx registers as pdb70_220313).

PDB structure archive: For structural template search, the NIM looks for a **/pdb*.zip file containing mmCIF structure files. This archive is used to retrieve 3D structures for template hits.
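Before starting the container, you can preview on the host what these two patterns would be expected to match. The sketch below approximates the default patterns with find; it does not run the NIM's own discovery code, and `list_discoverable_dbs` and `list_pdb_archives` are hypothetical helper names.

```shell
# Hypothetical helper: list the database names that the default **/*.idx
# pattern would be expected to register under directory "$1".
list_discoverable_dbs() {
  find "$1" -type f -name '*.idx' | while IFS= read -r idx; do
    basename "$(dirname "$idx")"   # the name comes from the parent directory
  done | sort -u
}

# Hypothetical helper: list candidate PDB structure archives (**/pdb*.zip).
list_pdb_archives() {
  find "$1" -type f -name 'pdb*.zip' | sort
}

# Example (not run here):
#   list_discoverable_dbs /path/to/databases
#   list_pdb_archives /path/to/databases
```

An empty result from the first helper suggests the directory holds no indexed sequence databases in the expected layout.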

Note

The discovery patterns can be customized via environment variables:

  • NIM_MSA_DB_INDEX_PATTERN: Pattern for sequence database discovery (default: **/*.idx)

  • NIM_MSA_PDB_CIFS_ZIP: Pattern for PDB structure archive (default: **/pdb*.zip)

For additional configuration options and environment variables, refer to the Configuration Guide.