Evo 2 NIM endpoints#

The NIM provides endpoints that generate DNA sequences, run a model forward pass and save layer outputs, and perform readiness checks. The input and output parameters of each endpoint correspond to the properties of the JSON object that the endpoint receives or returns.

Generate DNA sequences#

Endpoint path: /biology/arc/evo2/generate

Input parameters#

  • sequence (string): Required. The input DNA sequence.

  • num_tokens (integer, null): Optional (default: 100). Number of tokens to be generated.

  • temperature (number, null): Optional (default: 0.7). Scale of randomness in the temperature sampling process. Values lower than 1.0 produce a sharper distribution, which is less random; values higher than 1.0 produce a flatter distribution, which is more random.

  • top_k (integer, null): Optional (default: 3). Specifies the number of highest-probability tokens to consider. When set to 1, only the single most likely token is selected. Higher values make sampling more diverse. If set to 0, all tokens are considered.

  • top_p (number, null): Optional (default: 1.0). The top-p threshold, between 0 and 1, that enables nucleus sampling. Sampling is restricted to the smallest set of highest-probability tokens whose cumulative probability exceeds top_p; all remaining tokens are filtered out. Setting this to 0.0 disables top-p sampling.

  • random_seed (integer, null): Optional. Makes the Evo 2 model deterministic: the same input DNA and a fixed seed always produce the same output. This argument should only be used for development purposes.

  • enable_logits (boolean): Optional (default: False). Enables or disables logits reporting in the output response.

  • enable_sampled_probs (boolean): Optional (default: False). Enables or disables the reporting of sampled token probabilities. When enabled, the response includes a list of probability values, between 0 and 1, corresponding to each token in the output sequence. These probabilities represent the model’s confidence in each token selection during the generation process. The resulting list has the same length as the output sequence, which provides insight into the model’s decision-making at each generation step.

  • enable_elapsed_ms_per_token (boolean): Optional (default: False). Enables or disables the reporting of per-token timing statistics, which is used for benchmarking.

Outputs#

  • sequence (string): Required. This output contains the generated DNA sequence.

  • logits (array, null): Optional. The logits in a [num_tokens, 512] array, returned when enable_logits is set in the input.

  • sampled_probs (array, null): Optional. A list of probabilities corresponding to each token in the generated output sequence. Each value ranges from 0 to 1, representing the model’s confidence in selecting that token during the generation process. The list length matches the output sequence length, providing insight into the model’s decision-making at each generation step. To receive this output, enable_sampled_probs must be set to True.

  • elapsed_ms (integer): Required. The elapsed time in milliseconds on the server side.

  • elapsed_ms_per_token (array, null): Optional. The elapsed time in milliseconds on the server side for each generated token, returned when enable_elapsed_ms_per_token is set in the input.
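As a sketch of a client call, the request body can be assembled as below and POSTed to the endpoint. The base URL is an assumption about a typical local NIM deployment (adjust the host and port to your instance); the helper function and example sequence are illustrative, not part of the service.

```python
import json

# Assumed local deployment URL; adjust host/port to your NIM instance.
EVO2_GENERATE_URL = "http://localhost:8000/biology/arc/evo2/generate"


def build_generate_payload(sequence, num_tokens=100, temperature=0.7,
                           top_k=3, top_p=1.0, random_seed=None,
                           enable_sampled_probs=False):
    """Assemble the JSON body for the generate endpoint.

    random_seed is omitted when not set, so the server stays non-deterministic
    by default.
    """
    payload = {
        "sequence": sequence,
        "num_tokens": num_tokens,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "enable_sampled_probs": enable_sampled_probs,
    }
    if random_seed is not None:
        payload["random_seed"] = random_seed
    return payload


payload = build_generate_payload("ACTGACTGACTG", num_tokens=50, random_seed=42)
body = json.dumps(payload)

# Sending the request requires a running NIM, e.g. with the requests library:
#   import requests
#   response = requests.post(EVO2_GENERATE_URL, json=payload)
#   response.raise_for_status()
#   generated = response.json()["sequence"]
```

Fixing random_seed as above makes repeated calls reproducible, which is useful while developing a pipeline.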

Run model forward pass and save layer outputs#

Endpoint path: /biology/arc/evo2/forward

Input parameters#

  • sequence (string): Required. The input DNA sequence.

  • output_layers (array): Required. List of layer names from which to capture and save output tensors. The following layers are available for the output_layers parameter:

    • embedding_layer: Input token embeddings (typically layer 0). Note: This refers to the static token embeddings before any model computation. This layer is rarely useful for downstream analysis. Instead, it is recommended to use the output of an intermediate block or the final MLP layer of a block (as listed below), since these actually capture context-dependent features.

    • blocks.[n].filter: Output of the filter submodule in block [n]. Note: Only available in certain blocks:

      • For 7B models: All blocks except 3, 10, 17, 24, 31

      • For 40B models: All blocks except 3, 10, 17, 24, 31, 35, 42, 49

    • blocks.[n].mlp.l{1,2,3}: First, second, and third MLP layers in block [n]

    • blocks.[n].inner_mha_cls: Multi-head attention output in block [n]. Note: Only available in certain blocks:

      • For 7B models: blocks 3, 10, 17, 24, 31

      • For 40B models: blocks 3, 10, 17, 24, 31, 35, 42, 49

    • blocks.[n].norm: Layer normalization in block [n]

    • norm: Final model layer normalization

    • unembed: Final output/logits of shape [batch_size, seq_len, 512], where 512 is the padded vocabulary size of the tokenizer.

    Where [n] is the block index:

    • For 7B models: 0 to 31

    • For 40B models: 0 to 49

    The multi-head attention blocks (inner_mha_cls) have the following submodules:

    • Wqkv: Query, key, and value projection weights combined

    • inner_attn: The self-attention mechanism

    • inner_cross_attn: The cross-attention mechanism

    • out_proj: Output projection layer

    • rotary_emb: Rotary positional embeddings applied to queries and keys

    For example: ["unembed", "blocks.20.mlp.l3"].

Outputs#

  • data (string): Required. The tensors of the requested layers, as a Base64-encoded NumPy Zipped (NPZ) archive.

  • elapsed_ms (integer): Required. The elapsed time in milliseconds on the server side.
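To illustrate handling the response, the Base64-encoded data field can be decoded and loaded with NumPy as sketched below. The round-trip demo builds a synthetic payload standing in for a real server response; the [1, 8, 512] shape follows the unembed shape described above with an assumed batch size of 1 and sequence length of 8.

```python
import base64
import io

import numpy as np


def decode_forward_data(b64_data):
    """Decode the Base64-encoded NPZ blob from the `data` output field into a
    dict mapping layer names to NumPy arrays."""
    raw = base64.b64decode(b64_data)
    with np.load(io.BytesIO(raw)) as npz:
        return {name: npz[name] for name in npz.files}


# Synthetic stand-in for a response to a request with
# output_layers = ["unembed"]; a real call would read this string
# from the endpoint's JSON response instead.
buf = io.BytesIO()
np.savez(buf, unembed=np.zeros((1, 8, 512), dtype=np.float32))
fake_response_data = base64.b64encode(buf.getvalue()).decode("ascii")

tensors = decode_forward_data(fake_response_data)
```

Each requested layer name becomes a key in the returned dict, so the same helper works regardless of which layers were listed in output_layers.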

Readiness check#

Endpoint path: /v1/health/ready

Input parameters#

None.

Outputs#

The endpoint returns a JSON response that indicates the readiness of the microservice. When the NIM is ready, it returns the response {"status":"ready"}.
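A readiness poll can be sketched as follows. The base URL is an assumption about a typical local NIM deployment, and the helper names are illustrative; the response parsing matches the JSON body shown above.

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

# Assumed local deployment URL; adjust host/port to your NIM instance.
READY_URL = "http://localhost:8000/v1/health/ready"


def is_ready(body):
    """Return True if a readiness response body reports {"status": "ready"}."""
    try:
        return json.loads(body).get("status") == "ready"
    except (ValueError, AttributeError):
        return False


def wait_until_ready(url=READY_URL, attempts=30, delay_s=10):
    """Poll the readiness endpoint until the NIM reports ready.

    Requires a running NIM; returns False if it never becomes ready.
    """
    for _ in range(attempts):
        try:
            with urlopen(url, timeout=5) as resp:
                if is_ready(resp.read().decode("utf-8")):
                    return True
        except URLError:
            pass  # service not reachable yet
        time.sleep(delay_s)
    return False


# is_ready('{"status":"ready"}')  -> True
```

Polling like this is useful in startup scripts that must not send generate or forward requests before the model has finished loading.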