Benchmarking with Evo 2 NIM#

Accuracy#

Evo 2 performs a variety of downstream tasks related to DNA sequence generation. These tasks involve generating new DNA sequences or conducting forward passes through the model while capturing specific layer outputs or embeddings. Embeddings are representations of data the model learns at certain layers.

A sequence identity benchmark verifies that the model loads correctly and functions as expected. This benchmark evaluates Evo 2’s ability to restore DNA sequences by using a set of highly conserved DNA sequences. Highly conserved sequences are segments of DNA that remain relatively unchanged across species due to their essential biological functions.

The benchmarking process splits these DNA sequences at their midpoint and prompts the Evo 2 model to predict 100 nucleotides of the sequence. The model’s performance is assessed by calculating the percentage of nucleotides it predicts correctly, which is reported as sequence identity. For verification to be successful, the mean sequence identity should exceed 80%.

Performance#

Note

In this section, we use tokens, nucleotides, and the nucleotide abbreviation “nt” interchangeably.

Evo 2’s performance primarily depends on two factors:

Context length: The length of the input DNA sequence.
Number of tokens: The quantity of nucleotides Evo 2 is tasked to generate.

Separate measurements are conducted for various sequence lengths to report the performance metric as nucleotides (nt) per second at each given length. The aggregate (mean) nucleotides per second metric (throughput) for a sequence length of 8192 input nucleotides should exceed the following values for the 40B model and 7B model:

Model	GPU	GPU Memory (GB)	Precision	# of GPUs	PST^	Throughput
40B	H100	80	Mixed	2	512 nt	26 nt/sec
40B	H200	141	Mixed	1	4096 nt	33 nt/sec
40B	H200	141	Mixed	2	8192 nt	33 nt/sec
7B	RTX 6000 Ada	80	Mixed	1	3072 nt	40 nt/sec
7B	L40S	80	Mixed	1	3072 nt	40 nt/sec
7B	L40S	80	Mixed	2	4096 nt	40 nt/sec
7B	H100	80	Mixed	1	8192 nt	45 nt/sec
7B	H200	141	Mixed	1	8192 nt	52 nt/sec

Note

^ PST (Prompt Segmentation Threshold)

This feature helps manage memory usage for very long prompts. When the input prompt length exceeds this threshold, the generation process is split into three phases:

One large forward pass of input tokens up to the threshold value.
The rest of the prompt that exceed the threshold are processed token-by-token without sampling. This operation executes at the token generation speed (throughput) as shown.
Regular generation, where after the input prompt is fully processed, normal token generation with sampling resumes.

Note

Mixed precision means that the model features layers with FP32, BF16 and FP8 floating-point precision tensors.

Sample benchmarking scripts#

This NIM includes benchmarking script that can measure both accuracy and performance.

The script is packaged in the NIM’s docker image. To view and study the benchmark, run the following command:

docker run --entrypoint cat nvcr.io/nim/arc/evo2:2 /opt/nim/benchmarking.py

To execute the benchmark:

Make sure NIM is running as described in Quickstart Guide.
Execute the benchmark by running the following command:

docker run -it --net host --entrypoint "" \
    nvcr.io/nim/arc/evo2:2 \
    /opt/nim/benchmarking.py --benchmark-type both