Observability

BMD-Specific Metrics

NIM for BMD exposes operational metrics through the GET /v1/status endpoint. These metrics provide visibility into the molecular dynamics simulation workload.

Aggregate Metrics

Aggregate metrics track the combined molecular dynamics tasks, API requests, and overall queue depths across all active GPU workers:

| Metric | Type | Description |
|---|---|---|
| tasks_received | counter | Total molecular dynamics simulation tasks received |
| tasks_completed | counter | Total tasks completed |
| queue_size_atoms | gauge | Current number of atoms queued for processing |
| queue_size_tasks | gauge | Current number of tasks waiting in the queue |
| requests_received | counter | Total API requests received |
| requests_completed | counter | Total API requests completed |
| requests_in_progress | gauge | API requests currently being processed |
| max_system_size | gauge | Maximum atoms per batch for the current GPU configuration |
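The counters and gauges above can be combined into simple backlog figures. The following Python sketch assumes the aggregate metrics are available as a flat mapping keyed by the metric names in the table; the actual response layout is defined in the API Reference, and the sample values are invented for illustration. How failed tasks affect tasks_completed is not stated here, so treat the differences as approximate.

```python
from typing import Mapping


def summarize_aggregate(metrics: Mapping[str, float]) -> dict:
    """Derive backlog figures from the aggregate counters and gauges."""
    queued_tasks = metrics["queue_size_tasks"]
    return {
        # Work received but not yet reported as completed.
        "tasks_outstanding": metrics["tasks_received"] - metrics["tasks_completed"],
        "requests_outstanding": metrics["requests_received"] - metrics["requests_completed"],
        # Average atoms per queued task; guard against an empty queue.
        "avg_atoms_per_queued_task": (
            metrics["queue_size_atoms"] / queued_tasks if queued_tasks else 0.0
        ),
    }


# Hypothetical values for illustration only, not real server output.
sample_aggregate = {
    "tasks_received": 120,
    "tasks_completed": 100,
    "queue_size_atoms": 9000,
    "queue_size_tasks": 18,
    "requests_received": 40,
    "requests_completed": 37,
    "requests_in_progress": 3,
    "max_system_size": 50000,
}

print(summarize_aggregate(sample_aggregate))
```

A steadily growing tasks_outstanding value together with a rising queue_size_tasks gauge suggests that tasks arrive faster than the workers complete them.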

Per-Worker Metrics

Each GPU worker also reports its own metrics:

| Metric | Type | Description |
|---|---|---|
| device | string | GPU device identifier (for example, cuda:0) |
| tasks_received | counter | Tasks received by this worker |
| tasks_completed | counter | Tasks completed by this worker |
| queue_size_atoms | gauge | Atoms queued on this worker |
| queue_size_tasks | gauge | Tasks queued on this worker |
| batch_size | gauge | Current batch size |
| max_batch_size | gauge | Maximum batch size for this GPU |
| gpu_model | string | GPU model name |
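Per-worker metrics are useful for spotting an uneven load across GPUs. The following Python sketch assumes the workers are available as a list of mappings keyed by the metric names in the table, and that batch_size and max_batch_size are expressed in the same unit; both are assumptions, and the sample entries (including the gpu_model strings) are invented.

```python
from typing import Mapping, Optional, Sequence


def busiest_worker(workers: Sequence[Mapping]) -> Optional[Mapping]:
    """Return the worker with the most queued atoms, or None if there are no workers."""
    return max(workers, key=lambda w: w["queue_size_atoms"], default=None)


def batch_utilization(worker: Mapping) -> float:
    """Fraction of the worker's maximum batch size that is currently in use."""
    limit = worker["max_batch_size"]
    return worker["batch_size"] / limit if limit else 0.0


# Hypothetical worker entries for illustration only, not real server output.
sample_workers = [
    {"device": "cuda:0", "queue_size_atoms": 6000, "queue_size_tasks": 12,
     "batch_size": 8, "max_batch_size": 16, "gpu_model": "example-gpu"},
    {"device": "cuda:1", "queue_size_atoms": 3000, "queue_size_tasks": 6,
     "batch_size": 4, "max_batch_size": 16, "gpu_model": "example-gpu"},
]

top = busiest_worker(sample_workers)
print(top["device"], batch_utilization(top))
```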

Query Status

Query the status endpoint to retrieve metrics:

curl -s http://localhost:8000/v1/status | python3 -m json.tool

For the full response schema and an example, refer to the GET /v1/status endpoint in the API Reference.
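The same query can be made from a script. The following Python sketch uses only the standard library; fetch_status and its opener parameter are illustrative names, not part of NIM for BMD. The opener parameter exists so the function can be exercised without a running server, and the stub body below is a made-up fragment rather than the real response.

```python
import io
import json
import urllib.request


def fetch_status(base_url="http://localhost:8000",
                 opener=urllib.request.urlopen, timeout=10.0):
    """GET <base_url>/v1/status and return the decoded JSON body."""
    url = f"{base_url.rstrip('/')}/v1/status"
    with opener(url, timeout=timeout) as response:
        return json.load(response)


def _stub_opener(url, timeout=None):
    # Offline stand-in for urlopen; the body is invented, not real output.
    return io.BytesIO(b'{"tasks_received": 3, "tasks_completed": 2}')


status = fetch_status(opener=_stub_opener)
print(json.dumps(status, indent=4, sort_keys=True))
```

Against a running service, call fetch_status() with no arguments to use the default base URL and the real urlopen.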

Prometheus

NIM for BMD exposes Prometheus metrics for request statistics at http://localhost:8000/v1/metrics.
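Before configuring a scraper, it can help to read the endpoint output directly. The following Python sketch parses the Prometheus text exposition format into a dictionary. It is a simplified reader that ignores HELP and TYPE comments and any trailing timestamps, and the metric names in the sample are hypothetical, not the names NIM for BMD actually emits.

```python
def parse_metrics_text(text: str) -> dict:
    """Map each sample (metric name plus optional labels) to its float value."""
    samples = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        if "}" in line:
            # The label set ends at the last closing brace.
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:]
        else:
            key, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            # The first field is the value; an optional timestamp may follow.
            samples[key] = float(fields[0])
    return samples


# Hypothetical exposition text for illustration only.
sample_text = """\
# HELP http_requests_total Hypothetical request counter.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/v1/status"} 12
process_uptime_seconds 34.5
"""

for name, value in parse_metrics_text(sample_text).items():
    print(name, value)
```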

To install Prometheus and scrape metrics:

  1. Download a Prometheus release for your system. The following commands use v2.52.0 for Linux amd64; substitute a newer version if one is available:

    $ wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
    $ tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
    $ cd prometheus-2.52.0.linux-amd64/
    
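  2. Edit the prometheus.yml file to scrape the NIM for BMD endpoint. Set metrics_path, because the metrics are served at /v1/metrics rather than at the Prometheus default of /metrics:

    scrape_configs:
      - job_name: "nim-bmd"
        metrics_path: "/v1/metrics"
        static_configs:
          - targets: ["localhost:8000"]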
    
  3. Start the Prometheus server:

    $ ./prometheus --config.file=./prometheus.yml
    
  4. Verify the target by navigating to http://localhost:9090/targets and confirming that the nim-bmd job is listed with the state UP.
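The target check can also be scripted with the Prometheus HTTP API, which evaluates an instant query at /api/v1/query. The following Python sketch builds a query for the built-in up metric of the nim-bmd job and interprets a response; the helper names are illustrative, and the sample payload is hand-written in the shape Prometheus returns for an instant vector.

```python
import urllib.parse


def build_up_query_url(prometheus_url="http://localhost:9090", job="nim-bmd"):
    """Build an instant-query URL for the `up` metric of one scrape job."""
    query = f'up{{job="{job}"}}'
    params = urllib.parse.urlencode({"query": query})
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{params}"


def all_targets_up(payload: dict) -> bool:
    """True if the query returned at least one series and every series has value 1."""
    results = payload.get("data", {}).get("result", [])
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return bool(results) and all(r["value"][1] == "1" for r in results)


# Hand-written sample response for illustration only.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "nim-bmd",
                        "instance": "localhost:8000"},
             "value": [1700000000.0, "1"]},
        ],
    },
}

print(build_up_query_url())
print(all_targets_up(sample_response))
```

An empty result list is treated as not up, because it usually means that the job name does not match any configured scrape job.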

Grafana

Visualize metrics with Grafana.

  1. Download a Grafana release for your system. The following commands use v11.0.0 for Linux amd64; substitute a newer version if one is available:

    $ wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
    $ tar -zxvf grafana-11.0.0.linux-amd64.tar.gz
    $ cd grafana-v11.0.0/
    
  2. Start the Grafana server:

    $ ./bin/grafana-server
    
  3. Open http://localhost:3000 in a browser and log in with the default credentials:

    • Username: admin

    • Password: admin

  4. Configure the Prometheus data source:

    • Click Data Source.

    • Select Prometheus.

    • Set the URL to http://localhost:9090.

    • Save the configuration.
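As an alternative to the UI steps, the data source can be created through the Grafana HTTP API with a POST to /api/datasources. The following Python sketch only builds the request so that it can be inspected before sending. It assumes the default admin credentials are still in place, and the data source name "Prometheus" is an arbitrary choice.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url="http://localhost:3000",
                             prometheus_url="http://localhost:9090",
                             user="admin", password="admin"):
    """Build (but do not send) a request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",   # Arbitrary display name.
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",      # Grafana's backend issues the queries.
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_datasource_request()
print(request.get_method(), request.full_url)
print(request.data.decode())
```

To send the request against a running Grafana server, pass it to urllib.request.urlopen(request). If the admin password has been changed, pass the new value through the password argument.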