Overview of the Clara Monitoring Platform

The Clara Monitoring Platform provides GPU, CPU, and disk metrics for executing jobs. For GPU-related metrics, the library extends the NVIDIA Runtime System Utils (NRSU), an existing TensorRT library. For CPU and disk metrics, it uses an open-source Python library. The table below highlights the metrics provided by each library. This feature is currently supported only with Argo-based pipelines (0.3.0).

Getting Started

# Navigate to the clara-reference-app and identify a pipeline with monitoring (like livertumor)
# Replace REPLACE-IP-ADDRESS with the IP address of the host (where the Clara pipeline will be run)
# Create a pipeline from the YAML file using the CLI

sed -i 's/REPLACE-IP-ADDRESS/<host_ip_address>/' clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml
clara create pipeline -p clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml

# Obtain the cluster IP address of Elasticsearch
# Use the create-mapping file to create the required metrics mapping (ensure python3 and elasticsearch-py are installed)
# Start the monitoring server
# Use the CLI to trigger a job with the pipeline created above

kubectl get svc | grep elasticsearch
python3 metrics/create-mapping.py --ip <elasticsearch_cluster_ip> --port 9200
./start_server.sh -i <monitoring interval> -d <elasticsearch_cluster_ip> --port 9200
clara create jobs -p <pipeline_id> -n <job_name> -f <dataset_path>
clara start job -j <JOB ID>
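
For hosts where editing the file with sed is inconvenient, the placeholder substitution step can also be sketched in Python. This is a minimal illustration, not part of the Clara tooling: the function name `fill_host_ip` is hypothetical, and the only detail taken from the pipeline file is the `REPLACE-IP-ADDRESS` placeholder string.

```python
PLACEHOLDER = "REPLACE-IP-ADDRESS"  # placeholder used in liver-tumor-monitoring.yaml


def fill_host_ip(pipeline_text: str, host_ip: str) -> str:
    """Return the pipeline text with every placeholder replaced by host_ip.

    Hypothetical helper: it fails loudly when the placeholder is absent,
    so a pipeline without monitoring stages is not silently accepted.
    """
    if PLACEHOLDER not in pipeline_text:
        raise ValueError(f"{PLACEHOLDER} not found in pipeline definition")
    return pipeline_text.replace(PLACEHOLDER, host_ip)


# Example with a single line taken from the monitor-start stage.
sample = 'command: ["python3", "client.py","--start", "-i", "REPLACE-IP-ADDRESS"]'
print(fill_host_ip(sample, "10.0.0.5"))
```

To apply it to the real file, read the YAML as text, pass it through the function, and write the result back before running `clara create pipeline`.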

Viewing Results in Grafana

The metrics datasource should already be configured in Grafana. Follow these steps to create a new dashboard to view the results:

1. Click the + icon in the left-hand navbar and select Dashboard. An empty panel appears with the option to Add Query.
2. Select the Add Query option and change the queried datastore from default to gpumetrics.
3. In the query block (Block A), change the metric from count to average and select gpu_utilization__pct.
4. You might need to adjust the time window, because the data is stored in UTC. In the top right, select the clock icon labeled Last 6 hours and choose Custom time range. Change the From block to now-24h and the To block to now+24h, then zoom as needed to view the data.

You can add new panels to visualize different metrics. To view per-GPU metrics, add index:<GPU_NUMBER> in the Lucene query block (GPU_NUMBER is the index of the GPU on the machine). To visualize CPU-related metrics, change the queried datastore to cpumetrics.
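
The panel settings above correspond roughly to an Elasticsearch request that filters with a Lucene query string and averages one metric. The sketch below only builds such a request body as a Python dictionary. It is an approximation for illustration, not the query Grafana actually issues: the function name `build_gpu_query` is hypothetical, and the timestamp field name is not given in this document, so it is left as an optional parameter.

```python
import json


def build_gpu_query(gpu_index=None, metric="gpu_utilization__pct", time_field=None):
    """Build an Elasticsearch search body that averages one metric.

    gpu_index  -- GPU number for the Lucene filter index:<GPU_NUMBER>, or None for all GPUs
    time_field -- name of the timestamp field (assumption: not stated in the docs);
                  when given, the window now-24h .. now+24h is applied
    """
    filters = []
    if gpu_index is not None:
        # Same Lucene expression that goes into the Grafana query block.
        filters.append({"query_string": {"query": f"index:{gpu_index}"}})
    if time_field is not None:
        filters.append({"range": {time_field: {"gte": "now-24h", "lte": "now+24h"}}})
    return {
        "size": 0,  # only the aggregation result is needed, not the raw documents
        "query": {"bool": {"filter": filters}},
        "aggs": {"avg_metric": {"avg": {"field": metric}}},
    }


print(json.dumps(build_gpu_query(gpu_index=0), indent=2))
```

A body like this could be sent to the search endpoint of the metrics index on the Elasticsearch service, assuming the index name matches the gpumetrics datastore.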

Creating New Pipelines

To create custom pipelines with monitoring, you need to update the pipeline definition .yaml file to contain the monitor-start and monitor-stop stages. The following examples show two liver-tumor pipelines (located at clara-reference-app/Pipelines/LiverTumorPipeline/), the first without monitoring and the second with monitoring:

# liver-tumor-pipeline.yaml
api-version: 0.4.0
name: liver-tumor-pipeline
operators:
# dicom reader operator
# Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
# Output: '/output' for saving converted volume image in MHD format to file whose name
#         is the same as the DICOM series instance ID.
- name: dicom-reader
  description: Converts DICOM instances into MHD, one file per DICOM series.
  container:
    image: clara/dicom-reader
    tag: latest
  variables:
    NVIDIA_CLARA_DCM_TO_FORMAT: nii.gz
  input:
  - path: /input
  output:
  - path: /output
  ...

# liver-tumor-monitoring.yaml
api-version: 0.3.0
name: liver-tumor-pipeline
operators:
- name: monitor-start
  description: Start the monitoring service
  container:
    image: clara/monitor-server
    tag: latest
    command: ["python3", "client.py","--start", "-i", "REPLACE-IP-ADDRESS"]
  input:
  - path: /input
  output:
  - path: /output
# dicom reader operator
# Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
# Output: '/output' for saving converted volume image in MHD format to file whose name
#         is the same as the DICOM series instance ID.
- name: dicom-reader
  description: Converts DICOM data into MHD.
  container:
    image: clara/dicom-reader
    tag: latest
  input:
  - from: monitor-start
    path: /monitor_start
  - path: /input
  output:
  - path: /output

  ...

- name: register-dicom-output
  description: Register converted DICOM instances with Results Service to be sent to external DICOM devices.
  container:
    image: clara/register-results
    tag: latest
    command: ["python", "register.py", "--agent", "ClaraSCU", "--data", "[\"MYPACS\"]"]
  input:
  - from: dicom-writer
    name: dicom
    path: /input
  output:
  - path: /output
- name: monitor-stop
  description: Stop the monitoring service
  container:
    image: clara/monitor-server
    tag: latest
    command: ["python3", "client.py","--stop", "-i", "REPLACE-IP-ADDRESS"]
  input:
  - from: register-dicom-output
    path: /reg_output
  output:
  - path: /output

Note

The monitor-start and monitor-stop stages can be added at any point in the pipeline definition; they are not limited to the first and last stages.
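
The wiring shown in liver-tumor-monitoring.yaml can be sketched as a transformation on an operator list that has already been loaded into Python dictionaries. This is an illustrative sketch, not a Clara API: the helper names `monitor_stage` and `add_monitoring` and the default mount path `/monitored_output` are hypothetical, while the image, tag, and command values mirror the example above. For simplicity it wraps the first and last operators only.

```python
import copy


def monitor_stage(name, flag, host_ip, upstream=None, mount=None):
    """Build a monitor-start or monitor-stop operator as a dictionary."""
    action = "Start" if flag == "--start" else "Stop"
    stage = {
        "name": name,
        "description": f"{action} the monitoring service",
        "container": {
            "image": "clara/monitor-server",
            "tag": "latest",
            "command": ["python3", "client.py", flag, "-i", host_ip],
        },
        "input": [{"path": "/input"}],
        "output": [{"path": "/output"}],
    }
    if upstream is not None:
        # Depend on an upstream operator so this stage runs after it.
        stage["input"] = [{"from": upstream, "path": mount}]
    return stage


def add_monitoring(operators, host_ip, stop_mount="/monitored_output"):
    """Return a new operator list wrapped in monitor-start and monitor-stop."""
    if not operators:
        raise ValueError("pipeline has no operators")
    ops = copy.deepcopy(operators)  # leave the caller's definition untouched
    # The first operator waits for monitor-start, as dicom-reader does above.
    ops[0].setdefault("input", []).insert(
        0, {"from": "monitor-start", "path": "/monitor_start"}
    )
    start = monitor_stage("monitor-start", "--start", host_ip)
    stop = monitor_stage(
        "monitor-stop", "--stop", host_ip, upstream=ops[-1]["name"], mount=stop_mount
    )
    return [start, *ops, stop]


pipeline = [{"name": "dicom-reader", "input": [{"path": "/input"}], "output": [{"path": "/output"}]}]
print([op["name"] for op in add_monitoring(pipeline, "REPLACE-IP-ADDRESS")])
```

To monitor only part of a pipeline, the same two dictionaries could instead be inserted around any contiguous group of operators, with the `from` entries pointing at the neighboring stages.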