Overview of the Clara Monitoring Platform
The Clara Monitoring Platform provides GPU, CPU, and disk metrics for running jobs. The library extends the NVIDIA Runtime System Utils (NRSU), an existing TensorRT library, for GPU-related metrics, and uses an open-source Python library for CPU and disk metrics. The table below highlights the metrics provided by each library. This feature is currently supported only with Argo-based pipelines (0.3.0).
# Navigate to the clara-reference-app and identify a pipeline with monitoring (like livertumor)
# Replace REPLACE-IP-ADDRESS with the IP address of the host where the Clara pipeline will run
# (-i edits the file in place so that the next command picks up the change)
sed -i 's/REPLACE-IP-ADDRESS/<host_ip_address>/' clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml
# Create a pipeline from the YAML file using the CLI
clara create pipeline -p clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml
# Obtain the cluster IP address of Elasticsearch
kubectl get svc | grep elasticsearch
# Use the create-mapping file to create the required metrics mapping (ensure python3 and elasticsearch-py are installed)
python3 metrics/create-mapping.py --ip <elasticsearch_cluster_ip> --port 9200
# Start the monitoring server
./start_server.sh -i <monitoring_interval> -d <elasticsearch_cluster_ip> --port 9200
# Use the CLI to trigger a job with the pipeline created above
clara create jobs -p <pipeline_id> -n <job_name> -f <dataset_path>
clara start job -j <job_id>
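The placeholder substitution in the first step can also be scripted in Python, which makes it easy to validate the address before touching the pipeline file. The following is a minimal sketch, not part of the Clara tooling: the function names fill_host_ip and fill_host_ip_in_file are hypothetical, and only the REPLACE-IP-ADDRESS placeholder comes from the pipeline definition.

```python
import ipaddress
from pathlib import Path

# Placeholder string used in the monitoring pipeline definition.
PLACEHOLDER = "REPLACE-IP-ADDRESS"


def fill_host_ip(yaml_text: str, host_ip: str) -> str:
    """Return the pipeline YAML text with every placeholder replaced by host_ip.

    Hypothetical helper: raises ValueError if host_ip is not a valid IP
    address or if the placeholder is absent from the text.
    """
    ipaddress.ip_address(host_ip)  # raises ValueError on a malformed address
    if PLACEHOLDER not in yaml_text:
        raise ValueError(f"{PLACEHOLDER} not found; was the file already edited?")
    return yaml_text.replace(PLACEHOLDER, host_ip)


def fill_host_ip_in_file(path: str, host_ip: str) -> None:
    """Rewrite the file at path in place, like the sed step."""
    target = Path(path)
    target.write_text(fill_host_ip(target.read_text(), host_ip))
```

Calling fill_host_ip_in_file with the path to liver-tumor-monitoring.yaml and the host address would then be followed by the clara create pipeline command as before.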
The metrics datasource should already be configured in Grafana. Follow these steps to create a new dashboard to view the results:
1. Click the + icon in the left-hand navbar and select Dashboard. An empty panel will appear with the option to Add Query.
2. Select the Add Query option and change the queried datastore from default to gpumetrics.
3. In the query block (Block A), change the metric from count to average and select gpu_utilization__pct.
4. You might need to adjust the timestamp window, as the data is stored in UTC time. On the top right, select the clock icon with Last 6 hours and select Custom time range. Change the From block to now-24h and the To block to now+24h. Zoom accordingly to view data.
You can add new panels to visualize different metrics. To view per-GPU metrics, add index:<GPU_NUMBER> in the Lucene query block (GPU_NUMBER is the index of the GPU on the machine). If the queried datastore is changed to cpumetrics, CPU-related metrics can be visualized.
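The same panel settings can be expressed as an Elasticsearch search body, which is useful for checking the stored data outside Grafana. The sketch below only builds the request as a Python dictionary. The metric name gpu_utilization__pct, the index:<GPU_NUMBER> Lucene filter, and the now-24h to now+24h window come from the steps above; the function name build_gpu_query and the timestamp field name are assumptions, so check the mapping created by create-mapping.py for the real field name.

```python
from typing import Optional


def build_gpu_query(gpu_index: Optional[int] = None,
                    metric: str = "gpu_utilization__pct",
                    time_field: str = "timestamp") -> dict:
    """Build a search body that averages one metric over a 48-hour window.

    time_field is an assumed field name, not taken from the Clara mapping.
    """
    # Wide window around "now", mirroring the custom time range in Grafana.
    filters = [{"range": {time_field: {"gte": "now-24h", "lte": "now+24h"}}}]
    if gpu_index is not None:
        # Same Lucene expression that would be typed into the query block.
        filters.append({"query_string": {"query": f"index:{gpu_index}"}})
    return {
        "size": 0,  # return only the aggregation, no individual documents
        "query": {"bool": {"filter": filters}},
        "aggs": {"avg_metric": {"avg": {"field": metric}}},
    }
```

The resulting dictionary could be sent with the search call of elasticsearch-py against the index behind the gpumetrics datasource; the exact index name depends on how the datasource is configured.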
To create custom pipelines with monitoring, update the pipeline definition .yaml file to contain the monitor-start and monitor-stop stages. The following examples show two liver-tumor pipelines (located at clara-reference-app/Pipelines/LiverTumorPipeline/), the first without monitoring and the second with monitoring:
# liver-tumor-pipeline.yaml
api-version: 0.4.0
name: liver-tumor-pipeline
operators:
# dicom reader operator
# Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
# Output: '/output' for saving converted volume image in MHD format to file whose name
#         is the same as the DICOM series instance ID.
- name: dicom-reader
  description: Converts DICOM instances into MHD, one file per DICOM series.
  container:
    image: clara/dicom-reader
    tag: latest
  variables:
    NVIDIA_CLARA_DCM_TO_FORMAT: nii.gz
  input:
  - path: /input
  output:
  - path: /output
...
# liver-tumor-monitoring.yaml
api-version: 0.3.0
name: liver-tumor-pipeline
operators:
- name: monitor-start
  description: Start the monitoring service
  container:
    image: clara/monitor-server
    tag: latest
    command: ["python3", "client.py", "--start", "-i", "REPLACE-IP-ADDRESS"]
  input:
  - path: /input
  output:
  - path: /output
# dicom reader operator
# Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
# Output: '/output' for saving converted volume image in MHD format to file whose name
#         is the same as the DICOM series instance ID.
- name: dicom-reader
  description: Converts DICOM data into MHD.
  container:
    image: clara/dicom-reader
    tag: latest
  input:
  - from: monitor-start
    path: /monitor_start
  - path: /input
  output:
  - path: /output
...
- name: register-dicom-output
  description: Register converted DICOM instances with Results Service to be sent to external DICOM devices.
  container:
    image: clara/register-results
    tag: latest
    command: ["python", "register.py", "--agent", "ClaraSCU", "--data", "[\"MYPACS\"]"]
  input:
  - from: dicom-writer
    name: dicom
    path: /input
  output:
  - path: /output
- name: monitor-stop
  description: Stop the monitoring service
  container:
    image: clara/monitor-server
    tag: latest
    command: ["python3", "client.py", "--stop", "-i", "REPLACE-IP-ADDRESS"]
  input:
  - from: register-dicom-output
    path: /reg_output
  output:
  - path: /output
The monitor-start and monitor-stop stages can be added at any point in the pipeline definition; they are not limited to the first and last stages.
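The difference between the two definitions can be summarized as a small transformation on the parsed pipeline. The sketch below works on a plain Python dictionary, as a YAML loader would produce, and inserts the two stages around a chosen range of operators. It is an illustration under stated assumptions rather than a Clara utility: the function names add_monitoring and _monitor_stage, the start_before and stop_after parameters, and the /monitored_output mount path are hypothetical, while the stage names, image, and commands are copied from the example above.

```python
import copy
from typing import Optional

PLACEHOLDER = "REPLACE-IP-ADDRESS"


def _monitor_stage(name: str, flag: str, description: str, inputs: list) -> dict:
    """Build a monitor-start or monitor-stop operator like the ones in the example."""
    return {
        "name": name,
        "description": description,
        "container": {
            "image": "clara/monitor-server",
            "tag": "latest",
            "command": ["python3", "client.py", flag, "-i", PLACEHOLDER],
        },
        "input": inputs,
        "output": [{"path": "/output"}],
    }


def add_monitoring(pipeline: dict,
                   start_before: Optional[str] = None,
                   stop_after: Optional[str] = None) -> dict:
    """Return a copy of pipeline with monitoring stages around a range of operators.

    By default the range is the whole pipeline (first to last operator).
    """
    result = copy.deepcopy(pipeline)
    operators = result.get("operators") or []
    if not operators:
        raise ValueError("pipeline has no operators to monitor")
    names = [op["name"] for op in operators]
    first = names.index(start_before or names[0])  # ValueError if unknown name
    last = names.index(stop_after or names[-1])
    if last < first:
        raise ValueError("stop_after must not come before start_before")

    # The first monitored operator depends on monitor-start, as dicom-reader does above.
    operators[first].setdefault("input", []).insert(
        0, {"from": "monitor-start", "path": "/monitor_start"})
    start = _monitor_stage("monitor-start", "--start",
                           "Start the monitoring service", [{"path": "/input"}])
    # "/monitored_output" is an assumed mount path; the example uses /reg_output.
    stop = _monitor_stage("monitor-stop", "--stop", "Stop the monitoring service",
                          [{"from": names[last], "path": "/monitored_output"}])
    operators.insert(last + 1, stop)  # insert the later stage first so indices stay valid
    operators.insert(first, start)
    return result
```

Passing start_before and stop_after lets the monitored range cover only part of the pipeline. The REPLACE-IP-ADDRESS placeholder is left in place so the host address can still be substituted afterwards.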