Overview of the Clara Monitoring Platform

The Clara Monitoring Platform provides GPU, CPU, and disk metrics for running jobs. The library extends the NVIDIA Runtime System Utils (NRSU), an existing TensorRT library, for GPU-related metrics, and uses an open-source Python library for CPU and disk metrics. The table below highlights the metrics provided by each library. This feature is currently only supported with Argo-based pipelines (api-version 0.3.0).
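
The CPU and disk readings are the kind of values an open-source library such as psutil exposes. The snippet below is only a sketch of that idea: the documentation does not name the library, so psutil and the field names shown here are assumptions, not the platform's actual collection code.

import time
import psutil  # assumption: a psutil-like library supplies the CPU/disk readings

def sample_host_metrics(interval_sec=1.0):
    """Collect one sample of CPU and disk metrics (illustrative field names)."""
    return {
        "timestamp": time.time(),
        "cpu_utilization_pct": psutil.cpu_percent(interval=interval_sec),
        "memory_used_pct": psutil.virtual_memory().percent,
        "disk_used_pct": psutil.disk_usage("/").percent,
    }

if __name__ == "__main__":
    print(sample_host_metrics())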

# Navigate to the clara-reference-app and identify a pipeline with monitoring (like livertumor)
# Replace REPLACE-IP-ADDRESS with the IP address of the host (where the Clara pipeline will be run)
sed -i 's/REPLACE-IP-ADDRESS/<host_ip_address>/' clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml

# Create a pipeline from the YAML using the CLI
clara create pipeline -p clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml

# Obtain the cluster IP address of elasticsearch
kubectl get svc | grep elasticsearch

# Use the create-mapping file to create the metrics mapping needed (ensure python3 & elasticsearch-py are installed)
python3 metrics/create-mapping.py --ip <elasticsearch_cluster_ip> --port 9200

# Start the monitoring server
./start_server.sh -i <monitoring interval> -d <elasticsearch_cluster_ip> --port 9200

# Use the CLI to trigger a job with the pipeline created above
clara create jobs -p <pipeline_id> -n <job_name> -f <dataset_path>
clara start job -j <JOB ID>
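
The create-mapping.py step defines the field mapping that the metrics are indexed against. The sketch below illustrates that step using elasticsearch-py; the index name (gpumetrics) and the gpu_utilization__pct field are taken from the Grafana steps further down, while the remaining field names and types are assumptions and may not match the shipped script.

from elasticsearch import Elasticsearch

def create_metrics_mapping(es_ip, es_port=9200):
    """Create a 'gpumetrics' index with an assumed field mapping (illustrative only)."""
    es = Elasticsearch([{"host": es_ip, "port": es_port}])
    mapping = {
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
                "gpu_utilization__pct": {"type": "float"},  # field name from the dashboard steps
                "index": {"type": "integer"},               # GPU index on the machine (assumed)
            }
        }
    }
    # ignore=400 skips the error if the index already exists
    es.indices.create(index="gpumetrics", body=mapping, ignore=400)

if __name__ == "__main__":
    create_metrics_mapping("<elasticsearch_cluster_ip>")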

The metrics datasource should already be configured in Grafana. Follow these steps to create a new dashboard to view the results:

  1. Click the + icon in the left-hand navbar and select Dashboard. An empty panel will appear with the option to Add Query.

  2. Select the Add Query option and change the queried datastore from default to gpumetrics.

  3. In the query block (Block A), change the metric from count to average and select gpu_utilization__pct.

  4. You might need to adjust the timestamp window, as the data is stored in UTC time. On the top right, select the clock icon with Last 6 hours and select Custom time range. Change the From block to now-24h and the To block to now+24h. Zoom accordingly to view data.

You can add new panels to visualize different metrics. To view per-GPU metrics, add index:<GPU_NUMBER> in the Lucene query block (GPU_NUMBER is the index of the GPU on the machine). If the queried datastore is changed to cpumetrics, CPU-related metrics can be visualized.
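
If a panel stays empty, it can help to query the index directly before adjusting the dashboard. The sketch below is an optional sanity check, assuming elasticsearch-py and the gpumetrics index and gpu_utilization__pct field named above; it computes the same average that the Grafana query visualizes.

from elasticsearch import Elasticsearch

def average_gpu_utilization(es_ip, es_port=9200):
    """Return the average gpu_utilization__pct across all indexed samples (sanity check)."""
    es = Elasticsearch([{"host": es_ip, "port": es_port}])
    body = {
        "size": 0,
        "aggs": {"avg_gpu_util": {"avg": {"field": "gpu_utilization__pct"}}},
    }
    result = es.search(index="gpumetrics", body=body)
    return result["aggregations"]["avg_gpu_util"]["value"]

if __name__ == "__main__":
    print(average_gpu_utilization("<elasticsearch_cluster_ip>"))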

To create custom pipelines with monitoring, you need to update the pipeline definition .yaml file to contain the monitor-start and monitor-stop stages. The following examples show two liver-tumor pipelines (located at clara-reference-app/Pipelines/LiverTumorPipeline/), the first without monitoring and the second with monitoring:

# liver-tumor-pipeline.yaml
api-version: 0.4.0
name: liver-tumor-pipeline
operators:
  # dicom reader operator
  # Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
  # Output: '/output' for saving converted volume image in MHD format to file whose name
  # is the same as the DICOM series instance ID.
  - name: dicom-reader
    description: Converts DICOM instances into MHD, one file per DICOM series.
    container:
      image: clara/dicom-reader
      tag: latest
    variables:
      NVIDIA_CLARA_DCM_TO_FORMAT: nii.gz
    input:
      - path: /input
    output:
      - path: /output
...

# liver-tumor-monitoring.yaml
api-version: 0.3.0
name: liver-tumor-pipeline
operators:
  - name: monitor-start
    description: Start the monitoring service
    container:
      image: clara/monitor-server
      tag: latest
      command: ["python3", "client.py", "--start", "-i", "REPLACE-IP-ADDRESS"]
    input:
      - path: /input
    output:
      - path: /output
  # dicom reader operator
  # Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
  # Output: '/output' for saving converted volume image in MHD format to file whose name
  # is the same as the DICOM series instance ID.
  - name: dicom-reader
    description: Converts DICOM data into MHD.
    container:
      image: clara/dicom-reader
      tag: latest
    input:
      - from: monitor-start
        path: /monitor_start
      - path: /input
    output:
      - path: /output
  ...
  - name: register-dicom-output
    description: Register converted DICOM instances with Results Service to be sent to external DICOM devices.
    container:
      image: clara/register-results
      tag: latest
      command: ["python", "register.py", "--agent", "ClaraSCU", "--data", "[\"MYPACS\"]"]
    input:
      - from: dicom-writer
        name: dicom
        path: /input
    output:
      - path: /output
  - name: monitor-stop
    description: Stop the monitoring service
    container:
      image: clara/monitor-server
      tag: latest
      command: ["python3", "client.py", "--stop", "-i", "REPLACE-IP-ADDRESS"]
    input:
      - from: register-dicom-output
        path: /reg_output
    output:
      - path: /output

Note

The monitor-start and monitor-stop stages can be added at any point in the pipeline definition, and do not have to be limited to the first and last stage.
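
Both stages invoke client.py from the clara/monitor-server image with --start or --stop and the host IP of the monitoring server. The shipped client's internals are not documented here; the sketch below only illustrates the signalling pattern implied by those flags, and the HTTP transport, endpoint paths, and port are assumptions.

import argparse
import requests  # assumption: the client signals the monitoring server over HTTP

def main():
    parser = argparse.ArgumentParser(description="Illustrative monitoring client")
    parser.add_argument("-i", "--ip", required=True, help="IP address of the monitoring server host")
    parser.add_argument("--start", action="store_true", help="signal the server to start collecting metrics")
    parser.add_argument("--stop", action="store_true", help="signal the server to stop collecting metrics")
    args = parser.parse_args()

    action = "start" if args.start else "stop"
    # Assumed endpoint and port; the real monitor-server may expose a different interface.
    requests.post(f"http://{args.ip}:8000/{action}", timeout=10)

if __name__ == "__main__":
    main()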
