Overview of the Clara Monitoring Platform
The Clara Monitoring Platform provides GPU, CPU, and disk metrics for executing jobs. For GPU-related metrics, the library extends the NVIDIA Runtime System Utils (NRSU), an existing TensorRT library. For CPU and disk metrics, it uses an open-source Python library. The table below highlights the metrics provided by each library. This feature is currently supported only with Argo-based pipelines (0.3.0).
# Navigate to the clara-reference-app and identify a pipeline with monitoring (like livertumor)
# Replace REPLACE-IP-ADDRESS with the IP address of the host (where the clara pipeline will be run)
# (-i edits the file in place so that the next command picks up the change)
sed -i 's/REPLACE-IP-ADDRESS/<host_ip_address>/' clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml

# Create a pipeline with yaml using the cli
clara create pipeline -p clara-reference-app/Pipelines/LiverTumorPipeline/liver-tumor-monitoring.yaml

# Obtain the cluster IP address of elasticsearch
kubectl get svc | grep elasticsearch

# Use the create-mapping file to create the metrics mapping needed (ensure python3 & elasticsearch-py are installed)
python3 metrics/create-mapping.py --ip <elasticsearch_cluster_ip> --port 9200

# Start the monitoring server
./start_server.sh -i <monitoring interval> -d <elasticsearch_cluster_ip> --port 9200

# Use the cli to trigger a job with the pipeline created above
clara create jobs -p <pipeline_id> -n <job_name> -f <dataset_path>
clara start job -j <JOB ID>
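The placeholder substitution in the first step can also be done from Python, which makes it easy to validate the address before the pipeline is created. The following is a minimal sketch, not part of the Clara tooling: the function name `fill_host_ip` is hypothetical, and the only detail taken from the steps above is the `REPLACE-IP-ADDRESS` placeholder.

```python
import ipaddress

# Placeholder used in liver-tumor-monitoring.yaml (taken from the steps above).
PLACEHOLDER = "REPLACE-IP-ADDRESS"


def fill_host_ip(pipeline_text: str, host_ip: str) -> str:
    """Return pipeline_text with every placeholder replaced by host_ip.

    Hypothetical helper for illustration; it is not shipped with Clara.
    """
    # Raises ValueError if host_ip is not a well-formed IPv4/IPv6 address.
    ipaddress.ip_address(host_ip)
    if PLACEHOLDER not in pipeline_text:
        raise ValueError(f"{PLACEHOLDER} not found in pipeline definition")
    return pipeline_text.replace(PLACEHOLDER, host_ip)


if __name__ == "__main__":
    sample = 'command: ["python3", "client.py", "--start", "-i", "REPLACE-IP-ADDRESS"]'
    print(fill_host_ip(sample, "10.0.0.5"))
```

Unlike the `sed` expression, which replaces the first match on each line, `str.replace` substitutes every occurrence; for a definition where the placeholder appears once per line, the result is the same.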
The metrics datasource should already be configured in Grafana. Follow these steps to create a new dashboard to view the results:
Click the + icon in the left-hand navbar and select Dashboard. An empty panel will appear with the option to
Add Query option and change the queried datastore from
In the query block (Block A), change the metric from
You might need to adjust the timestamp window, as the data is stored in UTC time. On the top right, select the clock icon with Last 6 hours and select Custom time range. Change the
now+24h. Zoom accordingly to view data.
You can add New Panels for visualizing different metrics. To view per-GPU metrics, add
Lucene query block (
GPU_NUMBER is the index of the GPU on the machine). If the queried datastore is changed to cpumetrics, then CPU-related metrics can be visualized.
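A per-GPU filter of this kind is an ordinary Lucene `field:value` term, and the same filter can be wrapped in an Elasticsearch `query_string` search body if you want to inspect the stored metrics outside Grafana. The sketch below only builds the strings and dictionaries. The field name `gpu_index` is an assumption made for illustration; use whichever field the metrics mapping actually defines.

```python
def per_gpu_filter(gpu_number: int, field: str = "gpu_index") -> str:
    """Build a Lucene term filter such as 'gpu_index:0'.

    `field` is a hypothetical name; replace it with the field from the
    metrics mapping created by create-mapping.py.
    """
    if gpu_number < 0:
        raise ValueError("GPU_NUMBER must be a non-negative index")
    return f"{field}:{gpu_number}"


def search_body(lucene_query: str) -> dict:
    """Wrap a Lucene query string in an Elasticsearch query_string body."""
    return {"query": {"query_string": {"query": lucene_query}}}


if __name__ == "__main__":
    # Filter for the first GPU on the machine (index 0).
    print(search_body(per_gpu_filter(0)))
```

The string returned by `per_gpu_filter` is what would go into the Lucene query block of a Grafana panel; `search_body` shows the equivalent request body for a direct Elasticsearch search.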
To create custom pipelines with monitoring, you need to update the pipeline definition .yaml file to include monitor-start and monitor-stop stages. The following examples show two liver-tumor pipelines (located at clara-reference-app/Pipelines/LiverTumorPipeline/), the first without monitoring and the second with monitoring:
# liver-tumor-pipeline.yaml
api-version: 0.4.0
name: liver-tumor-pipeline
operators:
# dicom reader operator
# Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
# Output:'/output' for saving converted volume image in MHD format to file whose name
#         is the same as the DICOM series instance ID.
- name: dicom-reader
  description: Converts DICOM instances into MHD, one file per DICOM series.
  container:
    image: clara/dicom-reader
    tag: latest
  variables:
    NVIDIA_CLARA_DCM_TO_FORMAT: nii.gz
  input:
  - path: /input
  output:
  - path: /output
...
# liver-tumor-monitoring.yaml
api-version: 0.3.0
name: liver-tumor-pipeline
operators:
- name: monitor-start
  description: Start the monitoring service
  container:
    image: clara/monitor-server
    tag: latest
    command: ["python3", "client.py", "--start", "-i", "REPLACE-IP-ADDRESS"]
  input:
  - path: /input
  output:
  - path: /output
# dicom reader operator
# Input: '/input' mapped directly to the input of the pipeline, which is populated by the DICOM Adaptor.
# Output:'/output' for saving converted volume image in MHD format to file whose name
#         is the same as the DICOM series instance ID.
- name: dicom-reader
  description: Converts DICOM data into MHD.
  container:
    image: clara/dicom-reader
    tag: latest
  input:
  - from: monitor-start
    path: /monitor_start
  - path: /input
  output:
  - path: /output
...
- name: register-dicom-output
  description: Register converted DICOM instances with Results Service to be sent to external DICOM devices.
  container:
    image: clara/register-results
    tag: latest
    command: ["python", "register.py", "--agent", "ClaraSCU", "--data", "[\"MYPACS\"]"]
  input:
  - from: dicom-writer
    name: dicom
    path: /input
  output:
  - path: /output
- name: monitor-stop
  description: Stop the monitoring service
  container:
    image: clara/monitor-server
    tag: latest
    command: ["python3", "client.py", "--stop", "-i", "REPLACE-IP-ADDRESS"]
  input:
  - from: register-dicom-output
    path: /reg_output
  output:
  - path: /output
The monitor-start and monitor-stop stages can be added at any point in the pipeline definition; they do not have to be the first and last stages.
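The wiring shown in the monitored pipeline can be expressed as a small transformation over a list of operators. The sketch below works on plain Python dictionaries shaped like the YAML above and is only an illustration: the helper names and the `/monitored_output` mount path are assumptions, while the image, tag, and `client.py` commands are copied from the example definition.

```python
import copy


def _monitor_stage(name: str, flag: str, host_ip: str, description: str) -> dict:
    """Build a monitor-start or monitor-stop operator like those in the example."""
    return {
        "name": name,
        "description": description,
        "container": {
            "image": "clara/monitor-server",
            "tag": "latest",
            "command": ["python3", "client.py", flag, "-i", host_ip],
        },
        "input": [{"path": "/input"}],
        "output": [{"path": "/output"}],
    }


def add_monitoring(operators: list, host_ip: str = "REPLACE-IP-ADDRESS") -> list:
    """Return a new operator list wrapped in monitor-start/monitor-stop stages.

    Hypothetical helper; the input list is left unmodified.
    """
    if not operators:
        raise ValueError("pipeline has no operators to monitor")
    ops = copy.deepcopy(operators)
    start = _monitor_stage("monitor-start", "--start", host_ip,
                           "Start the monitoring service")
    stop = _monitor_stage("monitor-stop", "--stop", host_ip,
                          "Stop the monitoring service")
    # The first monitored operator takes an input from monitor-start,
    # so it runs only after monitoring has begun.
    ops[0].setdefault("input", []).insert(
        0, {"from": "monitor-start", "path": "/monitor_start"})
    # monitor-stop takes its input from the last monitored operator.
    # "/monitored_output" is an assumed mount path chosen for this sketch.
    stop["input"] = [{"from": ops[-1]["name"], "path": "/monitored_output"}]
    return [start, *ops, stop]


if __name__ == "__main__":
    pipeline = [{"name": "dicom-reader",
                 "input": [{"path": "/input"}],
                 "output": [{"path": "/output"}]}]
    for op in add_monitoring(pipeline, "10.0.0.5"):
        print(op["name"], op["input"])
```

This sketch wraps the whole pipeline. To monitor only part of it, pass the sub-list of operators of interest and splice the result back into the full definition.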