Production Deployment Setup

This guide shows the steps to set up the foundational Kubernetes (K8s) environment for the production deployment of the following reference applications in K8s mode:

Prerequisites

Developer Preview Sign-up

  • Go to Metropolis Microservices product page.

  • Select Enterprise GPU (x86 platform) option.

  • Fill out a short form.

  • Watch for the Developer Preview approval and NGC invitation emails (typically sent within a week) and follow their instructions.

Hardware Recommendations

  • RAM: 180GB

  • CPU: 32 cores

  • GPU Type: A100 or better, since the default models for K8s deployment are Transformer-based. GPU usage varies with the model used; performance metrics for different models, and instructions for changing them, are available on the KPI and Customization pages, respectively, of each reference workflow/app.

  • GPU: 4 GPUs (MTMC or RTLS or OA app); 1 GPU (FSL app)

Software Requirements

  • Ubuntu 22.04

  • NVIDIA Driver v535.104.12

  • NVIDIA GPU Operator v23.3.2

  • Kubernetes v1.27.1

  • Helm 3.8+

Note

  • The Multi-Camera Tracking, Real Time Location System (RTLS), and Occupancy Analytics (OA) apps use the GPU Operator to schedule pods for microservices that require a GPU for app functionality.

  • The Multi-Camera Tracking app uses GPUs for the following microservices: VST (bbox overlay & WebRTC) and Perception (DeepStream pipeline with multiple workloads; 3 workloads are configured as part of app installation).

  • The OA app uses GPUs for the following microservices: Behavior Learning (learning patterns in object behaviors), Triton (inferencing), VST, and Perception (DeepStream pipeline).

  • The RTLS app uses GPUs for the following microservices: VST (bbox overlay & WebRTC) and Perception (DeepStream pipeline with multiple workloads; 3 workloads are configured as part of app installation).

  • Some of the microservice pods will not fully utilize their GPUs. Except for the Perception microservice, which requires a dedicated GPU allocation, the others can share a GPU.

  • GPU sharing can be enabled by using NVIDIA_VISIBLE_DEVICES to assign the same GPU ID to different pods, making sure it does not overlap with the DeepStream (DS) GPU_ID. GPU Operator allocation for pod GPU assignment has to be disabled.
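
As a sketch only: the deployment names and GPU index below are placeholders, not taken from the reference charts. The commands are printed as a dry run so they can be reviewed before applying:

```shell
# Dry-run sketch: pin shareable (non-Perception) pods to one GPU via
# NVIDIA_VISIBLE_DEVICES. Deployment names are placeholders; pick an index
# that does NOT collide with the DeepStream GPU_ID in your values files.
SHARED_GPU_ID=0
for dep in triton behavior-learning; do
  # Remove the leading `echo` to actually patch the deployments.
  echo kubectl set env deployment/"$dep" NVIDIA_VISIBLE_DEVICES="$SHARED_GPU_ID"
done
```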

Install Kubernetes Systems

We recommend installing Kubernetes Systems via NVIDIA Cloud Native Stack by following the instructions here.

NGC Access

Once your system is configured with a K8s cluster installed via NVIDIA Cloud Native Stack, follow the steps below to create image pull secrets for pulling the Docker containers from NVIDIA’s private NGC registry:

kubectl create secret docker-registry ngc-docker-reg-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<NGC_API_KEY>
kubectl create secret generic ngc-api-key-secret --from-literal=NGC_CLI_API_KEY=<NGC_API_KEY>
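
To avoid pasting the key inline twice, the two commands above can be generated from an NGC_API_KEY environment variable. This is a dry-run sketch; remove the leading `echo`s to execute against your cluster:

```shell
# Dry-run sketch: emit both secret-creation commands from $NGC_API_KEY.
# The <NGC_API_KEY> placeholder is kept if the variable is unset.
NGC_API_KEY="${NGC_API_KEY:-<NGC_API_KEY>}"
echo kubectl create secret docker-registry ngc-docker-reg-secret \
  --docker-server=nvcr.io --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
echo kubectl create secret generic ngc-api-key-secret \
  --from-literal=NGC_CLI_API_KEY="$NGC_API_KEY"
```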

To install the software with helm, you’ll need access to NGC.

  • Org:

    • nv-mdx (prod-artifacts)

    • nv-media-service (Media Service Prod NGC Repo “rxczgrvsg8nx”)

  • Prod Team:

    • mdx-v2-0

  • Media Service Team:

    • vst-1-0

Deploy Foundational Systems & Monitoring

Attention

If you’re migrating from a previous version, please refer to the Upgrade Guide for additional instructions & suggestions.

  1. To download the latest packages for K8s Deployment and Sample Input Data, there are two approaches.

    • Approach 1: NGC User Interface (UI)

    Log into NGC portal and select nv-mdx Org & mdx-v2-0 Team.

    1. Download the latest K8s Deployment package: application-helm-configs-<version>.tar.gz.

    2. Download the latest Sample Input Data package: metropolis-apps-data-<version>.tar.gz, and place it in the same folder.

    • Approach 2: Download and install the NGC CLI tool and execute the following commands:

    While using ngc config, use the nv-mdx Org & mdx-v2-0 Team. Also, obtain the latest versions of the K8s Deployment and Sample Input Data packages from the NGC UI and update the commands below accordingly.

    1. For the latest K8s Deployment package:

    $ ngc registry resource download-version "nfgnkvuikvjm/mdx-v2-0/metropolis-apps-k8s-deployment:<version>-<mmddyyyy>"
    
    2. For the Sample Input Data:

    $ ngc registry resource download-version "nfgnkvuikvjm/mdx-v2-0/metropolis-apps-sample-input-data:<version>-<mmddyyyy>"
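
The two downloads above can also be scripted with a shared tag. This is a dry-run sketch: the `<version>-<mmddyyyy>` placeholder must be replaced with the real tag from the NGC UI, and the leading `echo` removed to execute:

```shell
# Dry-run sketch: download both packages with one tag. Replace TAG with the
# real <version>-<mmddyyyy> value from the NGC UI, then drop `echo` to run.
TAG="<version>-<mmddyyyy>"
for res in metropolis-apps-k8s-deployment metropolis-apps-sample-input-data; do
  echo ngc registry resource download-version "nfgnkvuikvjm/mdx-v2-0/$res:$TAG"
done
```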
    
  2. Extract the contents of the application-helm-configs.tar.gz file by running:

    $ tar xvf application-helm-configs.tar.gz
    

    This will create the following application-helm-configs directory, whose contents will be used during the Helm chart installation stage:

    application-helm-configs/
    ├── FSL
    │  ├── deepstream-fsl-values.yaml
    │  ├── elasticsearch.yaml
    │  └── fsl-app-values.yaml
    ├── LICENSE.md
    ├── MTMC
    │  ├── calibration.json
    │  ├── images
    │  │  ├── building=Nvidia-Bldg-K-Map.png
    │  │  └── imagesMetadata.json
    │  ├── mtmc-app-override-values.yaml
    │  ├── mtmc_kibana_objects.ndjson
    │  ├── nvstreamer-with-ingress-values.yaml
    │  ├── vst-app-edge-with-ingress-values.yaml
    │  ├── vst-app-with-ingress-values.yaml
    │  └── wdm-deepstream-mtmc-values.yaml
    ├── RTLS
    │  ├── calibration.json
    │  ├── images
    │  │  ├── building=Retail-Store-Map.png
    │  │  └── imagesMetadata.json
    │  ├── rtls-app-override-values.yaml
    │  ├── rtls_kibana_objects.ndjson
    │  ├── nvstreamer-with-ingress-values.yaml
    │  ├── vst-app-edge-with-ingress-values.yaml
    │  ├── vst-app-with-ingress-values.yaml
    │  └── wdm-deepstream-rtls-values.yaml
    ├── foundational-sys
    │  └── foundational-sys-monitoring-override-values.yaml
    ├── notebook-calibration
    │  └── calib-notebook-app-values.yaml
    └── people-analytics
            ├── calibration.json
            ├── images
            │  ├── Endeavor_Cafeteria.png
            │  ├── Nth_Street_Cafe_Entrance.png
            │  └── imageMetadata.json
            ├── nvstreamer-with-ingress-values.yaml
            ├── people-analytics-app-override-values.yaml
            ├── people_analytics_kibana_objects.ndjson
            ├── vst-app-edge-with-ingress-values.yaml
            ├── vst-app-with-ingress-values.yaml
            └── wdm-deepstream-ppl-values.yaml
    
  3. Extract the contents of the metropolis-apps-data-<version>.tar.gz file by running:

    $ tar xvf metropolis-apps-data.tar.gz
    

    This will create the following metropolis-apps-data directory:

    metropolis-apps-data
            ├── data_log
            │   ├── behavior_learning_data
            │   ├── calibration_toolkit
            │   ├── elastic
            │   ├── kafka
            │   └── zookeeper
            │       ├── data
            │       └── log
            ├── playback
            └── videos
                    ├── mtmc-app
                    ├── rtls-app
                    ├── people-analytics-app
                    └── heatmap-app
    
  4. Deploy the foundational systems, used for storage and monitoring. This is generally done only once.

    helm install mdx-foundation-sys-svcs --wait https://helm.ngc.nvidia.com/nfgnkvuikvjm/mdx-v2-0/charts/mdx-foundation-sys-svcs-v1.3.tgz --username='$oauthtoken' --password=YOUR_API_KEY -f application-helm-configs/foundational-sys/foundational-sys-monitoring-override-values.yaml
    

This process will create:

  1. Nginx ingress controller class

    • nginx

  2. Storage classes

    • mdx-local-path - Rancher local path provisioner (RWO)

    • mdx-nfs - NFS provisioner

    • mdx-hostpath - hostpath provisioner

  3. Monitoring

    • mdx-kube-prometheus-stack - kube-prometheus monitoring stack

Note

  • Since we are using --wait, the installation will take some time to complete, until all pods are up.

  • Grafana Dashboard UI will be available at http://<k8s_node_ip>:32300.

  • The Grafana UI password can be changed in the override values file foundational-sys-monitoring-override-values.yaml, present in the application-helm-configs folder; set the new password via the adminPassword parameter.

  • The monitoring chart feeds Grafana with data from Prometheus.

  • Dashboards are pre-configured for GPU monitoring and K8s-related resource utilization.
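
For example, the adminPassword change mentioned above might look like this in foundational-sys-monitoring-override-values.yaml. The exact key nesting depends on the chart's values layout (the form shown follows the upstream kube-prometheus-stack convention), so treat this as a sketch and verify against the actual file:

```yaml
# Sketch only: nesting follows the upstream kube-prometheus-stack chart
# (grafana.adminPassword); confirm against the shipped override values file.
grafana:
  adminPassword: "my-new-grafana-password"
```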

Now that your Kubernetes systems are configured, you are ready to deploy the following reference apps:

Verify Deployment Installation

kubectl get pods -owide
kubectl get pods -owide | grep -v 'Compl\|Runn'   # check whether any pod is not running or is failing
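
The same check can be wrapped in a small helper. This is a sketch (the function name is ours, not part of the product) that reads `kubectl get pods --no-headers` output on stdin, so it can also be tested offline:

```shell
# Helper sketch: print the names of pods that are neither Running nor
# Completed. Feed it `kubectl get pods --no-headers` output on stdin.
failing_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}
# Live usage:
#   kubectl get pods --no-headers | failing_pods
```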

Troubleshoot Pod/Deployment Failures

Steps to follow:

  • Check the events for the failed/crashed pods: kubectl describe pod <Failed_Pod_Name>.

  • View logs of failed pods to find failure error using: kubectl logs -f <Failed_Pod_Name>.

  • View the logs of a specific container inside a pod using kubectl logs -f <Failed_Pod_Name> -c <failed_pod_container_name>. (The container names for a pod are listed in the output of kubectl describe pod <Failed_Pod_Name>.)

  • If a pod is not running due to K8s scheduling, the events will show the failure errors. If a pod is crashing, the pod/container logs will show why it failed to start.
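
The steps above can be strung together. Here is a sketch (the function name is ours) that, given `kubectl get pods --no-headers` output, prints the diagnostic commands to run for each unhealthy pod:

```shell
# Sketch: emit `kubectl describe` / `kubectl logs` commands for every pod
# that is neither Running nor Completed. Pipe the output to `sh` to execute.
diagnose_cmds() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }' |
  while read -r pod; do
    printf 'kubectl describe pod %s\n' "$pod"
    printf 'kubectl logs %s --all-containers\n' "$pod"
  done
}
# Live usage:
#   kubectl get pods --no-headers | diagnose_cmds | sh
```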