Kubernetes Deployment

This section describes the configuration steps to perform on the BCM head node.

  1. In a root shell on the head node, run the Kubernetes setup script.

    cm-kubernetes-setup

  2. Select Deploy to continue.

    _images/kube-deploy-01.png
  3. Select the newest version certified for NVIDIA AI Enterprise.

    _images/kube-deploy-02.png
  4. (Optional) If required, enter a private registry server.

    _images/kube-deploy-03.png
  5. Keep the default settings for the cluster.

    _images/kube-deploy-04.png
  6. Select yes to allow the cluster to be used from the head node.

    _images/kube-deploy-05.png
  7. Select internalnet, since that is the network the Kubernetes control plane nodes use.

    _images/kube-deploy-06.png
  8. Select the 3 knodes for the control plane.

    _images/kube-deploy-07.png
  9. Select the dgx-h100 category for Kubernetes workers.

    _images/kube-deploy-09.png
  10. Skip selecting individual worker nodes.

    _images/kube-deploy-10.png
  11. Select the 3 knodes to be etcd nodes.

    _images/kube-deploy-11.png
  12. Accept the default values.

    _images/kube-deploy-12.png
  13. Select the Calico network plugin.

    _images/kube-deploy-13.png
  14. Select no when asked whether to install the Kyverno Policy Engine.

    _images/kube-deploy-14.png
  15. (Optional) If an NVIDIA AI Enterprise (NVAIE) license has been provided, select yes and enter the license details on the following page. Otherwise, select no.

    _images/kube-deploy-14-2.png
  16. Select the following operators to install:

    • NVIDIA GPU Operator

    • Network Operator

    • Prometheus Adapter

    • Prometheus Operator Stack

    • cm-jupyter-kernel-operator

    • cm-kubernetes-mpi-operator

    _images/kube-deploy-15.png
  17. Select the latest version certified for NVIDIA AI Enterprise.

    _images/kube-deploy-16.png
  18. On the next screen, again select the latest version certified for NVIDIA AI Enterprise.

    _images/kube-deploy-17.png
  19. Leave the custom YAML file blank.

    _images/kube-deploy-18.png
  20. Configure the GPU Operator by selecting both cdi.enabled and nfd.enabled.

    _images/kube-deploy-18-2.png
  21. Leave the custom YAML file blank.

    _images/kube-deploy-18-3.png
  22. Configure the Network Operator by selecting the following:

    • nfd.enabled

    • sriovNetworkOperator.enabled

    • deployCR

    • secondaryNetwork.deploy

    • secondaryNetwork.cniPlugins.deploy

    • secondaryNetwork.multus.deploy

    • secondaryNetwork.ipamPlugin.deploy

    _images/kube-deploy-19.png
  23. Select all the add-ons for deployment.

    _images/kube-deploy-20.png
  24. Leave the ports at their default values.

    _images/kube-deploy-21.png
  25. Choose yes to install the Permission Manager.

    _images/kube-deploy-22.png
  26. Select both enabled and default for the local storage path.

    _images/kube-deploy-23.png
  27. Leave the storage path as default.

    _images/kube-deploy-24.png
  28. Choose to save the configuration and deploy.

    _images/kube-deploy-25.png
  29. Keep the default file path for the config file and continue.

    _images/kube-deploy-26.png
  30. Once the Kubernetes setup has finished, check that all nodes are online and report a Ready status.

    root@bcm10-headnode:~# kubectl get nodes
    NAME       STATUS   ROLES                  AGE     VERSION
    dgx-01     Ready    worker                 5m56s   v1.28.10
    dgx-02     Ready    worker                 5m49s   v1.28.10
    ...
    knode-01   Ready    control-plane,master   6m26s   v1.28.10
    knode-02   Ready    control-plane,master   5m47s   v1.28.10
    knode-03   Ready    control-plane,master   5m56s   v1.28.10
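The node check in step 30 can also be scripted instead of read by eye. The following is a minimal sketch, not part of BCM: the function name all_nodes_ready is hypothetical, and it assumes the default `kubectl get nodes` column layout (NAME, STATUS, ROLES, AGE, VERSION). It reads that output on standard input, so it can be tried without a cluster. It prints any node whose status is not exactly Ready and returns a non-zero exit code in that case.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BCM or kubectl): check that every node is Ready.
# Input: `kubectl get nodes --no-headers` on stdin, assuming the default columns
#   NAME STATUS ROLES AGE VERSION
# Prints each node whose STATUS is not exactly "Ready" and returns 1 if any exist
# (or if no nodes were listed); otherwise prints a summary and returns 0.
all_nodes_ready() {
  awk '
    NF == 0 { next }                     # ignore blank lines
    { total++ }
    $2 != "Ready" {                      # e.g. NotReady, Ready,SchedulingDisabled
      print "not ready: " $1 " (" $2 ")"
      bad++
    }
    END {
      if (total == 0) { print "no nodes found"; exit 1 }
      if (bad > 0) exit 1
      print total " node(s) Ready"
    }
  '
}

# Usage on the head node:
#   kubectl get nodes --no-headers | all_nodes_ready
```

As an alternative that uses only kubectl itself, `kubectl wait --for=condition=Ready nodes --all --timeout=600s` blocks until every node reports Ready or the timeout expires; the 600-second timeout is an arbitrary example value.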
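Besides node status, it can help to confirm that the operators selected in step 16 have come up. The sketch below is also hypothetical: the function name unhealthy_pods is illustrative, and it assumes the default `kubectl get pods -A` columns (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). It flags every pod that has neither Completed nor reached Running with all containers ready. An optional regular expression restricts the check to matching namespaces. The namespace names used in the usage line are assumptions, so verify the names the setup actually created with `kubectl get namespaces`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BCM or kubectl): list pods that are not healthy.
# Input: `kubectl get pods -A --no-headers` on stdin, assuming the default columns
#   NAMESPACE NAME READY STATUS RESTARTS AGE
# Optional $1: regex of namespaces to check (default "." matches all namespaces).
# Namespace names such as gpu-operator are assumptions; verify with `kubectl get namespaces`.
# Returns 1 if any matching pod is unhealthy or if no matching pods were listed.
unhealthy_pods() {
  awk -v ns="${1:-.}" '
    NF == 0 || $1 !~ ns { next }         # skip blank lines and other namespaces
    {
      checked++
      if ($4 == "Completed") next        # finished Job pods are expected
      split($3, r, "/")                  # READY column, e.g. "1/1"
      if ($4 != "Running" || r[1] != r[2]) {
        print $1 "/" $2 "  ready=" $3 "  status=" $4
        bad++
      }
    }
    END { if (checked == 0 || bad > 0) exit 1 }
  '
}

# Usage on the head node (namespace names are assumptions):
#   kubectl get pods -A --no-headers | unhealthy_pods 'gpu-operator|network-operator'
```

Operator pods can take several minutes to settle after deployment, so a non-empty report immediately after setup does not necessarily indicate a problem; rerunning the check after a short wait is usually more informative.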