Kubernetes Installation

The installation of Kubernetes (K8s) is done through the cm-kubernetes-setup tool included with BCM 11.

Two Kubernetes clusters are created as part of the NMC software stack. This section covers the first, “k8s-admin”, which hosts the system software that supports administrative functions.

cm-kubernetes-setup

  1. From the active BCM headnode, run the cm-kubernetes-setup command.

  2. Choose Deploy and then select Ok.

    Deploy Kubernetes
  3. Choose Kubernetes version v1.32 and then select Ok.

    Choose Kubernetes Version
  4. If K8s is being reinstalled, an error about version conflicts (left over from previous unsuccessful deployments) might appear, as in the following screen. In this case, select Ok to proceed.

    K8s Version Conflict
  5. Provide a DockerHub container registry mirror if required; otherwise, leave the field blank.

    DockerHub Container Registry Mirror
  6. Configure the K8s cluster name and networking. This is where the name “k8s-admin” is defined; a module config file with the same name will be created.

    1. The network subnets must be in private address space (per RFC 1918). Use the default values, and modify them only if necessary, for example if they conflict with other internal subnets within the network.

    2. Ensure that the domain names of the networks are configured correctly. Modify them as required so that they match the “Kubernetes External FQDN”, using the same domain as in 2.2 Customer DNS records.

    Kubernetes External FQDN
    3. Ensure that the subnets above do not overlap with the customer’s private IP ranges. The Pod Network subnet cannot be changed without reinstalling the cluster.
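
As a rough pre-check for the subnet requirements in this step, the following shell sketch tests whether an IPv4 CIDR lies inside RFC 1918 space and whether two CIDRs overlap. The function names are illustrative and are not part of BCM or cm-kubernetes-setup; the sketch handles IPv4 only and assumes a shell with 64-bit arithmetic, such as bash or dash.

```shell
# Illustrative helpers (not BCM commands): IPv4 CIDR checks in POSIX sh.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# First address of a CIDR such as 192.168.0.0/16, as an integer.
cidr_start() {
    base=$(ip_to_int "${1%/*}")
    size=$(( 1 << (32 - ${1#*/}) ))
    echo $(( base - base % size ))
}

# Last address of a CIDR, as an integer.
cidr_end() {
    size=$(( 1 << (32 - ${1#*/}) ))
    echo $(( $(cidr_start "$1") + size - 1 ))
}

# Succeed if CIDR $1 is fully contained in CIDR $2.
cidr_within() {
    [ "$(cidr_start "$1")" -ge "$(cidr_start "$2")" ] &&
    [ "$(cidr_end "$1")" -le "$(cidr_end "$2")" ]
}

# Succeed if the CIDR is inside one of the three RFC 1918 blocks.
is_rfc1918() {
    cidr_within "$1" 10.0.0.0/8 ||
    cidr_within "$1" 172.16.0.0/12 ||
    cidr_within "$1" 192.168.0.0/16
}

# Succeed if the two CIDRs share at least one address.
cidrs_overlap() {
    [ "$(cidr_start "$1")" -le "$(cidr_end "$2")" ] &&
    [ "$(cidr_start "$2")" -le "$(cidr_end "$1")" ]
}
```

For example, `is_rfc1918 10.150.0.0/16 && echo private` checks a candidate subnet (the value here is only an example), and `cidrs_overlap A B && echo conflict` flags a clash between a wizard subnet A and a customer range B.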

  7. Choose no when asked whether to expose the Kubernetes API servers to the cluster’s external network and then select Ok.

    The external network is defined in BCM base partition:

    cmsh -c "partition get base externalnetwork"
    
    Exposing the Kubernetes API servers to the cluster's external network
  8. Choose the internal network that will be used by the K8s nodes.

    In a DGX SuperPOD deployment, the management and compute nodes may be on different BCM networks. Choose the network used by the management nodes during this step.

    Internal Network
  9. Select at least three K8s control nodes.

    Select at least three K8s control nodes
  10. Choose the BCM node categories for the K8s worker node pool.

    BCM Node Categories
  11. Optional: on the “Choose individual Kubernetes worker nodes” TUI screen, do not make any selections. Select Ok to proceed to the next step.

    Choose individual Kubernetes worker nodes
  12. Choose the Etcd nodes and then select Ok.

    Choose the same three nodes as the K8s control nodes.

    Choose Etcd nodes
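
The reason usually given for running at least three control and Etcd nodes is quorum: Etcd needs a majority of its members to be available in order to keep operating. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not BCM or Etcd commands.

```shell
# Illustrative helpers (not BCM or Etcd commands).

# Majority quorum size for an Etcd cluster with $1 members.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster keeps quorum.
etcd_fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}
```

With three members the quorum is two, so one node can be lost; a single-member or two-member cluster tolerates no failures.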
  13. Ignore the following message if it appears.

    Ignore the following message if it appears
  14. Set the ports as shown below and do not modify the Etcd spool directory.

    Set the ports as shown below and do not modify the Etcd spool directory
  15. Choose Calico as the network plugin and then select Ok.

    Choose Calico as the network plugin
  16. Choose no when asked “Install Kyverno Policy Engine?” and then select Ok. Kyverno can be enabled at a later stage.

    Choose no for install Kyverno Policy Engine
  17. Choose the operators to be installed and then select Ok.

    • Grafana Promtail, Grafana Loki, Ingress NGINX Controller

    • Kubernetes Dashboard, Kubernetes Metrics Server

    • Kubernetes State metrics

    • Prometheus Operator stack, Prometheus Adapter

    Choose the operators to be installed
  18. Choose yes when asked to expose the Ingress service over port 443 and then select Ok.

    Choose yes to expose the Ingress service over port 443
  19. Keep the default value in the next screen and then select Ok.

    Keep the default value in the next screen
  20. Keep the Ingress HTTPS port as 30443 (default value) and then select Ok.

    Keep the Ingress HTTPS port as 30443
  21. Choose no to install BCM NVIDIA packages and then select Ok.

    Choose no to install BCM NVIDIA packages
  22. Choose no to install the BCM NVIDIA Container Toolkit package and then select Ok.

    Choose no to install the BCM NVIDIA Container Toolkit package
  23. Choose yes to install the Permissions Manager and then select Ok.

    Choose yes to install the Permissions Manager
  24. Choose Local path as a StorageClass. Confirm with Enter, not Tab (pressing Tab on this screen triggers a bug).

    If Tab is pressed by mistake, press r for retry to return to this screen.

    Choose Local path as a StorageClass
  25. Keep the default path to store data and then select Ok.

    Keep the default path to store data
  26. Choose yes to enable local persistent storage for Grafana and then select Ok.

    Choose yes to enable local persistent storage for Grafana
  27. Choose SimpleScalable for the Loki deployment mode and then select Ok.

    Choose SimpleScalable for the Loki deployment mode
  28. Select both options for Loki access and then select Ok.

    Select both options for Loki access
  29. Configure Username and Password for Loki (basic auth via ingress-nginx).

    Set both to loki; they can be changed later.

    Configure Username and Password for Loki
  30. Choose the varlog option for Grafana Promtail and then select Ok.

    Choose the varlog option for Grafana Promtail
  31. Choose save config & deploy and then select Ok.

    At this point the deployment starts. Halfway through the deployment, all nodes that are members of the K8s cluster are rebooted, and the installer waits up to 60 minutes for all nodes to come back online.

    Choose save config & deploy
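
The installer handles the reboot wait itself. When scripting follow-up checks, a similar bounded wait can be sketched as below. This is only an illustration of the pattern, not how cm-kubernetes-setup is implemented; `wait_until` is a hypothetical helper, and it assumes a `date` that supports `+%s`, as on Linux.

```shell
# Hypothetical helper: retry a command until it succeeds or a deadline passes.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    timeout=$1; interval=$2; shift 2
    deadline=$(( $(date +%s) + timeout ))
    # Re-run the command; give up once the deadline has been reached.
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}
```

For example, once the deployment has finished and the kubernetes module is loaded, `wait_until 3600 30 kubectl wait --for=condition=Ready nodes --all --timeout=10s` polls for up to 60 minutes until every node reports Ready.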
  32. Validate the following K8s configuration:

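    # Load the kubernetes module
    module load kubernetes
    # List the cluster nodes and check their status
    kubectl get nodes
    # List the pods in all namespaces and check their status
    kubectl get pods -A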
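
To make this validation easier to script, the output of the two kubectl commands can be filtered for anything that looks unhealthy. The following is a minimal sketch, assuming the default kubectl column layout (STATUS is the second column for nodes and the fourth for pods with -A); the function names are illustrative.

```shell
# Illustrative filters for kubectl output (not part of BCM).

# Print the names of nodes whose STATUS is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
# A cordoned node ("Ready,SchedulingDisabled") is also reported.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Print "namespace/pod" for pods that are neither Running nor Completed.
# Expects `kubectl get pods -A --no-headers` output on stdin.
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
}
```

Run them as `kubectl get nodes --no-headers | not_ready_nodes` and `kubectl get pods -A --no-headers | unhealthy_pods`; empty output from both suggests the cluster has settled.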