Kubernetes Installation

Kubernetes (K8s) is installed with the cm-kubernetes-setup tool included with BCM 11.

Two Kubernetes clusters are created as part of the NMC software stack. This section covers the first, “k8s-admin”, which hosts the system software that supports administrative functions.

cm-kubernetes setup

To deploy the cluster with cm-kubernetes-setup, follow the steps below:

  1. From the active BCM headnode, run the cm-kubernetes-setup command.

  2. Choose Deploy and then select Ok.

    Deploy Kubernetes
  3. Choose Kubernetes version v1.32 and then select Ok.

    Choose Kubernetes Version
  4. If K8s is being reinstalled, an error about version conflicts (from previous unsuccessful deployments) might be shown, as in the following screen. In this case, select Ok to proceed.

    K8s Version Conflict
  5. Provide a DockerHub Container registry mirror, if required, otherwise leave blank.

    DockerHub Container Registry Mirror
  6. Configure the K8s cluster name and networking. The name “k8s-admin” is defined here, and a module config file with the same name is created.

    1. The network subnets must be in private address space (per RFC 1918). Use the default values and modify them only if necessary, for example if they conflict with other internal subnets within the network.

    2. Ensure that the domain names of the networks are configured correctly, making any modifications needed so that the “Kubernetes External FQDN” uses the same domain as in 2.2 Customer DNS records.

    Kubernetes External FQDN
    3. Ensure that the subnets above do not overlap with the customer’s private IP ranges. The Pod Network subnet cannot be changed without reinstalling the cluster.
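The two subnet rules in this step (private address space per RFC 1918, and no overlap with other ranges) can be checked before the values are entered. The following is a minimal sketch in POSIX shell, not part of cm-kubernetes-setup; the helper names `ip_to_int`, `cidr_range`, `is_rfc1918`, and `cidrs_overlap` are hypothetical, and the CIDR values in the usage note are placeholders rather than confirmed defaults.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the first and last address of a CIDR block as two integers.
cidr_range() {
    addr=${1%/*}
    prefix=${1#*/}
    ip=$(ip_to_int "$addr")
    size=$(( 1 << (32 - prefix) ))
    first=$(( ip / size * size ))   # round down to the block boundary
    last=$(( first + size - 1 ))
    echo "$first $last"
}

# Succeed if the CIDR block lies entirely inside one RFC 1918 range.
is_rfc1918() {
    set -- $(cidr_range "$1")
    for block in 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16; do
        b=$(cidr_range "$block")
        b_first=${b% *}
        b_last=${b#* }
        if [ "$1" -ge "$b_first" ] && [ "$2" -le "$b_last" ]; then
            return 0
        fi
    done
    return 1
}

# Succeed if the two CIDR blocks share at least one address.
cidrs_overlap() {
    set -- $(cidr_range "$1") $(cidr_range "$2")
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}
```

For example, `is_rfc1918 172.29.0.0/16 && echo private` prints `private`, and `cidrs_overlap 172.29.0.0/16 10.0.0.0/8 || echo "no overlap"` prints `no overlap`. Substitute the subnets shown in the TUI and the customer's own ranges.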

  7. Choose no when asked whether to expose the Kubernetes API servers to the cluster’s external network, and then select Ok.

    The external network is defined in the BCM base partition:

    cmsh -c "partition get base externalnetwork"
    
    Exposing the Kubernetes API servers to the cluster's external network
  8. Choose the internal network that the K8s nodes will use.

    In a DGX SuperPOD deployment, the management and compute nodes may be on different BCM networks. Choose the network used by the management nodes during this step.

    Internal Network
  9. Select at least three K8s control nodes.

    Select at least three K8s control nodes
  10. Choose the BCM node categories for the K8s worker node pool.

    BCM Node Categories
  11. On the optional “Choose individual Kubernetes worker nodes” TUI screen, do not make any selections. Select Ok to proceed to the next step.

    Choose individual Kubernetes worker nodes
  12. Choose the Etcd nodes and then select Ok.

    Choose the same three nodes as the K8s control nodes.

    Choose Etcd nodes

    Note

    Etcd is an open source, distributed, consistent key-value store for shared configuration and scheduler coordination of distributed systems or clusters. It is used as the default storage backend for Kubernetes. For more information on Etcd, see the Etcd documentation.
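The reason for using three nodes follows from how Etcd reaches consensus: a cluster of n members needs a majority (quorum) of floor(n/2) + 1 members to accept writes, so it tolerates floor((n - 1)/2) member failures. The arithmetic can be sketched as below; the function names are hypothetical and exist only for illustration.

```shell
# Number of members that must agree for an n-member Etcd cluster to make progress.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster keeps quorum.
etcd_fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}
```

With three members, `etcd_quorum 3` prints `2` and `etcd_fault_tolerance 3` prints `1`, so one node can be lost without losing the cluster. A single member tolerates no failures, and a fourth member adds no tolerance over three.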

  13. Ignore the following message if it appears.

    Ignore the following message if it appears
  14. Set the ports as shown below and do not modify the Etcd spool directory.

    Set the ports as shown below and do not modify the Etcd spool directory
  15. Choose Calico as the network plugin and then select Ok.

    Choose Calico as the network plugin
  16. Choose no when asked “Install Kyverno Policy Engine?” and then select Ok. Kyverno can be enabled at a later stage.

    Choose no for install Kyverno Policy Engine
  17. Choose the operators listed below to be installed and then select Ok.

    • Grafana Promtail, Grafana Loki, Ingress NGINX Controller

    • Kubernetes Dashboard, Kubernetes Metrics Server

    • Kubernetes State metrics

    • Prometheus Operator stack, Prometheus Adapter

    Choose the operators to be installed
  18. Choose yes when asked to expose the Ingress service over port 443 and then select Ok.

    Choose yes to expose the Ingress service over port 443
  19. Keep the default value in the next screen and then select Ok.

    Keep the default value in the next screen
  20. Keep the Ingress HTTPS port as 30443 (default value) and then select Ok.

    Keep the Ingress HTTPS port as 30443
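If the default ever has to be changed, it may help to confirm that the new value is still a valid node port. The sketch below assumes that the Ingress controller is exposed as a Kubernetes NodePort service (the source does not state this) and that the cluster uses the default Kubernetes NodePort range of 30000-32767. The function name is hypothetical.

```shell
# Succeed if the given port lies in the default Kubernetes NodePort range
# (30000-32767). Assumption: the cluster has not overridden this range.
in_default_nodeport_range() {
    [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}
```

For example, `in_default_nodeport_range 30443 && echo ok` prints `ok`, whereas port 443 is rejected.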
  21. Choose no to install BCM NVIDIA packages and then select Ok.

    Choose no to install BCM NVIDIA packages
  22. Choose no to install the BCM NVIDIA Container Toolkit package and then select Ok.

    Choose no to install the BCM NVIDIA Container Toolkit package
  23. Choose yes to install the Permissions Manager and then select Ok.

    Choose yes to install the Permissions Manager
  24. Choose Local path as the StorageClass and confirm with Enter, not Tab (a known bug affects Tab on this screen).

    If Tab is pressed by mistake, press r (retry) to return to this screen.

    Choose Local path as a StorageClass
  25. Keep the default path for the Kubernetes storage pool and then select Ok.

    Keep the default path for Kubernetes storage pool
  26. Choose yes to enable local persistent storage for Grafana and then select Ok.

    Choose yes to enable local persistent storage for Grafana
  27. Choose SimpleScalable for the Loki deployment mode and then select Ok.

    Choose **SimpleScalable** for the Loki deployment mode
  28. Select both options for Loki access and then select Ok.

    Select both options for Loki access
  29. Configure the Username and Password for Loki (basic authentication using ingress-nginx).

    Set both to loki; they can be changed later.

    Configure Username and Password for Loki
  30. Select the varlog option for Grafana Promtail and then select Ok.

    Select the varlog option for Grafana Promtail
  31. Choose save config & deploy and then select Ok.

    The deployment will begin after this step. Halfway through the deployment all nodes that are members of the K8s cluster will reboot, and the installer will wait up to 60 minutes for all nodes to come back online.

    Choose save config & deploy
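The installer performs the wait for rebooting nodes itself. To follow progress from a second terminal, a generic poll-with-timeout loop can be used. This is a minimal sketch; `retry_until` is a hypothetical helper, not a BCM command, and the probe command passed to it is up to the operator.

```shell
# retry_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Run COMMAND every INTERVAL seconds until it succeeds.
# Return 1 once more than TIMEOUT seconds have been spent waiting.
# INTERVAL must be a positive integer.
retry_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        elapsed=$(( elapsed + interval ))
        if [ "$elapsed" -gt "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, `retry_until 3600 30 kubectl get nodes` retries every 30 seconds for up to 60 minutes until the Kubernetes API answers. This assumes that `kubectl` is already on the PATH (after `module load kubernetes`); a successful answer shows only that the API is reachable, not that every node is Ready.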
  32. Validate the K8s deployment using the commands below:

    # Load the kubernetes module
    module load kubernetes
    
    # List the nodes in the cluster
    kubectl get nodes

    # List the pods in all namespaces
    kubectl get pods -A
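The output of the two commands can also be summarized automatically. The sketch below assumes the default `kubectl` column layout with `--no-headers` (STATUS is the second column for nodes and the fourth column for pods across all namespaces). The function names are hypothetical.

```shell
# Count nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
count_not_ready_nodes() {
    awk '{ split($2, s, ","); if (s[1] != "Ready") n++ } END { print n + 0 }'
}

# Count pods whose STATUS is neither "Running" nor "Completed".
# Reads `kubectl get pods -A --no-headers` output on stdin.
count_unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}
```

Usage, after `module load kubernetes`: `kubectl get nodes --no-headers | count_not_ready_nodes` and `kubectl get pods -A --no-headers | count_unhealthy_pods`. Both should print `0` on a healthy cluster; pods that are still starting are counted as unhealthy until they reach Running.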