High Availability Configuration#
Deploy the NeMo Guardrails microservice in a high-availability configuration for improved reliability and performance.
Overview#
High availability (HA) deployment provides:
- **Fault tolerance**: The service continues running even if individual pods fail.
- **Scalability**: Handle increased traffic by adding more replicas.
Basic High Availability#
Start with a simple multi-replica deployment by updating the `values.yaml` file with the following values:

```yaml
guardrails:
  replicaCount: 3
```
This configuration:
- Deploys three instances of the guardrails service.
- Shares the guardrails configuration data across all instances.
- Provides basic load balancing and failover.
Advanced High Availability#
Set up advanced high availability by configuring pod anti-affinity rules.
For more information about anti-affinity rules, including using `topology.kubernetes.io/zone`
as the topology key for greater resilience, refer to Affinity and Anti-Affinity in the Kubernetes documentation.
Pod Anti-Affinity Rules#
For improved availability, ensure pods are scheduled across different nodes:
```yaml
guardrails:
  replicaCount: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: "kubernetes.io/hostname"
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: nemo
              app.kubernetes.io/name: guardrails
```
- `requiredDuringSchedulingIgnoredDuringExecution` indicates a hard requirement. The scheduler must place each of the three pods on a different node. If it cannot find three distinct nodes that satisfy this condition, the pods remain in a pending state and are not scheduled.
- `topologyKey: "kubernetes.io/hostname"` tells the scheduler to use the node's hostname as the basis for spreading the pods. This ensures that the pods are spread across different nodes.
- `labelSelector` identifies the pods that are subject to this anti-affinity rule. In this example snippet, it applies to any pod with the labels `app.kubernetes.io/instance: nemo` and `app.kubernetes io/name: guardrails`. Change the labels to match your deployment accordingly.
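Spreading across availability zones rather than individual nodes provides greater resilience against zone-level failures. A minimal sketch of a zone-based rule, assuming your nodes carry the standard `topology.kubernetes.io/zone` label (set automatically by most managed Kubernetes providers):

```yaml
guardrails:
  replicaCount: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # Spread pods across zones instead of hostnames.
        - topologyKey: "topology.kubernetes.io/zone"
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: nemo
              app.kubernetes.io/name: guardrails
```

As with the hostname-based rule, this is a hard requirement: if the cluster spans fewer zones than `replicaCount`, some pods remain pending.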
Preferred Anti-Affinity#
For clusters with limited nodes, use preferred anti-affinity. The scheduler attempts to spread the pods across different nodes but still schedules them on a shared node when it cannot:
```yaml
guardrails:
  replicaCount: 3
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app.kubernetes.io/instance: nemo
                app.kubernetes.io/name: guardrails
```
Resource Configuration for Scaling#
Configure resource requests and limits for a multi-replica deployment, or enable the horizontal pod autoscaler to scale the NeMo Guardrails microservice automatically.
Resource Requests and Limits#
To configure appropriate resources for multiple replicas, update the `values.yaml` file with the following values:

```yaml
guardrails:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
```
Horizontal Pod Autoscaler#
To enable automatic scaling based on resource usage, update the `values.yaml` file with the following values:

```yaml
guardrails:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
```
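Note that the CPU and memory utilization targets are percentages of the pod's resource *requests*, so the autoscaler only works when `guardrails.resources.requests` is set and a metrics source such as metrics-server is running in the cluster. A combined sketch, assuming the same chart values layout as above:

```yaml
guardrails:
  resources:
    requests:
      # HPA utilization targets are computed against these requests:
      # a 70% CPU target on a 500m request scales up above ~350m per pod.
      cpu: 500m
      memory: 1Gi
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
```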
Scaling NIM Microservices#
In clusters that lack a monitoring stack such as Prometheus and the Prometheus Adapter, you can scale a NIM based only on CPU usage. For larger models, the compute load is driven primarily by GPU resources rather than CPU, so it is recommended to deploy a monitoring stack and scale based on GPU usage.
The NIM for LLMs microservices Helm chart supports autoscaling configuration based on GPU usage. There are two ways to enable autoscaling for the NIM for LLMs microservices, depending on whether you use the NeMo Deployment Management microservice or Helm-based NIM deployments.
Set Up NeMo Deployment Management Microservice While Installing NeMo Guardrails#
If you use the NeMo Deployment Management microservice, enable autoscaling of NIM for LLMs microservices in the NeMo Deployment Management microservice during the basic installation of NeMo Guardrails.
To enable autoscaling, set `deployment-management.deployments.autoscaling.enabled` to `true` in the values file.
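Expressed as a values file fragment, assuming the chart nests these keys under a `deployment-management` subchart as the dotted path suggests:

```yaml
# Hypothetical layout derived from the dotted setting path above;
# verify the key nesting against your chart's default values.
deployment-management:
  deployments:
    autoscaling:
      enabled: true
```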
After you set this up and use the NeMo Deployment Management endpoint to deploy a NIM microservice, the NeMo Deployment Management microservice applies the autoscaling configuration to the NIM for LLMs microservice.
For more information, refer to NeMo Deployment Management Setup Guide.
Set Up While Deploying NIM for LLMs Microservices#
If you deploy NIM microservices outside of the NeMo platform, enable autoscaling by setting the scaling parameters in the LLM NIM Helm chart directly.
For more information, refer to the following references: