Container Cache

This guide provides detailed information on the Container Cache component and its use in NVCF GPU clusters.

Container Cache provides a container caching solution specifically optimized for NGC. It acts as a proxy between your Kubernetes cluster and the NGC registry, caching frequently accessed container images locally to reduce network bandwidth usage, improve pull times, and speed up deployments.

Installation

Prerequisites

  • A running Kubernetes cluster with kubectl access
  • Helm >= 3.12
  • Credentials for the registry where your NVCF charts and images are stored

Step 1. Authenticate Helm to your chart registry

Authenticate Helm to your OCI registry where the NVCF charts are stored:

$ echo "${REGISTRY_PASSWORD}" | helm registry login "${REGISTRY}" \
>   --username "${REGISTRY_USERNAME}" --password-stdin

Step 2. Create the namespace and image pull secret

$ kubectl create namespace container-caching

Create an image pull secret so that pods can pull container images from your registry.

$ kubectl create secret docker-registry nvcr-creds \
>   --docker-server=nvcr.io \
>   --docker-username='$oauthtoken' \
>   --docker-password="${NGC_API_KEY}" \
>   --namespace=container-caching

The secret name nvcr-creds is referenced in the values file under images.secrets. If you use a different secret name, update the values file to match.
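
Before installing, you can spot-check that the secret exists and targets the right registry. A quick sketch, assuming only the secret name and namespace created above:

```shell
# Confirm the pull secret exists in the namespace
$ kubectl get secret nvcr-creds -n container-caching

# Decode the stored docker config to verify the server entry is nvcr.io
$ kubectl get secret nvcr-creds -n container-caching \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```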

Step 3. Create a values file

Create a values.yaml using the complete example in the Base Configuration section below.

  • BYOC users pulling directly from NGC can use the nvcr.io/nvidia/nvcf-byoc/ image paths without mirroring.
  • Self-hosted users should replace the <your-registry>/<your-repo> placeholders with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide).

Adjust storageClassName for your environment (e.g., gp3 for AWS EKS).
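
To see which storage classes your cluster actually offers before picking one:

```shell
# List available storage classes; the cluster default is marked "(default)"
$ kubectl get storageclass
```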

Step 4. Install the chart

$ helm upgrade --install container-cache \
>   nvcf-byoc/nvcf-container-cache \
>   --namespace container-caching \
>   --values values.yaml

Step 5. Verify the installation

Container Cache deploys two workloads:

  • A StatefulSet (container-cache) with the number of replicas set by replicaCount. Each replica runs two containers (nginx proxy + prometheus exporter) and provisions two PVCs (cache and proxy-cache).
  • A DaemonSet (container-cache-cc) that runs on every node and configures the container runtime (containerd) to route image pulls through the cache.

# StatefulSet replicas and DaemonSet pods should all be Running
$ kubectl get pods -n container-caching

# Verify the StatefulSet is fully ready (READY should match replicaCount)
$ kubectl get statefulset -n container-caching

# Verify the DaemonSet is running on all nodes
$ kubectl get daemonset -n container-caching

# Check services are created
$ kubectl get svc -n container-caching

# Check persistent volume claims are Bound
$ kubectl get pvc -n container-caching

Base Configuration

The following is a complete example values.yaml for deploying Container Cache. Copy this file and adjust values for your environment. Each section is explained in detail below.

  • BYOC users can use the nvcr.io/nvidia/nvcf-byoc/ paths directly.
  • Self-hosted users should replace <your-registry>/<your-repo> with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide for source image paths).

replicaCount: 3

targetHost: nvcr.io,docker.io

images:
  server: <your-registry>/<your-repo>/nvcf-container-cache:v1.1.31
  exporter: nginx/nginx-prometheus-exporter:1.0
  certificates: <your-registry>/<your-repo>/nvcf-proxy-tls-certs:1.2.0
  secrets:
    - nvcr-creds

cache:
  keyStorageSize: 50m
  maxSize: 180g
  inactive: 1d
  valid: 1h

persistentVolumeClaim:
  sizeGB: 100
  storageClassName: gp3 # Use gp3 for AWS EKS, adjust for other platforms
  sizeProxyGB: 100

service:
  type: ClusterIP
  port: 30345

metrics:
  cacheMetricsStorageSize: 300m
  throughputHistogramBuckets: 25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000

resources:
  requests:
    memory: 2Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"

traces:
  enabled: false

nucleus:
  enabled: false

vault:
  enabled: false

monitoring:
  enabled: false

Configuration Sections

Replicas

The number of Container Cache pods is controlled through the replicaCount value. Container Cache replicas operate independently and distribute requests from worker nodes.

# values.yaml

# Min Value: 1
# Recommended Value: 3
replicaCount: 3

Container Cache is designed to scale horizontally to handle increased load.
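
Because replicas operate independently, scaling out is just a values change followed by an upgrade. A sketch, reusing the release name and chart reference from the installation steps:

```shell
# Scale the cache to 5 replicas without editing values.yaml
$ helm upgrade --install container-cache \
    nvcf-byoc/nvcf-container-cache \
    --namespace container-caching \
    --values values.yaml \
    --set replicaCount=5
```

Note that --set overrides take precedence over values.yaml; for a permanent change, update replicaCount in the file instead.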

Node Selection

Container Cache pods are scheduled on nodes with appropriate labels to ensure they run on compute nodes. Adjust the node selector based on your cluster’s node labeling scheme.

# values.yaml

nodeSelector:
  nvcf.nvidia.com/workload: gpu # Adjust based on your node labels, or remove if not using node labels
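
If your nodes don't already carry the label the selector expects, you can inspect and apply it with kubectl. The node name below is a placeholder:

```shell
# Show which nodes currently carry the workload label
$ kubectl get nodes -l nvcf.nvidia.com/workload=gpu

# Label a node so Container Cache pods can schedule on it
$ kubectl label node <node-name> nvcf.nvidia.com/workload=gpu
```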

Target Hosts

The Container Cache can proxy requests to multiple container registries. The default configuration includes both NGC and Docker Hub.

# values.yaml

# Domains of the target hosts the proxy forwards requests to
targetHost: nvcr.io,docker.io

Image Configuration

Container Cache uses specific images for the server, exporter, and certificates:

# values.yaml

images:
  # Container Cache Nginx Proxy
  server: <your-registry>/<your-repo>/nvcf-container-cache:v1.1.31

  # Nginx Prometheus Exporter (public image, no mirroring required)
  exporter: nginx/nginx-prometheus-exporter:1.0

  # TLS Certificates
  certificates: <your-registry>/<your-repo>/nvcf-proxy-tls-certs:1.2.0

  # Image pull secret created in Step 2
  secrets:
    - nvcr-creds

Replace <your-registry>/<your-repo> with your registry path. BYOC users pulling directly from NGC can use nvcr.io/nvidia/nvcf-byoc as the registry/repo path.

Cache Configuration

The cache behavior is controlled through several parameters:

# values.yaml

cache:
  # Size of the storage zone for cache keys
  keyStorageSize: 50m

  # Maximum size of the cache
  maxSize: 180g

  # Period a resource can remain in the cache without being accessed
  inactive: 1d

  # Period a cached entry stays valid if the resource doesn't become inactive first
  valid: 1h

Storage Configuration

Container Cache requires persistent storage for caching container images:

# values.yaml

persistentVolumeClaim:
  # Size of persistent volume
  sizeGB: 100

  # Storage class for persistent volume claim
  storageClassName: gp3 # Use gp3 for Amazon EKS, adjust for other platforms

  # Size of persistent volume for proxy cache
  sizeProxyGB: 100

Service Configuration

The service type and port can be configured based on your access requirements:

# values.yaml

service:
  # Service type: ClusterIP, NodePort, or LoadBalancer
  type: ClusterIP

  # Port for the Container Cache service
  port: 30345
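
One way to smoke-test the service from your workstation is a port-forward. The service name container-cache and the /v2/ path (the standard container registry API root) are assumptions here; adjust to match the actual service from `kubectl get svc`:

```shell
# Forward the cache service port locally
$ kubectl port-forward -n container-caching svc/container-cache 30345:30345 &

# A registry-style endpoint should answer; -k skips verification of the
# chart's TLS certificates
$ curl -sk https://localhost:30345/v2/
```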

Metrics Configuration

Container Cache includes Prometheus metrics for monitoring cache performance:

# values.yaml

metrics:
  # Size for storing cache metrics
  cacheMetricsStorageSize: 300m

  # Bucket configuration for throughput histogram
  throughputHistogramBuckets: 25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000

Resource Requests and Limits

Resource requests and limits for the Container Cache StatefulSet pods are required. The chart will fail to install without them.

# values.yaml

resources:
  requests:
    memory: 2Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"

Adjust these values based on your cluster size and expected cache throughput. Larger deployments with high pull rates may need more memory and CPU.
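
To check whether the defaults fit your deployment, compare actual consumption against the requests (this assumes metrics-server is installed in the cluster):

```shell
# Show current CPU/memory usage of the cache pods
$ kubectl top pods -n container-caching
```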

Architecture

Container Cache Architecture

Container Cache consists of several components:

  1. Nginx Proxy Server: Handles incoming requests and serves cached content
  2. Prometheus Exporter: Provides metrics for monitoring
  3. Persistent Storage: Stores cached container images
  4. DaemonSet: Configures containerd on worker nodes to use the cache

Data Flow

  1. Initial Request: Worker node requests a container image
  2. Cache Check: Container Cache checks if image is cached locally
  3. Cache Hit: If cached, serve image directly from local storage
  4. Cache Miss: If not cached, fetch from upstream registry and cache locally
  5. Response: Return image to requesting worker node

Performance Considerations

Cache Size

Size the cache according to your workload requirements:

  • Small deployments: 50-100GB
  • Medium deployments: 100-500GB
  • Large deployments: 500GB+
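
A rough way to pick cache.maxSize is the number of distinct images you expect to keep warm, times the average compressed image size, plus headroom for layer churn. The numbers below are illustrative, not measurements:

```shell
# Hypothetical workload: 60 distinct images averaging 3 GB compressed
IMAGES=60
AVG_GB=3
RAW=$((IMAGES * AVG_GB))            # total image data in GB
WITH_HEADROOM=$((RAW * 120 / 100))  # +20% headroom for layer churn
echo "set cache.maxSize to at least ${WITH_HEADROOM}g"
```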

Storage Performance

For optimal performance, use high-performance storage:

  • AWS: Use gp3 or io1/io2 EBS volumes
  • Azure: Use Premium SSD storage
  • GCP: Use SSD persistent disks

Network Configuration

Container Cache should be deployed close to worker nodes to minimize network latency:

  • Deploy in the same availability zone as worker nodes
  • Use high-bandwidth network connections
  • Consider using dedicated network interfaces for cache traffic

Monitoring and Observability

Metrics

Container Cache provides several key metrics:

  • Cache hit ratio: Percentage of requests served from cache
  • Cache size: Current size of cached data
  • Request throughput: Number of requests per second
  • Response times: Time to serve cached vs. uncached content

Logging

Container Cache logs include:

  • Cache hit/miss events
  • Upstream registry communication
  • Error conditions and troubleshooting information
  • Performance metrics

Troubleshooting

Common Issues

Cache Not Working:

  • Verify containerd configuration on worker nodes
  • Check network connectivity to the Container Cache service
  • Ensure proper DNS resolution
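
To verify the DaemonSet actually rewired the runtime, inspect the containerd configuration on a worker node. The paths below are the conventional containerd locations and may differ in your distribution:

```shell
# On a worker node: look for a mirror/hosts entry pointing at the cache
$ grep -r "container-cache" /etc/containerd/ 2>/dev/null

# Registry host overrides, if the hosts.toml layout is used
$ ls /etc/containerd/certs.d/ 2>/dev/null
```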

Low Cache Hit Ratio:

  • Review cache size configuration
  • Check cache eviction policies
  • Monitor storage performance

Storage Issues:

  • Verify storage class availability
  • Check persistent volume claims
  • Monitor disk space usage
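
Pending or full volumes usually show up in PVC status and events:

```shell
# Any PVC stuck in Pending points at a storage class or capacity problem
$ kubectl get pvc -n container-caching

# Events explain why a claim is not binding
$ kubectl describe pvc -n container-caching
```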

Best Practices

Deployment

  • Deploy Container Cache before deploying workloads
  • Use multiple replicas for high availability
  • Monitor cache performance and adjust configuration as needed

Configuration

  • Size cache appropriately for your workload
  • Use high-performance storage for better performance
  • Configure appropriate cache eviction policies

Maintenance

  • Monitor cache hit ratios and adjust cache size as needed
  • Regularly review and clean up unused cached images
  • Update Container Cache images regularly for security patches