Container Cache

This guide provides detailed information on the Container Cache component and its use in NVCF GPU clusters.

Container Cache provides a container caching solution specifically optimized for NGC. It acts as a proxy between your Kubernetes cluster and the NGC registry, caching frequently accessed container images locally to reduce network bandwidth usage, improve pull times, and speed up deployments.

Installation

Prerequisites

  • A running Kubernetes cluster with kubectl access
  • Helm >= 3.12
  • Credentials for the registry where your NVCF charts and images are stored

Step 1. Authenticate Helm to your chart registry

Authenticate Helm to your OCI registry where the NVCF charts are stored:

$ echo "${REGISTRY_PASSWORD}" | helm registry login "${REGISTRY}" \
>   --username "${REGISTRY_USERNAME}" --password-stdin

Step 2. Create the namespace and image pull secret

$ kubectl create namespace container-caching

Create an image pull secret so that pods can pull container images from your registry.

$ kubectl create secret docker-registry nvcr-creds \
>   --docker-server=nvcr.io \
>   --docker-username='$oauthtoken' \
>   --docker-password="${NGC_API_KEY}" \
>   --namespace=container-caching

The secret name nvcr-creds is referenced in the values file under images.secrets. If you use a different secret name, update the values file to match.
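
Before installing, you can spot-check that the secret exists and targets the right registry. A quick sketch, assuming only the secret name and namespace created above:

```shell
# Confirm the pull secret exists in the namespace
$ kubectl get secret nvcr-creds -n container-caching

# Decode the stored docker config to verify the server entry is nvcr.io
$ kubectl get secret nvcr-creds -n container-caching \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```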

Step 3. Create a values file

Create a values.yaml using the complete example in the Base Configuration section below.

  • BYOC users pulling directly from NGC can use the nvcr.io/nvidia/nvcf-byoc/ image paths without mirroring.
  • Self-hosted users should replace the <your-registry>/<your-repo> placeholders with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide).

Adjust storageClassName for your environment (e.g., gp3 for AWS EKS).
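
To see which storage classes your cluster actually offers before picking one:

```shell
# List available storage classes; the cluster default is marked "(default)"
$ kubectl get storageclass
```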

Step 4. Install the chart

$ helm upgrade --install container-cache \
>   nvcf-byoc/nvcf-container-cache \
>   --namespace container-caching \
>   --values values.yaml

Step 5. Verify the installation

Container Cache deploys two workloads:

  • A StatefulSet (container-cache) with the number of replicas set by replicaCount. Each replica runs two containers (nginx proxy + prometheus exporter) and provisions two PVCs (cache and proxy-cache).
  • A DaemonSet (container-cache-cc) that runs on every node and configures the container runtime (containerd) to route image pulls through the cache.

# StatefulSet replicas and DaemonSet pods should all be Running
$ kubectl get pods -n container-caching

# Verify the StatefulSet is fully ready (READY should match replicaCount)
$ kubectl get statefulset -n container-caching

# Verify the DaemonSet is running on all nodes
$ kubectl get daemonset -n container-caching

# Check services are created
$ kubectl get svc -n container-caching

# Check persistent volume claims are Bound
$ kubectl get pvc -n container-caching

Base Configuration

The following is a complete example values.yaml for deploying Container Cache. Copy this file and adjust values for your environment. Each section is explained in detail below.

  • BYOC users can use the nvcr.io/nvidia/nvcf-byoc/ paths directly.
  • Self-hosted users should replace <your-registry>/<your-repo> with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide for source image paths).

replicaCount: 3

targetHost: nvcr.io,docker.io

images:
  server: <your-registry>/<your-repo>/nvcf-container-cache:v1.1.31
  exporter: nginx/nginx-prometheus-exporter:1.0
  certificates: <your-registry>/<your-repo>/nvcf-proxy-tls-certs:1.2.0
  secrets:
    - nvcr-creds

cache:
  keyStorageSize: 50m
  maxSize: 180g
  inactive: 1d
  valid: 1h

persistentVolumeClaim:
  sizeGB: 100
  storageClassName: gp3 # Use gp3 for AWS EKS, adjust for other platforms
  sizeProxyGB: 100

service:
  type: ClusterIP
  port: 30345

metrics:
  cacheMetricsStorageSize: 300m
  throughputHistogramBuckets: 25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000

resources:
  requests:
    memory: 2Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"

traces:
  enabled: false

nucleus:
  enabled: false

vault:
  enabled: false

monitoring:
  enabled: false

Configuration Sections

Replicas

The number of Container Cache pods is controlled through the replicaCount value. Container Cache replicas operate independently and distribute requests from worker nodes.

# values.yaml

# Min Value: 1
# Recommended Value: 3
replicaCount: 3

Container Cache is designed to scale horizontally to handle increased load.
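
Because replicas operate independently, scaling out is just a values change followed by an upgrade. A sketch, reusing the release name and chart reference from the installation steps:

```shell
# Scale the cache to 5 replicas without editing values.yaml
$ helm upgrade --install container-cache \
    nvcf-byoc/nvcf-container-cache \
    --namespace container-caching \
    --values values.yaml \
    --set replicaCount=5
```

Note that --set overrides take precedence over values.yaml; for a permanent change, update replicaCount in the file instead.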

Node Selection

Container Cache pods are scheduled on nodes with appropriate labels to ensure they run on compute nodes. Adjust the node selector based on your cluster’s node labeling scheme.

# values.yaml

nodeSelector:
  nvcf.nvidia.com/workload: gpu # Adjust based on your node labels, or remove if not using node labels
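
If your nodes don't already carry the label the selector expects, you can inspect and apply it with kubectl. The node name below is a placeholder:

```shell
# Show which nodes currently carry the workload label
$ kubectl get nodes -l nvcf.nvidia.com/workload=gpu

# Label a node so Container Cache pods can schedule on it
$ kubectl label node <node-name> nvcf.nvidia.com/workload=gpu
```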

Target Hosts

The Container Cache can proxy requests to multiple container registries. The default configuration includes both NGC and Docker Hub.

# values.yaml

# Domains of the target hosts the proxy forwards requests to
targetHost: nvcr.io,docker.io

Image Configuration

Container Cache uses specific images for the server, exporter, and certificates:

# values.yaml

images:
  # Container Cache Nginx Proxy
  server: <your-registry>/<your-repo>/nvcf-container-cache:v1.1.31

  # Nginx Prometheus Exporter (public image, no mirroring required)
  exporter: nginx/nginx-prometheus-exporter:1.0

  # TLS Certificates
  certificates: <your-registry>/<your-repo>/nvcf-proxy-tls-certs:1.2.0

  # Image pull secret created in Step 2
  secrets:
    - nvcr-creds

Replace <your-registry>/<your-repo> with your registry path. BYOC users pulling directly from NGC can use nvcr.io/nvidia/nvcf-byoc as the registry/repo path.

Cache Configuration

The cache behavior is controlled through several parameters:

# values.yaml

cache:
  # Size of the storage zone for cache keys
  keyStorageSize: 50m

  # Maximum size of the cache
  maxSize: 180g

  # Period a resource can remain in the cache without being accessed
  inactive: 1d

  # Period a cached entry stays valid if the resource doesn't become inactive first
  valid: 1h

Storage Configuration

Container Cache requires persistent storage for caching container images:

# values.yaml

persistentVolumeClaim:
  # Size of persistent volume
  sizeGB: 100

  # Storage class for persistent volume claim
  storageClassName: gp3 # Use gp3 for Amazon EKS, adjust for other platforms

  # Size of persistent volume for proxy cache
  sizeProxyGB: 100

Service Configuration

The service type and port can be configured based on your access requirements:

# values.yaml

service:
  # Service type: ClusterIP, NodePort, or LoadBalancer
  type: ClusterIP

  # Port for the Container Cache service
  port: 30345
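
One way to smoke-test the service from your workstation is a port-forward. The service name container-cache and the /v2/ path (the standard container registry API root) are assumptions here; adjust to match the actual service from `kubectl get svc`:

```shell
# Forward the cache service port locally
$ kubectl port-forward -n container-caching svc/container-cache 30345:30345 &

# A registry-style endpoint should answer; -k skips verification of the
# chart's TLS certificates
$ curl -sk https://localhost:30345/v2/
```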

Metrics Configuration

Container Cache includes Prometheus metrics for monitoring cache performance:

# values.yaml

metrics:
  # Size for storing cache metrics
  cacheMetricsStorageSize: 300m

  # Bucket configuration for throughput histogram
  throughputHistogramBuckets: 25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000

Resource Requests and Limits

Resource requests and limits for the Container Cache StatefulSet pods are required. The chart will fail to install without them.

# values.yaml

resources:
  requests:
    memory: 2Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"

Adjust these values based on your cluster size and expected cache throughput. Larger deployments with high pull rates may need more memory and CPU.
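
To check whether the defaults fit your deployment, compare actual consumption against the requests (this assumes metrics-server is installed in the cluster):

```shell
# Show current CPU/memory usage of the cache pods
$ kubectl top pods -n container-caching
```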

Architecture

Container Cache Architecture

Container Cache consists of several components:

  1. Nginx Proxy Server: Handles incoming requests and serves cached content
  2. Prometheus Exporter: Provides metrics for monitoring
  3. Persistent Storage: Stores cached container images
  4. DaemonSet: Configures containerd on worker nodes to use the cache

Data Flow

  1. Initial Request: Worker node requests a container image
  2. Cache Check: Container Cache checks if image is cached locally
  3. Cache Hit: If cached, serve image directly from local storage
  4. Cache Miss: If not cached, fetch from upstream registry and cache locally
  5. Response: Return image to requesting worker node

Performance Considerations

Cache Size

Size the cache according to your workload requirements:

  • Small deployments: 50-100GB
  • Medium deployments: 100-500GB
  • Large deployments: 500GB+
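
A rough way to pick cache.maxSize is the number of distinct images you expect to keep warm, times the average compressed image size, plus headroom for layer churn. The numbers below are illustrative, not measurements:

```shell
# Hypothetical workload: 60 distinct images averaging 3 GB compressed
IMAGES=60
AVG_GB=3
RAW=$((IMAGES * AVG_GB))            # total image data in GB
WITH_HEADROOM=$((RAW * 120 / 100))  # +20% headroom for layer churn
echo "set cache.maxSize to at least ${WITH_HEADROOM}g"
```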

Storage Performance

For optimal performance, use high-performance storage:

  • AWS: Use gp3 or io1/io2 EBS volumes
  • Azure: Use Premium SSD storage
  • GCP: Use SSD persistent disks

Network Configuration

Container Cache should be deployed close to worker nodes to minimize network latency:

  • Deploy in the same availability zone as worker nodes
  • Use high-bandwidth network connections
  • Consider using dedicated network interfaces for cache traffic

Monitoring and Observability

Metrics

Container Cache provides several key metrics:

  • Cache hit ratio: Percentage of requests served from cache
  • Cache size: Current size of cached data
  • Request throughput: Number of requests per second
  • Response times: Time to serve cached vs. uncached content

Logging

Container Cache logs include:

  • Cache hit/miss events
  • Upstream registry communication
  • Error conditions and troubleshooting information
  • Performance metrics

Troubleshooting

Common Issues

Cache Not Working:

  • Verify containerd configuration on worker nodes
  • Check network connectivity to the Container Cache service
  • Ensure proper DNS resolution
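
To verify the DaemonSet actually rewired the runtime, inspect the containerd configuration on a worker node. The paths below are the conventional containerd locations and may differ in your distribution:

```shell
# On a worker node: look for a mirror/hosts entry pointing at the cache
$ grep -r "container-cache" /etc/containerd/ 2>/dev/null

# Registry host overrides, if the hosts.toml layout is used
$ ls /etc/containerd/certs.d/ 2>/dev/null
```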

Low Cache Hit Ratio:

  • Review cache size configuration
  • Check cache eviction policies
  • Monitor storage performance

Storage Issues:

  • Verify storage class availability
  • Check persistent volume claims
  • Monitor disk space usage
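
Pending or full volumes usually show up in PVC status and events:

```shell
# Any PVC stuck in Pending points at a storage class or capacity problem
$ kubectl get pvc -n container-caching

# Events explain why a claim is not binding
$ kubectl describe pvc -n container-caching
```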

Best Practices

Deployment

  • Deploy Container Cache before deploying workloads
  • Use multiple replicas for high availability
  • Monitor cache performance and adjust configuration as needed

Configuration

  • Size cache appropriately for your workload
  • Use high-performance storage for better performance
  • Configure appropriate cache eviction policies

Maintenance

  • Monitor cache hit ratios and adjust cache size as needed
  • Regularly review and clean up unused cached images
  • Update Container Cache images regularly for security patches