Container Cache#
This guide provides detailed information on the Container Cache component and its use in NVCF GPU clusters.
Container Cache is a container caching solution optimized for NGC. It acts as a proxy between your Kubernetes cluster and the NGC registry, caching frequently accessed container images locally to reduce network bandwidth usage and to improve pull times and deployment speed.
Installation#
Prerequisites#
A running Kubernetes cluster with kubectl access

Helm >= 3.12

Credentials for the registry where your NVCF charts and images are stored
Step 1. Authenticate Helm to your chart registry#
Authenticate Helm to your OCI registry where the NVCF charts are stored:
echo "${REGISTRY_PASSWORD}" | helm registry login ${REGISTRY} \
  --username "${REGISTRY_USERNAME}" --password-stdin
Step 2. Create the namespace and image pull secret#
kubectl create namespace container-caching
Create an image pull secret so that pods can pull container images from your registry. Use the variant that matches where your images are hosted.

NGC (nvcr.io):

kubectl create secret docker-registry nvcr-creds \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}" \
  --namespace=container-caching

Amazon ECR:

kubectl create secret docker-registry nvcr-creds \
  --docker-server=<account-id>.dkr.ecr.<region>.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region <region>)" \
  --namespace=container-caching

Other registries:

kubectl create secret docker-registry nvcr-creds \
  --docker-server=<your-registry> \
  --docker-username=<your-username> \
  --docker-password=<your-password> \
  --namespace=container-caching
Note
The secret name nvcr-creds is referenced in the values file under images.secrets.
If you use a different secret name, update the values file to match.
Step 3. Create a values file#
Create a values.yaml using the complete example in Base Configuration below.
BYOC users pulling directly from NGC can use the nvcr.io/nvidia/nvcf-byoc/ image paths without mirroring.

Self-hosted users should replace the <your-registry>/<your-repo> placeholders with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide).
Adjust storageClassName for your environment (e.g., gp3 for AWS EKS).
Step 4. Install the chart#
BYOC (charts pulled from NGC):

helm upgrade --install container-cache \
  nvcf-byoc/nvcf-container-cache \
  --namespace container-caching \
  --values values.yaml

Self-hosted (replace <your-registry>/<your-repo> with your mirrored registry path):

helm upgrade --install container-cache \
  oci://<your-registry>/<your-repo>/nvcf-container-cache \
  --version 0.25.6 \
  --namespace container-caching \
  --values values.yaml
Step 5. Verify the installation#
Container Cache deploys two workloads:

A StatefulSet (container-cache) with the number of replicas set by replicaCount. Each replica runs two containers (nginx proxy + prometheus exporter) and provisions two PVCs (cache and proxy-cache).

A DaemonSet (container-cache-cc) that runs on every node and configures the container runtime (containerd) to route image pulls through the cache.
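For reference, a containerd registry-mirror entry of the kind the DaemonSet installs looks roughly like the following hosts.toml. This is an illustrative sketch only: the file path, the in-cluster service DNS name, and the exact contents written by container-cache-cc are assumptions here and may differ in your cluster.

```toml
# Illustrative /etc/containerd/certs.d/nvcr.io/hosts.toml (assumed path)
server = "https://nvcr.io"

# Route pulls through the in-cluster cache; containerd falls back to
# nvcr.io directly if the mirror is unavailable.
[host."https://container-cache.container-caching.svc.cluster.local:30345"]
  capabilities = ["pull", "resolve"]
```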
# StatefulSet replicas and DaemonSet pods should all be Running
kubectl get pods -n container-caching
# Verify the StatefulSet is fully ready (READY should match replicaCount)
kubectl get statefulset -n container-caching
# Verify the DaemonSet is running on all nodes
kubectl get daemonset -n container-caching
# Check services are created
kubectl get svc -n container-caching
# Check persistent volume claims are Bound
kubectl get pvc -n container-caching
Base Configuration#
The following is a complete example values.yaml for deploying Container Cache.
Copy this file and adjust values for your environment. Each section is explained in detail
below.
BYOC users can use the nvcr.io/nvidia/nvcf-byoc/ paths directly.

Self-hosted users should replace <your-registry>/<your-repo> with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide for source image paths).
replicaCount: 3
targetHost: nvcr.io,docker.io
images:
  server: <your-registry>/<your-repo>/nvcf-container-cache:v1.1.31
  exporter: nginx/nginx-prometheus-exporter:1.0
  certificates: <your-registry>/<your-repo>/nvcf-proxy-tls-certs:1.2.0
  secrets:
    - nvcr-creds
cache:
  keyStorageSize: 50m
  maxSize: 180g
  inactive: 1d
  valid: 1h
persistentVolumeClaim:
  sizeGB: 100
  storageClassName: gp3  # Use gp3 for AWS EKS, adjust for other platforms
  sizeProxyGB: 100
service:
  type: ClusterIP
  port: 30345
metrics:
  cacheMetricsStorageSize: 300m
  throughputHistogramBuckets: 25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000
resources:
  requests:
    memory: 2Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"
traces:
  enabled: false
nucleus:
  enabled: false
vault:
  enabled: false
monitoring:
  enabled: false
Configuration Sections#
Replicas#
The number of Container Cache pods is controlled by the replicaCount value. Replicas operate independently, and requests from worker nodes are distributed across them.
# values.yaml
# Min Value: 1
# Recommended Value: 3
replicaCount: 3
Important
Container Cache is designed to scale horizontally to handle increased load.
Node Selection#
Container Cache pods are scheduled on nodes with appropriate labels to ensure they run on compute nodes. Adjust the node selector based on your cluster’s node labeling scheme.
# values.yaml
nodeSelector:
  nvcf.nvidia.com/workload: gpu  # Adjust based on your node labels, or remove if not using node labels
Target Hosts#
The Container Cache can proxy requests to multiple container registries. The default configuration includes both NGC and Docker Hub.
# values.yaml
# Domain of target Host where the proxy passes the request to
targetHost: nvcr.io,docker.io
Image Configuration#
Container Cache uses specific images for the server, exporter, and certificates:
# values.yaml
images:
  # Container Cache Nginx Proxy
  server: <your-registry>/<your-repo>/nvcf-container-cache:v1.1.31
  # Nginx Prometheus Exporter (public image, no mirroring required)
  exporter: nginx/nginx-prometheus-exporter:1.0
  # TLS Certificates
  certificates: <your-registry>/<your-repo>/nvcf-proxy-tls-certs:1.2.0
  # Image pull secret created in Step 2
  secrets:
    - nvcr-creds
Replace <your-registry>/<your-repo> with your registry path. BYOC users pulling
directly from NGC can use nvcr.io/nvidia/nvcf-byoc as the registry/repo path.
Cache Configuration#
The cache behavior is controlled through several parameters:
# values.yaml
cache:
  # Size for storing cache keys
  keyStorageSize: 50m
  # Maximum size of the cache
  maxSize: 180g
  # Period a resource can remain in cache without being accessed
  inactive: 1d
  # Period a cached resource is considered valid if it doesn't become inactive first
  valid: 1h
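Because the cache server is an nginx proxy, these values correspond to nginx's proxy_cache directives. The rendering below is an illustrative sketch of that mapping, not the chart's actual generated configuration; the cache path is an assumption.

```nginx
# Illustrative mapping of cache.* values to nginx directives
proxy_cache_path /var/cache/nginx/cache    # assumed mount path
    levels=1:2
    keys_zone=registry_cache:50m           # cache.keyStorageSize
    max_size=180g                          # cache.maxSize
    inactive=1d;                           # cache.inactive
proxy_cache_valid 200 1h;                  # cache.valid
```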
Storage Configuration#
Container Cache requires persistent storage for caching container images:
# values.yaml
persistentVolumeClaim:
  # Size of persistent volume
  sizeGB: 100
  # Storage class for persistent volume claim
  storageClassName: gp3  # Use gp3 for Amazon EKS, adjust for other platforms
  # Size of persistent volume for proxy cache
  sizeProxyGB: 100
Service Configuration#
The service type and port can be configured based on your access requirements:
# values.yaml
service:
  # Service type: ClusterIP, NodePort, or LoadBalancer
  type: ClusterIP
  # Port for the Container Cache service
  port: 30345
Metrics Configuration#
Container Cache includes Prometheus metrics for monitoring cache performance:
# values.yaml
metrics:
  # Size for storing cache metrics
  cacheMetricsStorageSize: 300m
  # Bucket configuration for throughput histogram
  throughputHistogramBuckets: 25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000
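The bucket boundaries are raw numbers; assuming they represent bytes per second (an assumption, since the unit is not documented here), they span roughly 25-100 MB/s. A quick conversion for readability:

```python
# Convert the configured histogram bucket boundaries into MB/s,
# assuming the raw values are bytes per second (SI units).
buckets = "25000000, 30000000, 35000000, 40000000, 50000000, 60000000, 80000000, 100000000"
mb_per_s = [int(b.strip()) / 1_000_000 for b in buckets.split(",")]
print(mb_per_s)  # [25.0, 30.0, 35.0, 40.0, 50.0, 60.0, 80.0, 100.0]
```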
Resource Requests and Limits#
Resource requests and limits for the Container Cache StatefulSet pods are required. The chart will fail to install without them.
# values.yaml
resources:
  requests:
    memory: 2Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"
Important
Adjust these values based on your cluster size and expected cache throughput. Larger deployments with high pull rates may need more memory and CPU.
Architecture#
Container Cache Architecture#
Container Cache consists of several components:
Nginx Proxy Server: Handles incoming requests and serves cached content
Prometheus Exporter: Provides metrics for monitoring
Persistent Storage: Stores cached container images
DaemonSet: Configures containerd on worker nodes to use the cache
Data Flow#
Initial Request: Worker node requests a container image
Cache Check: Container Cache checks if image is cached locally
Cache Hit: If cached, serve image directly from local storage
Cache Miss: If not cached, fetch from upstream registry and cache locally
Response: Return image to requesting worker node
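The flow above can be sketched as a toy cache model. This is hypothetical Python just to illustrate the hit/miss logic; the real cache is an nginx proxy handling OCI registry requests, not a Python service.

```python
class ImageCache:
    """Toy model of the data flow: check local store, serve on hit,
    fetch upstream and store on miss."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def pull(self, ref, fetch_upstream):
        if ref in self.store:      # cache hit: serve from local storage
            self.hits += 1
            return self.store[ref]
        self.misses += 1           # cache miss: fetch upstream, then cache
        blob = fetch_upstream(ref)
        self.store[ref] = blob
        return blob

cache = ImageCache()
fetch = lambda ref: f"layers-of-{ref}"   # stand-in for the upstream registry
cache.pull("nvcr.io/nvidia/cuda:12.4", fetch)  # first pull: miss
cache.pull("nvcr.io/nvidia/cuda:12.4", fetch)  # second pull: hit
```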
Performance Considerations#
Cache Size#
Size the cache based on your workload requirements:
Small deployments: 50-100GB
Medium deployments: 100-500GB
Large deployments: 500GB+
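As a rough sizing aid, the cache needs to hold the footprint of your distinct images plus some headroom for layer churn. The helper below is a hypothetical sketch; the 30% headroom factor is an assumption, not an official recommendation.

```python
import math

def recommended_cache_gb(num_images: int, avg_image_gb: float,
                         headroom: float = 1.3) -> int:
    """Estimate cache size in GB: distinct image footprint plus
    headroom (1.3 = ~30% extra for layer churn; assumed factor)."""
    return math.ceil(num_images * avg_image_gb * headroom)

print(recommended_cache_gb(15, 5))  # 98  -> small deployment tier
print(recommended_cache_gb(80, 5))  # 520 -> large deployment tier
```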
Storage Performance#
For optimal performance, use high-performance storage:
AWS: Use gp3 or io1/io2 EBS volumes
Azure: Use Premium SSD storage
GCP: Use SSD persistent disks
Network Configuration#
Container Cache should be deployed close to worker nodes to minimize network latency:
Deploy in the same availability zone as worker nodes
Use high-bandwidth network connections
Consider using dedicated network interfaces for cache traffic
Monitoring and Observability#
Metrics#
Container Cache provides several key metrics:
Cache hit ratio: Percentage of requests served from cache
Cache size: Current size of cached data
Request throughput: Number of requests per second
Response times: Time to serve cached vs. uncached content
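For example, the cache hit ratio is simply hits over total requests; a sustained ratio well below ~0.8 (a rule-of-thumb threshold, not an official target) usually indicates the cache is undersized or evicting too aggressively.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of image requests served from the local cache."""
    total = hits + misses
    return hits / total if total else 0.0

print(cache_hit_ratio(450, 50))  # 0.9
```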
Logging#
Container Cache logs include:
Cache hit/miss events
Upstream registry communication
Error conditions and troubleshooting information
Performance metrics
Troubleshooting#
Common Issues#
Cache Not Working

Verify containerd configuration on worker nodes
Check network connectivity to the Container Cache service
Ensure proper DNS resolution

Low Cache Hit Ratio

Review cache size configuration
Check cache eviction policies
Monitor storage performance

Storage Issues

Verify storage class availability
Check persistent volume claims
Monitor disk space usage
Best Practices#
Deployment#
Deploy Container Cache before deploying workloads
Use multiple replicas for high availability
Monitor cache performance and adjust configuration as needed
Configuration#
Size cache appropriately for your workload
Use high-performance storage for better performance
Configure appropriate cache eviction policies
Maintenance#
Monitor cache hit ratios and adjust cache size as needed
Regularly review and clean up unused cached images
Update Container Cache images regularly for security patches