Container Cache
This guide provides detailed information on the Container Cache component and its use in NVCF GPU clusters.
Container Cache is a container caching solution optimized for NGC. Acting as a pull-through proxy between your Kubernetes cluster and the NGC registry, it caches frequently accessed container images locally, reducing network bandwidth usage and improving pull and deployment times for frequently accessed images.
Installation
Prerequisites
- A running Kubernetes cluster with `kubectl` access
- Helm >= 3.12
- Credentials for the registry where your NVCF charts and images are stored
Step 1. Authenticate Helm to your chart registry
Authenticate Helm to your OCI registry where the NVCF charts are stored:
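As an illustration, for charts hosted on NGC the login looks like the following (the registry host and credentials depend on your setup; for NGC the username is the literal string `$oauthtoken` and the password is your NGC API key):

```shell
# Log in to the OCI registry that hosts the NVCF charts.
helm registry login nvcr.io --username '$oauthtoken' --password "$NGC_API_KEY"
```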
Step 2. Create the namespace and image pull secret
Create an image pull secret so that pods can pull container images from your registry.
The exact command depends on where your images are hosted: BYOC / NGC, Amazon ECR, or another registry.
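For example, for NGC (BYOC) the namespace and secret can be created as follows. The namespace name `container-cache` is an assumption; adjust it to match your deployment:

```shell
kubectl create namespace container-cache

# Create the pull secret referenced as images.secrets in the values file.
kubectl create secret docker-registry nvcr-creds \
  --namespace container-cache \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
```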
The secret name `nvcr-creds` is referenced in the values file under `images.secrets`. If you use a different secret name, update the values file to match.
Step 3. Create a values file
Create a `values.yaml` using the complete example in the Base Configuration section below.
- BYOC users pulling directly from NGC can use the `nvcr.io/nvidia/nvcf-byoc/` image paths without mirroring.
- Self-hosted users should replace the `<your-registry>/<your-repo>` placeholders with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide).

Adjust `storageClassName` for your environment (e.g., `gp3` for AWS EKS).
Step 4. Install the chart
Install the chart from either the NGC Helm repo (BYOC) or your own OCI registry (self-hosted).
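A sketch of both variants is shown below; the chart name, version, and repo URL are illustrative assumptions, so substitute the values from your NVCF distribution:

```shell
# BYOC: install from the NGC Helm repo (chart URL and version are illustrative).
helm upgrade --install container-cache \
  https://helm.ngc.nvidia.com/nvidia/nvcf-byoc/charts/container-cache-1.0.0.tgz \
  --namespace container-cache \
  --values values.yaml

# Self-hosted: install from your mirrored OCI registry.
helm upgrade --install container-cache \
  oci://<your-registry>/<your-repo>/container-cache \
  --namespace container-cache \
  --values values.yaml
```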
Step 5. Verify the installation
Container Cache deploys two workloads:
- A StatefulSet (`container-cache`) with the number of replicas set by `replicaCount`. Each replica runs two containers (nginx proxy + prometheus exporter) and provisions two PVCs (`cache` and `proxy-cache`).
- A DaemonSet (`container-cache-cc`) that runs on every node and configures the container runtime (containerd) to route image pulls through the cache.
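These workloads can be checked with standard kubectl commands (the namespace `container-cache` is an assumption; use the namespace you installed into):

```shell
# StatefulSet: READY should show the configured replicaCount.
kubectl get statefulset container-cache -n container-cache

# DaemonSet: DESIRED/READY should match the number of nodes.
kubectl get daemonset container-cache-cc -n container-cache

# The cache and proxy-cache PVCs for each replica should be Bound.
kubectl get pvc -n container-cache
```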
Base Configuration
The following is a complete example `values.yaml` for deploying Container Cache. Copy this file and adjust values for your environment. Each section is explained in detail below.
- BYOC users can use the `nvcr.io/nvidia/nvcf-byoc/` paths directly.
- Self-hosted users should replace `<your-registry>/<your-repo>` with their mirrored registry path (see the Artifact Manifest in the self-hosted installation guide for source image paths).
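Because key names vary by chart version, the following is only a sketch of the overall shape of such a values file. Treat every key as an assumption and consult the `values.yaml` packaged with your chart for the authoritative schema:

```yaml
# Illustrative only -- key names are assumptions, not the chart's authoritative schema.
replicaCount: 2

images:
  secrets:
    - nvcr-creds            # image pull secret created in Step 2

targetHosts:
  - nvcr.io                 # NGC
  - registry-1.docker.io    # Docker Hub

storage:
  storageClassName: gp3     # adjust for your environment
  size: 100Gi

service:
  type: ClusterIP
  port: 443

metrics:
  enabled: true

resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```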
Configuration Sections
Replicas
The number of Container Cache pods is controlled through the `replicaCount` value. Container Cache replicas operate independently and distribute requests from worker nodes.
Container Cache is designed to scale horizontally to handle increased load.
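For instance, to run two independent replicas:

```yaml
# Worker nodes distribute image pulls across these replicas.
replicaCount: 2
```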
Node Selection
Container Cache pods are scheduled on nodes with appropriate labels to ensure they run on compute nodes. Adjust the node selector based on your cluster’s node labeling scheme.
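A sketch, assuming a `nodeSelector` key in the values file; the label itself is illustrative and should match your cluster's labeling scheme:

```yaml
nodeSelector:
  node-role.kubernetes.io/worker: ""   # illustrative label; adjust for your cluster
```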
Target Hosts
The Container Cache can proxy requests to multiple container registries. The default configuration includes both NGC and Docker Hub.
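A sketch of such a configuration, assuming a `targetHosts` key:

```yaml
targetHosts:
  - nvcr.io                 # NGC
  - registry-1.docker.io    # Docker Hub
```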
Image Configuration
Container Cache uses specific images for the server, exporter, and certificates:
Replace `<your-registry>/<your-repo>` with your registry path. BYOC users pulling directly from NGC can use `nvcr.io/nvidia/nvcf-byoc` as the registry/repo path.
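A sketch of what this section of the values file might look like; the key structure, image names, and tags are assumptions:

```yaml
images:
  server:
    repository: <your-registry>/<your-repo>/container-cache-server
    tag: "1.0.0"             # illustrative tag
  exporter:
    repository: <your-registry>/<your-repo>/nginx-prometheus-exporter
    tag: "1.0.0"             # illustrative tag
  secrets:
    - nvcr-creds             # pull secret created in Step 2
```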
Cache Configuration
The cache behavior is controlled through several parameters:
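The exact keys depend on the chart; a sketch of typical nginx proxy-cache style parameters, with all key names assumed:

```yaml
cache:
  maxSize: 100Gi        # upper bound before least-recently-used eviction (assumed key)
  inactiveTimeout: 7d   # evict entries not accessed within this window (assumed key)
```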
Storage Configuration
Container Cache requires persistent storage for caching container images:
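A sketch of the storage settings, with assumed key names:

```yaml
storage:
  storageClassName: gp3   # e.g. gp3 on AWS EKS; adjust for your environment
  size: 100Gi             # per-PVC capacity
```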
Service Configuration
The service type and port can be configured based on your access requirements:
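For example, with assumed key names:

```yaml
service:
  type: ClusterIP   # or NodePort / LoadBalancer, depending on access requirements
  port: 443
```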
Metrics Configuration
Container Cache includes Prometheus metrics for monitoring cache performance:
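A sketch of the metrics settings; key names and the `serviceMonitor` toggle are assumptions:

```yaml
metrics:
  enabled: true
  serviceMonitor:
    enabled: false    # enable only if the Prometheus Operator is installed
```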
Resource Requests and Limits
Resource requests and limits for the Container Cache StatefulSet pods are required. The chart will fail to install without them.
Adjust these values based on your cluster size and expected cache throughput. Larger deployments with high pull rates may need more memory and CPU.
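A starting point, with illustrative values:

```yaml
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```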
Architecture
Container Cache Architecture
Container Cache consists of several components:
- Nginx Proxy Server: Handles incoming requests and serves cached content
- Prometheus Exporter: Provides metrics for monitoring
- Persistent Storage: Stores cached container images
- DaemonSet: Configures containerd on worker nodes to use the cache
Data Flow
- Initial Request: Worker node requests a container image
- Cache Check: Container Cache checks if image is cached locally
- Cache Hit: If cached, serve image directly from local storage
- Cache Miss: If not cached, fetch from upstream registry and cache locally
- Response: Return image to requesting worker node
Performance Considerations
Cache Size
Size the cache based on your workload requirements:
- Small deployments: 50-100GB
- Medium deployments: 100-500GB
- Large deployments: 500GB+
Storage Performance
For optimal performance, use high-performance storage:
- AWS: Use gp3 or io1/io2 EBS volumes
- Azure: Use Premium SSD storage
- GCP: Use SSD persistent disks
Network Configuration
Container Cache should be deployed close to worker nodes to minimize network latency:
- Deploy in the same availability zone as worker nodes
- Use high-bandwidth network connections
- Consider using dedicated network interfaces for cache traffic
Monitoring and Observability
Metrics
Container Cache provides several key metrics:
- Cache hit ratio: Percentage of requests served from cache
- Cache size: Current size of cached data
- Request throughput: Number of requests per second
- Response times: Time to serve cached vs. uncached content
Logging
Container Cache logs include:
- Cache hit/miss events
- Upstream registry communication
- Error conditions and troubleshooting information
- Performance metrics
Troubleshooting
Common Issues
Cache Not Working:
- Verify containerd configuration on worker nodes
- Check network connectivity to Container Cache service
- Ensure proper DNS resolution
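A few commands that help with these checks; the DaemonSet name `container-cache-cc` comes from this guide, while the namespace and service DNS name are assumptions to adjust for your deployment:

```shell
# Confirm the DaemonSet that configures containerd is running on every node.
kubectl get daemonset container-cache-cc -n container-cache

# Inspect its logs for containerd configuration errors.
kubectl logs daemonset/container-cache-cc -n container-cache

# Check DNS resolution and connectivity to the cache service from inside the cluster.
kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
  nslookup container-cache.container-cache.svc
```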
Low Cache Hit Ratio:
- Review cache size configuration
- Check cache eviction policies
- Monitor storage performance
Storage Issues:
- Verify storage class availability
- Check persistent volume claims
- Monitor disk space usage
Best Practices
Deployment
- Deploy Container Cache before deploying workloads
- Use multiple replicas for high availability
- Monitor cache performance and adjust configuration as needed
Configuration
- Size cache appropriately for your workload
- Use high-performance storage for better performance
- Configure appropriate cache eviction policies
Maintenance
- Monitor cache hit ratios and adjust cache size as needed
- Regularly review and clean up unused cached images
- Update Container Cache images regularly for security patches