Workload Isolation
In a multi-tenant Kubernetes deployment, end users should be able to request a dedicated Kubernetes control plane instance (“Kubernetes as a Service”). Kubernetes cluster instances that provide tenants with API access must not be shared between tenants. However, in many deployment scenarios the physical worker hosts running the control plane node virtual machines can be shared across tenants, which improves utilization of the underlying physical servers.
A worker node is a virtual machine (VM) that runs on a physical worker host. The hypervisor that hosts the VMs should isolate them from one another and protect against active and passive attacks, including resource contention attacks, memory leaks, CPU cache line leaks, and more. NVIDIA recommends configuring protective measures in the hypervisor to safeguard against hostile or noisy neighbors interfering with physical resources such as GPU, CPU, memory, and network, as well as logical resources such as storage and IP space.
Sharing worker hosts between tenants is a security tradeoff between resource utilization and workload isolation. The orchestration system should be able to assign worker hosts to pools that are dedicated to a specific tenant, or to pools where more than one tenant’s virtual machine worker nodes can be securely deployed on a shared physical server.
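As an illustration of dedicating worker nodes to a single tenant, the following is a minimal sketch using the Python kubernetes client. The label and taint key (pool.example.com/tenant), the tenant name, and the node name are hypothetical; the actual mechanism depends on the orchestration system in use.

```python
from kubernetes import client, config

# Minimal sketch: dedicate a worker node to one tenant by labeling and
# tainting it. The key "pool.example.com/tenant", the tenant name, and the
# node name are illustrative assumptions.
config.load_kube_config()
v1 = client.CoreV1Api()

tenant = "tenant-a"
node_name = "worker-host-01"

patch = {
    "metadata": {"labels": {"pool.example.com/tenant": tenant}},
    "spec": {
        # NoSchedule keeps other tenants' pods off this node unless they
        # carry a matching toleration. Note that this patch replaces any
        # taints already present on the node.
        "taints": [
            {"key": "pool.example.com/tenant", "value": tenant, "effect": "NoSchedule"}
        ]
    },
}
v1.patch_node(node_name, patch)
```

A dedicated node pool built this way keeps other tenants' workloads off the node at the scheduler level; it does not by itself provide the hypervisor-level isolation discussed above.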
A worker node (the worker virtual machine instantiated on a physical worker host) should be assigned to one and only one tenant, and thus also be exclusively assigned to a single Kubernetes control plane instance.
When the VM is assigned as a worker node to a Kubernetes control plane instance, the control plane may schedule one or many workloads within the VM at a time. Workload isolation within a VM, as provided by runc or similar process-based mechanisms, may be weaker than the isolation the hypervisor provides between VMs. Operators and tenants should consider these differences in user-to-user isolation when scheduling workloads within a VM. The Kubernetes control plane should support the tenant’s choice of scheduling one or many workloads per worker node VM.
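As one possible way to honor a tenant’s “one workload per worker node VM” preference, the sketch below builds a pod manifest that selects only that tenant’s dedicated nodes, tolerates the hypothetical tenant taint from the previous example, and uses required pod anti-affinity so that no two pods carrying the same workload label land on the same node. The label keys, namespace, and image are illustrative assumptions, not prescribed values.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

tenant = "tenant-a"

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "tenant-a-workload",
        "labels": {"app.example.com/workload": "tenant-a-workload"},
    },
    "spec": {
        # Only schedule onto nodes dedicated to this tenant ...
        "nodeSelector": {"pool.example.com/tenant": tenant},
        # ... and tolerate the taint that keeps other tenants off them.
        "tolerations": [
            {"key": "pool.example.com/tenant", "operator": "Equal",
             "value": tenant, "effect": "NoSchedule"}
        ],
        # Required anti-affinity: no two pods with this label may share a
        # node, approximating "one workload per worker node VM".
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {
                            "matchLabels": {"app.example.com/workload": "tenant-a-workload"}
                        },
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
        "containers": [
            {"name": "main", "image": "registry.example.com/tenant-a/app:latest"}
        ],
    },
}
v1.create_namespaced_pod(namespace="tenant-a", body=pod)
```

If the tenant instead prefers to pack many workloads onto each worker node VM, the anti-affinity block can simply be omitted; the nodeSelector and toleration alone keep the workloads on that tenant’s dedicated pool.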
The worker nodes on a host should be actively monitored for resource contention, abuse, and attack patterns. A worker VM that exhibits unexpected behavior should be investigated and acted upon appropriately.
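A minimal sketch of one form of such monitoring, assuming the Python kubernetes client, is shown below: it polls the node conditions reported by the kubelet and flags nodes under memory, disk, or PID pressure. Real deployments would typically feed richer host, hypervisor, and GPU telemetry into a dedicated monitoring and alerting stack; the response logic here is illustrative only.

```python
from kubernetes import client, config

# Illustrative only: flag worker nodes whose kubelet reports resource
# pressure. Production monitoring would use dedicated telemetry pipelines.
config.load_kube_config()
v1 = client.CoreV1Api()

PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure"}

for node in v1.list_node().items:
    problems = [
        cond.type
        for cond in (node.status.conditions or [])
        if cond.type in PRESSURE_CONDITIONS and cond.status == "True"
    ]
    if problems:
        # In a real system this would raise an alert or trigger an
        # automated response such as cordoning and draining the node.
        print(f"node {node.metadata.name} under pressure: {', '.join(problems)}")
```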