
Protecting AI Models: Confidential Inferencing on Kubernetes 1.35

You’ve spent months fine-tuning your Large Language Model (LLM). The weights represent millions of dollars in R&D, proprietary data, and competitive advantage. You deploy it to a public cloud Kubernetes cluster. You’ve encrypted the data at rest and set up TLS for transit. You’re safe, right?

Not quite. Modern security strategies still have a gaping hole: data in use.

When your model runs, it must be decrypted into system memory to process user prompts. In a traditional architecture, a malicious actor with root access to the hypervisor—or even a compromised cloud admin—can simply scan the memory of the virtual machine. They can dump your model weights, steal user prompts, and inject malicious code. This is the “clear text” vulnerability, and it keeps CISOs in high-value industries awake at night.

With the global confidential computing market projected to hit $69.4 billion by 2032, the industry is responding. We are moving toward a “black box” model where the cloud provider can process your data but never actually see it. This article explores how to implement this architecture using the cutting-edge combination of NVIDIA Hopper GPUs, AMD SEV-SNP, and Kubernetes 1.35.

The AI Security Gap: Why Encryption at Rest Isn’t Enough

Most organizations operate under the assumption that encryption covers their bases. They encrypt their S3 buckets, their databases, and their network traffic. However, during the inference phase, that protective layer is stripped away.

The CPU and GPU need direct access to the raw model weights and user data to perform matrix multiplications. This data sits in DRAM and VRAM, completely exposed to the operating system and the hypervisor managing the virtual machine.

The attack vector here is surprisingly simple. If an attacker compromises the host OS or the hypervisor—a vulnerability known as a “hypervisor escape”—they gain visibility into all guest VMs running on that hardware. They can install a memory scraper that siphons off model weights or sensitive healthcare data in real-time. With the average cost of a data breach sitting at $4.45 million, the financial impact of IP theft from an LLM can be catastrophic.

Furthermore, regulatory frameworks like GDPR and HIPAA are increasingly scrutinizing data-in-use. To truly comply, you must ensure that even the infrastructure provider cannot access the processing data.

Hardware Deep Dive: NVIDIA Hopper & AMD SEV-SNP Synergy

To close this gap, we need hardware-level isolation. This is where the synergy between the NVIDIA Hopper architecture (specifically the H100 GPU) and AMD EPYC processors (featuring SEV-SNP) becomes critical.

NVIDIA Hopper’s Confidential Mode

The NVIDIA H100 is the first data center GPU to support a Confidential Computing mode. It essentially creates a hardware-isolated Trusted Execution Environment (TEE) within the GPU itself.

When enabled, the Hopper architecture encrypts the data and code residing in GPU memory (VRAM). This encryption is transparent to the application but opaque to everything outside the chip. Even if someone probes the physical PCIe bus or attempts to read VRAM directly, they will only encounter ciphertext. The GPU creates a secure partition for the GPU context, ensuring that the model weights are decrypted only inside the chip’s secure boundary.

AMD SEV-SNP Isolation

While the NVIDIA Hopper secures the GPU, the CPU and system memory need protection too. AMD’s Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP) provides this layer.

SEV-SNP encrypts the entire memory of a guest Virtual Machine. It creates a barrier between the guest VM and the hypervisor. Crucially, it adds integrity protection, preventing replay attacks where a malicious hypervisor tries to roll back the VM to a previous, vulnerable state. This ensures that the vCPU state and system memory where the application logic resides are invisible to the cloud provider.

The Handshake: Remote Attestation

Powerful hardware is useless if you can’t trust it. How do you know your GPU is actually running in confidential mode and not a compromised emulator? The answer is a process called Remote Attestation.

Before the model loads, the CPU (AMD SEV-SNP) and the GPU (NVIDIA Hopper) each generate cryptographically signed statements (attestation reports) proving their identity and configuration state. A remote verifier service checks these signatures against a known-good registry. Only if both the CPU and GPU prove they are in a trusted, uncompromised state are the encryption keys released to decrypt the model weights inside the TEE.
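Conceptually, the gate works like the sketch below. This is a toy model, not the real SEV-SNP or Hopper protocol: hardware signing is simulated with HMAC and stand-in device keys (real reports are signed with vendor-rooted keys such as AMD’s VCEK and NVIDIA device certificates), and the “known good registry” is a hard-coded dict.

```python
import hmac, hashlib, secrets

# Simplified attestation gate. Hardware signatures are simulated with
# HMAC over (measurement + nonce); the nonce prevents replay of an old
# report. All key material here is illustrative.

KNOWN_GOOD = {  # the verifier's known-good registry (illustrative values)
    "cpu": hashlib.sha384(b"expected-snp-launch-measurement").hexdigest(),
    "gpu": hashlib.sha384(b"expected-hopper-cc-state").hexdigest(),
}

def make_report(device_key: bytes, measurement: str, nonce: bytes) -> dict:
    """Device side: bind the claimed measurement to the verifier's nonce."""
    mac = hmac.new(device_key, measurement.encode() + nonce, hashlib.sha384)
    return {"measurement": measurement, "nonce": nonce, "sig": mac.hexdigest()}

def verify(report: dict, device_key: bytes, expected: str, nonce: bytes) -> bool:
    """Verifier side: check freshness, signature, and measurement."""
    mac = hmac.new(device_key, report["measurement"].encode() + nonce, hashlib.sha384)
    return (report["nonce"] == nonce
            and hmac.compare_digest(report["sig"], mac.hexdigest())
            and report["measurement"] == expected)

def release_key(cpu_report, gpu_report, cpu_key, gpu_key, nonce):
    """Release the model decryption key only if BOTH TEEs attest cleanly."""
    if verify(cpu_report, cpu_key, KNOWN_GOOD["cpu"], nonce) and \
       verify(gpu_report, gpu_key, KNOWN_GOOD["gpu"], nonce):
        return secrets.token_bytes(32)  # stand-in for the KMS-held key
    return None
```

The important property is the conjunction: a clean CPU report with a tampered GPU report (or vice versa) releases nothing.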

Kubernetes 1.35 Orchestration: Confidential Containers

Hardware is the foundation, but Kubernetes is the orchestration layer that makes AI scalable. In previous versions of Kubernetes, running GPU workloads inside a confidential environment was notoriously difficult. The challenge lies in device passthrough: you cannot simply map a GPU into a container that is encrypted inside a VM without breaking the measured launch environment.

Kubernetes 1.35 (projected context based on the current roadmap) addresses this by maturing the integration of Confidential Containers (CoCo), largely driven by the Kata Containers project.

Kata Containers & CoCo

Traditionally, Kubernetes pods share a kernel. This is a security risk. Confidential Containers solve this by running each pod inside its own isolated, lightweight virtual machine. Using Kata Containers with a runtime class like `kata-qemu` or `kata-clh` (Cloud Hypervisor), you can ensure that your pod runs inside an AMD SEV-SNP protected VM.
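Wiring this up is a one-time cluster-admin step. A minimal RuntimeClass manifest might look like the following; note that the handler name and node label are illustrative and must match whatever your Kata/CoCo deployment actually installed (kata-deploy, for example, registers its own handler names and labels).

```yaml
# Illustrative RuntimeClass for SNP-backed Kata pods. The handler name
# must match the runtime registered in containerd/CRI-O on your nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu-snp
handler: kata-qemu-snp
scheduling:
  nodeSelector:
    katacontainers.io/kata-runtime: "true"
```

Pods then opt in with `runtimeClassName: kata-qemu-snp`, and the scheduler only places them on nodes carrying the Kata label.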

GPU Device Passthrough in TEEs

The magic in Kubernetes 1.35 is the streamlined handling of PCIe passthrough within these confidential VMs. The orchestrator now better understands how to assign a physical NVIDIA GPU to a Kata pod utilizing SEV-SNP without compromising the memory encryption.

This involves tight coordination between the NVIDIA Device Plugin and the Kata runtime. The plugin must request the GPU, and the runtime must ensure the GPU is passed through the IOMMU (Input-Output Memory Management Unit) in a way that maintains the integrity of the encrypted memory domain. This evolution allows data scientists to deploy secure AI workloads using standard Kubernetes manifests, simply by changing the RuntimeClass.

Implementation Guide: Building the Secure Inference Pipeline

Ready to try this? Here is the high-level workflow for building a secure inference pipeline on this stack.

Prerequisites

You need hardware that supports these features. On the CPU side, you require AMD EPYC Genoa or Bergamo processors. On the GPU side, you need the H100 (Hopper). Firmware must be configured to enable SEV-SNP in the BIOS, and the host OS kernel must have the necessary drivers (AMD SEV and NVIDIA Confidential Computing drivers) loaded.
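A quick host-side sanity check can be scripted. The CPU flag and device node below are what current Linux kernels expose for SNP, but names vary by kernel and driver version, so treat this as a starting point rather than an authoritative check.

```shell
# Illustrative host checks -- flag names and device nodes vary by
# kernel and driver version; consult AMD and NVIDIA docs for yours.
if grep -qw sev_snp /proc/cpuinfo 2>/dev/null; then
    SNP_STATUS="enabled"
else
    SNP_STATUS="not-detected"
fi
echo "SEV-SNP CPU support: $SNP_STATUS"

# Inside a running confidential VM, the SNP guest device should exist:
[ -e /dev/sev-guest ] && echo "/dev/sev-guest: present" || echo "/dev/sev-guest: absent"
```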

Step 1: The Host Owner’s Key

You must configure the Guest Owner policy. This involves generating a key pair that you, as the tenant, control. The cloud provider (Host Owner) cannot see this key. This policy dictates that the VM will only boot if SEV-SNP is active and the measurements match your expectations.
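The policy itself is best thought of as a small document the tenant controls. The sketch below uses a simplified, hypothetical policy format: real SEV-SNP policies are bit-fields supplied at launch, and the real measurement covers the exact firmware, kernel, and initrd the VMM loads.

```python
import hashlib

# Hypothetical, simplified guest-owner policy. Component names below
# (OVMF.fd, vmlinuz, initrd.img) are placeholders for the actual boot
# chain artifacts the guest owner expects.

def expected_measurement(*components: bytes) -> str:
    """Chain-hash the expected boot components (simplified model)."""
    h = hashlib.sha384()
    for c in components:
        h.update(hashlib.sha384(c).digest())
    return h.hexdigest()

policy = {
    "snp_required": True,    # refuse to boot without SEV-SNP active
    "debug_allowed": False,  # no hypervisor debug access to guest memory
    "measurement": expected_measurement(b"OVMF.fd", b"vmlinuz", b"initrd.img"),
}

def admit(reported: dict) -> bool:
    """The VM boots only if the launch report satisfies the policy."""
    return (reported.get("snp_active") is True
            and reported.get("debug") is False
            and reported.get("measurement") == policy["measurement"])
```

Because the tenant holds the keys and the expected measurement, a host that swaps in a patched firmware image simply fails the `admit` check.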

Step 2: The Confidential Pod YAML

Deployment looks mostly standard, with two critical additions. You must specify the `runtimeClassName: kata-qemu` (or equivalent) to trigger the VM isolation, and you must request the NVIDIA resource.

apiVersion: v1
kind: Pod
metadata:
  name: secure-llm-inference
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: inference-server
    image: my-secure-llm:latest
    resources:
      limits:
        nvidia.com/gpu: 1

Step 3: Loading the Model

When the pod starts, do not download the model weights from an internal HTTP server unencrypted. Instead, the application inside the pod should perform remote attestation. Once it verifies it is running inside a genuine Hopper/SEV-SNP environment, it contacts a Key Management Service (KMS) to retrieve the decryption key. The model weights are then decrypted inside the secure memory, never exposing the raw IP to the underlying infrastructure.
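That flow can be sketched as follows. Everything here is a stand-in: the XOR keystream replaces a real cipher like AES-GCM so the example stays dependency-free, and the KMS is reduced to a single function gated on the attestation result.

```python
import hashlib

# In-pod flow: attest, fetch the key from a KMS, decrypt weights in
# memory. The SHA-256 keystream XOR below is a toy cipher for
# illustration only; production code should use AES-GCM and a real
# KMS client (e.g. a cloud KMS or a CoCo key-broker service).

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Symmetric toy cipher: XOR data with a hash-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def fetch_key_from_kms(attestation_ok: bool, kms_key: bytes):
    """KMS releases the key only for a verified attestation (simulated)."""
    return kms_key if attestation_ok else None

def load_model(encrypted_weights: bytes, attestation_ok: bool, kms_key: bytes) -> bytes:
    key = fetch_key_from_kms(attestation_ok, kms_key)
    if key is None:
        raise PermissionError("attestation failed; key not released")
    # Plaintext weights exist only inside the TEE's encrypted memory.
    return keystream_xor(key, encrypted_weights)
```

The pattern to notice: the weights travel and rest encrypted, and the only code path that produces plaintext runs after attestation succeeds, inside the enclave boundary.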

Performance Benchmarks: The Cost of Confidentiality

Security always comes with a cost. The question is whether the performance hit is acceptable for your use case.

AMD SEV-SNP introduces a CPU overhead typically ranging from 2% to 5% due to the encryption/decryption cycles on system memory. The NVIDIA Hopper architecture is highly optimized; the encryption engine on the H100 adds negligible overhead to the actual tensor operations. You won’t see a massive drop in tokens-per-second purely from the GPU encryption.

However, the real bottleneck lies in context switching and networking. Moving data in and out of the TEE requires strict validation. If your application requires high-frequency, low-latency communication between the CPU and GPU, or between different nodes, you may notice increased latency. Furthermore, the measured launch process adds a few seconds to the pod startup time. For high-frequency trading or real-time gaming, this might be a hurdle. For batch processing of financial reports or medical imaging analysis, the overhead is a negligible price to pay for absolute security.

Future Outlook: The Confidential AI Stack

We are witnessing a paradigm shift. The collaboration between the Confidential Computing Consortium, CNCF, and hardware vendors is rapidly standardizing how we handle GPU/CPU TEE coordination.

While we focused on NVIDIA Hopper and AMD EPYC today, expect similar capabilities to arrive for other NPUs and accelerators. The Kubernetes ecosystem is moving toward a state where “confidential” becomes just another runtime class option, as ubiquitous as “gpu” or “spot-instance.”

Is this ready for prime time production? If you are dealing with high-value proprietary models or regulated data (PII, PHI), the answer is increasingly yes. For low-value, consumer-facing chat applications where latency is the only metric that matters, standard instances still hold the edge. But as AI regulations tighten, confidential inferencing won’t just be a luxury—it will be a requirement.

Key Takeaways

  • Data in Use is Vulnerable: Traditional encryption leaves model weights and prompts exposed in memory during inference.
  • Hardware Synergy: NVIDIA Hopper (GPU TEE) and AMD SEV-SNP (CPU/VM TEE) provide full-stack isolation, ensuring the hypervisor cannot read your data.
  • Kubernetes 1.35 Evolution: Newer K8s releases, via Confidential Containers and Kata Containers, streamline GPU passthrough into encrypted VMs.
  • Attestation is Key: Never trust a cloud environment blindly; use remote attestation to verify the hardware state before decrypting models.
  • Manageable Overhead: Expect minor CPU overhead (2-5%) and startup latency, but GPU performance remains largely unaffected.

Ready to secure your AI infrastructure? Start by auditing your current inference pipelines for clear-text vulnerabilities and explore the Kata Containers documentation for your specific cloud provider.

Rody

Founder & CEO · RodyTech LLC

Founder of RodyTech LLC — building AI agents, automation systems, and software for businesses that want to move faster. Based in Iowa. I write about what I actually build and deploy, not theory.
