AI infrastructure spending is growing faster than most organizations can manage. At the center of that growth is GPU compute — powerful, essential, and extraordinarily expensive. And yet, despite the cost, a significant portion of GPU capacity sits largely unused at any given time.
We built Sedai GPU Optimization to fix that. Today, we're announcing its general availability for Kubernetes environments. I’d like to walk you through what we built, why we built it the way we did, and what it means for teams running AI workloads at scale.
The Problem
Over the past year, a consistent theme has emerged in conversations with customers and prospects: GPU costs are spiraling, and nobody feels they have a reliable handle on them.
The pattern is familiar. An AI team needs GPUs for a training job or inference workload. They request more than they need, because the consequences of under-provisioning are painful and the consequences of over-provisioning are, at worst, a bigger cloud bill. Visibility into actual usage is poor, so nobody really knows how much is being wasted. And even when teams suspect they're over-allocated, they're reluctant to make changes. One bad configuration change on a GPU workload can mean a failed training run, a degraded inference service, or an angry ML team.
So the default behavior is to follow the old standard: leave it running, just to be safe, and accept the painful price tag.
The result is that roughly one-third of all GPUs run at less than 15% utilization, while GPU instances can cost 40x more than standard compute based on published cloud pricing. The math on that wastage is brutal.
Why Existing Tools Fall Short
Before building anything, we spent time understanding why this problem persists despite a growing ecosystem of cloud cost and infrastructure tools.
The answer comes down to two things: signal quality and execution confidence.
Signal quality: The most widely used GPU utilization metric — the one reported by nvidia-smi and ingested by most monitoring and FinOps tools — measures whether a GPU is active, not whether it's doing productive work. In the most extreme case, a GPU can report 100% utilization while performing zero actual computation. That means teams are making decisions based on a metric that fundamentally misrepresents what their GPUs are doing.
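To make the distinction concrete, here is a toy illustration (not real telemetry code) of why the activity metric misleads: the reported number is the fraction of time any kernel was resident on the GPU, regardless of how much of the chip that kernel used. The function names and the occupancy field are illustrative.

```python
# Illustration only: "GPU utilization" as reported by nvidia-smi is roughly the
# fraction of time at least one kernel was active on the device, regardless of
# how much of the chip's compute capacity that kernel consumed.

def activity_utilization(kernel_intervals, window):
    """Fraction of the sampling window with any kernel active (nvidia-smi style)."""
    covered = 0.0
    last_end = 0.0
    for start, end, _occupancy in sorted(kernel_intervals):
        start = max(start, last_end)
        if end > start:
            covered += end - start
            last_end = end
    return covered / window

def effective_utilization(kernel_intervals, window):
    """Time-weighted fraction of compute capacity actually used (occupancy-aware)."""
    return sum((end - start) * occ for start, end, occ in kernel_intervals) / window

# One tiny kernel using 2% of the SMs, running back-to-back for the whole window:
timeline = [(0.0, 1.0, 0.02)]
print(activity_utilization(timeline, 1.0))   # 1.0  -> reported as "100% utilized"
print(effective_utilization(timeline, 1.0))  # 0.02 -> 2% of real capacity
```

The same device looks fully busy on the activity metric and nearly idle on the occupancy-aware one, which is exactly the gap the standard metric hides.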
Execution confidence: Even when teams identify potential savings, they struggle to act. GPU optimization is genuinely complex. Hardware configurations vary widely, ML frameworks add layers of abstraction, and the risk of disrupting a production AI workload is high. Most tools that surface GPU cost recommendations stop there, leaving teams to figure out how to safely implement changes on their own. As a result, many teams never translate those recommendations into action, and the savings never materialize.
What We Built
Sedai GPU Optimization is built around solving both problems simultaneously: synthesize a trustworthy utilization signal, then use it to execute safe, meaningful optimizations automatically.
Here's how it works.
Determining True GPU Usage
The foundation of everything we do is a proprietary utilization model that infers true GPU usage from multiple telemetry signals. This is a massive step forward from just using the standard activity metric. Our model reflects what workloads are actually doing with the hardware, producing a first-class utilization score that drives every optimization decision we make.
Building this model was not easy. GPU metrics are fragmented, hardware varies widely across generations and vendors, and the interaction between ML frameworks, Kubernetes scheduling, and GPU hardware creates a lot of complexity. But getting the signal right was non-negotiable — everything downstream depends on it.
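Sedai's actual model is proprietary, but the general shape of the idea can be sketched as a weighted fusion of several normalized telemetry signals into one score. The signal names and weights below are purely illustrative assumptions, not Sedai's real inputs or coefficients.

```python
# Toy sketch (NOT Sedai's actual model): blend several telemetry signals into
# a single utilization score. Signal names and weights are illustrative.
SIGNAL_WEIGHTS = {
    "sm_activity": 0.4,       # fraction of time SMs had work scheduled
    "sm_occupancy": 0.3,      # how full that scheduled work kept the SMs
    "memory_bandwidth": 0.2,  # fraction of peak DRAM bandwidth in use
    "tensor_activity": 0.1,   # tensor-pipe activity
}

def utilization_score(signals):
    """Weighted blend of telemetry signals, each clamped to [0, 1]."""
    score = 0.0
    for name, weight in SIGNAL_WEIGHTS.items():
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)
        score += weight * value
    return score

# A GPU that looks "100% busy" on the activity metric but is doing little work:
busy_looking_idle = {"sm_activity": 1.0, "sm_occupancy": 0.05,
                     "memory_bandwidth": 0.02, "tensor_activity": 0.0}
print(round(utilization_score(busy_looking_idle), 3))  # 0.419
```

The point of the blend is that no single signal can be gamed into a misleading 100%: a GPU spinning on tiny kernels scores high on activity but low everywhere else.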
Three Core Optimization Capabilities
With a reliable utilization model in place, we built three optimization capabilities at launch:
GPU Deallocation identifies Kubernetes workloads that have GPU resources allocated but aren't actively using them. These are quick (and significant) wins — workloads that requested a GPU, don't need it, and are silently burning through the budget with nothing to show for it. Sedai detects these unnecessary allocations and, after getting the go-ahead in Copilot mode, executes the change autonomously with full safety checks.
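The detection logic can be sketched as a simple screen over workloads: flag anything that requests a GPU but whose utilization score stays below a threshold across an observation window. The field names and the threshold value here are hypothetical, not Sedai's documented behavior.

```python
# Hypothetical sketch of the deallocation check. The workload schema
# ("gpu_requests", "utilization_history") and threshold are illustrative.
IDLE_THRESHOLD = 0.02  # assumed cutoff, not a documented Sedai value

def idle_gpu_workloads(workloads, threshold=IDLE_THRESHOLD):
    """Names of workloads holding a GPU whose utilization never rose above threshold."""
    flagged = []
    for w in workloads:
        if w["gpu_requests"] > 0 and w["utilization_history"]:
            if max(w["utilization_history"]) < threshold:
                flagged.append(w["name"])
    return flagged

workloads = [
    {"name": "train-job", "gpu_requests": 4, "utilization_history": [0.7, 0.9, 0.8]},
    {"name": "stale-notebook", "gpu_requests": 1, "utilization_history": [0.0, 0.01, 0.0]},
]
print(idle_gpu_workloads(workloads))  # ['stale-notebook']
```

In practice the screen only works if the utilization signal is trustworthy — which is why the model described above comes first.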

Partitioning targets NVIDIA GPUs that support Multi-Instance GPU and AWS G6 fractional GPU instances. These options slice a single physical GPU into smaller instances, so one large GPU can serve multiple workloads instead of each workload monopolizing a full device.
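For context on what a partitioned request looks like in Kubernetes: with the NVIDIA device plugin configured for MIG, a pod can request a named slice rather than a whole GPU. The pod name and image below are placeholders; the exact resource names available depend on the cluster's device-plugin configuration and GPU model.

```yaml
# Example only: a pod requesting a single MIG slice instead of a whole GPU.
# Available resource names depend on the NVIDIA device-plugin setup.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference        # placeholder name
spec:
  containers:
    - name: server
      image: my-inference-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1       # one 1g.5gb slice of an A100
```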

GPU Node Pool Optimization analyzes how workloads are distributed across GPU devices and recommends repacking them to consolidate onto fewer nodes. This can free up entire GPU devices. It’s not just about reducing allocation, but reclaiming physical hardware that can be redeployed or deprovisioned.
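Repacking is, at its core, a bin-packing problem. The sketch below uses first-fit decreasing, a classic heuristic, purely to illustrate the idea — it is not Sedai's actual placement algorithm, which must also respect scheduling constraints, disruption budgets, and safety checks.

```python
# Illustrative first-fit-decreasing repack (NOT Sedai's actual algorithm):
# given per-workload GPU demand and uniform node capacity, how few nodes suffice?
def repack(gpu_demands, node_capacity):
    """Return (node_count, [(node_index, demand), ...]) after FFD packing."""
    nodes = []       # remaining capacity per node
    assignment = []
    for demand in sorted(gpu_demands, reverse=True):
        for i, free in enumerate(nodes):
            if free >= demand:          # first node that still fits this workload
                nodes[i] -= demand
                assignment.append((i, demand))
                break
        else:                           # no existing node fits: open a new one
            nodes.append(node_capacity - demand)
            assignment.append((len(nodes) - 1, demand))
    return len(nodes), assignment

# Six workloads that might today be spread across six separate 8-GPU nodes:
count, placement = repack([4, 2, 2, 1, 6, 1], node_capacity=8)
print(count)  # 2 -> four nodes could be drained and deprovisioned
```

Even this naive heuristic shows the payoff: consolidating scattered workloads frees whole nodes, which is where the hardware-level savings come from.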

Safe Autonomy at Every Step
One of our core product principles at Sedai is that autonomy should be earned through trust, not assumed from day one. GPU Optimization follows the same Datapilot → Copilot → Autopilot model as the rest of our platform — letting teams start with guided recommendations, progress to one-click execution, and ultimately move toward fully hands-off autonomous execution at their own pace. Teams running production AI workloads need to trust that optimization changes won't break things before they're willing to hand over the wheel. We designed the product to build that trust incrementally.

Who This Is For
GPU Optimization is relevant across several personas, each of whom may have different priorities.
Platform and infrastructure teams get a consistent, safe way to manage GPU resources across Kubernetes clusters, without requiring deep GPU expertise for every team member.
FinOps and finance leaders get measurable, attributable savings on one of the fastest-growing cost categories in their cloud bill, with clear before-and-after reporting at the workload and node pool level.
ML and AI engineering teams benefit indirectly but significantly: better GPU packing means fewer delays waiting for available capacity, and right-sized workloads mean less resource contention across the cluster.
Cloud architects get a consistent optimization approach that works across Kubernetes platforms and distributions — EKS, GKE, AKS, OpenShift, and more — without vendor-specific workarounds.
What's Next
GPU Optimization is the first step in a larger investment in AI infrastructure. The initial release focuses on Kubernetes, where the majority of our customers run their GPU workloads today. We're already working on expanding autonomous execution across all capabilities, broadening platform support, and extending optimization beyond Kubernetes to GPU-based VMs.
The GPU space is moving fast, and we're moving with it, continuing to push GPU Optimization forward as AI infrastructure evolves and new capabilities become possible.
Get Started
Sedai GPU Optimization is available now. Existing Sedai customers can enable it within their current environment. If you're new to Sedai, you can learn more and book a demo.
If you're carrying GPU cost as a line item and you're not sure what's actually driving it, we'd love to show you what we're seeing.
