This post expands on a conversation I had with Suresh on the 1 IDEA podcast.
Sedai has been autonomously optimizing cloud infrastructure for a while. So when customers started asking about rising GPU costs in 2025, we assumed the problem would look like the ones we already solve.
We’d identify the inefficiency, right-size the resource, and execute the change.
We were very wrong.

A 2025 survey of nearly 1,000 Kubernetes practitioners found that a single GPU running at 30% utilization wastes around $70,000 a year at current H100 rates.
The waste is well documented. What isn't, as a 2025 ACM study confirmed, is how to measure it accurately enough to fix it. When we talked to GPU experts early in this project, they told us: if you can crack the measurement problem, that's going to be genuinely useful.
Up until now, nobody had. So we built the solution ourselves, starting with a metric we had to stop trusting.
Why nvidia-smi GPU Utilization Is the Wrong Metric
When you run nvidia-smi and check GPU utilization, you might see 95% with no errors in the logs. We pulled that number early on and wrote code against it, but it was completely wrong.
nvidia-smi GPU utilization measures whether the GPU is active, not how efficiently it's being used. A small workload can keep the number at 95% simply by occupying the GPU, even if it's only using a fraction of the available compute capacity.
95% doesn't mean the GPU is nearly full; it simply means the GPU is doing something. And if that's the number you're building your optimization logic on, you'll never find the actual waste.
Running the same workload through nvidia-smi and DCGM side by side shows exactly how misleading the default metric is.
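To make the gap concrete, here's a minimal Python sketch with made-up numbers. It compares the NVML-style definition of utilization (the share of the sample period in which at least one kernel was executing) with an SM-weighted activity figure in the spirit of DCGM's SM-active profiling metric. The `KernelInterval` type, the timeline, and the helper names are assumptions for illustration, not real measurements or Sedai's code; 108 is the SM count of an A100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelInterval:
    start_ms: float  # when the kernel started executing
    end_ms: float    # when it finished
    sms_busy: int    # SMs that had work while it ran


def time_based_utilization(intervals, period_ms):
    """Share of the sample period in which at least one kernel was running."""
    busy_ms = 0.0
    covered_until = 0.0
    for iv in sorted(intervals, key=lambda i: i.start_ms):
        # Skip any part already counted so overlapping kernels are not doubled.
        start = max(iv.start_ms, covered_until)
        end = min(iv.end_ms, period_ms)
        if end > start:
            busy_ms += end - start
            covered_until = end
    return busy_ms / period_ms


def sm_weighted_activity(intervals, period_ms, total_sms):
    """Average share of SMs with work over the period.

    Simplification: assumes kernels do not overlap in time.
    """
    sm_ms = sum(
        max(0.0, min(iv.end_ms, period_ms) - iv.start_ms) * iv.sms_busy
        for iv in intervals
    )
    return sm_ms / (period_ms * total_sms)


# Hypothetical timeline: one small kernel keeps 4 of 108 SMs busy
# for 950 ms of a 1-second sample window.
TOTAL_SMS = 108
timeline = [KernelInterval(start_ms=0.0, end_ms=950.0, sms_busy=4)]

smi_style = time_based_utilization(timeline, period_ms=1000.0)
sm_style = sm_weighted_activity(timeline, period_ms=1000.0, total_sms=TOTAL_SMS)
print(f"time-based utilization: {smi_style:.1%}")
print(f"SM-weighted activity:   {sm_style:.1%}")
```

In this toy timeline the time-based number reads as a nearly full GPU, while the SM-weighted number shows that only a few percent of the compute capacity had work.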

Once we accepted that, we hit the next problem: there's no single metric that replaces it.
You have to synthesize memory bandwidth, SM efficiency, SM occupancy, & kernel efficiency to get anywhere close to a real utilization number. Even then, it's still incomplete.
Most well-optimized GPU workloads can only use 30-40% of what the hardware is actually capable of, and standard monitoring won't surface that gap.
Memory compounds this. While CPUs access system RAM directly, GPUs have their own VRAM on the physical card, and the two pools transfer data over a PCIe link — which is slow.
Because of this, your GPU can sit idle waiting for data while every metric in your dashboard looks healthy. Most teams focus on FLOPS & compute specs. The real bottleneck is often the memory transfer, and it barely registers in standard monitoring.
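A back-of-the-envelope calculation shows why the transfer matters. The sketch below is hedged arithmetic, not a benchmark: the bandwidth constants are approximate nominal figures (roughly 32 GB/s per direction for PCIe 4.0 x16, roughly 1,555 GB/s for A100 40GB on-card memory), and the batch size and compute time are hypothetical.

```python
# Assumed nominal peak bandwidths in GB/s; real throughput is lower.
PCIE_GEN4_X16_GB_PER_S = 32.0
A100_40GB_VRAM_GB_PER_S = 1555.0


def transfer_ms(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb at the given bandwidth, in milliseconds."""
    return size_gb / bandwidth_gb_per_s * 1000.0


def idle_fraction(transfer_time_ms, compute_time_ms):
    """Share of each step the GPU waits when transfer and compute do not overlap."""
    return transfer_time_ms / (transfer_time_ms + compute_time_ms)


# Hypothetical step: copy a 2 GB batch to the GPU, then compute for 50 ms.
batch_gb = 2.0
compute_ms = 50.0

over_pcie_ms = transfer_ms(batch_gb, PCIE_GEN4_X16_GB_PER_S)
from_vram_ms = transfer_ms(batch_gb, A100_40GB_VRAM_GB_PER_S)
waiting = idle_fraction(over_pcie_ms, compute_ms)

print(f"host to GPU over PCIe: {over_pcie_ms:.1f} ms")
print(f"same bytes read from VRAM: {from_vram_ms:.2f} ms")
print(f"share of step spent waiting on PCIe: {waiting:.0%}")
```

Under these assumptions the copy takes longer than the compute, so the GPU spends more than half of each step waiting unless transfers are overlapped with work.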
Starting From the Hardware
Because none of the standard metrics gave us a reliable signal, we couldn't build on top of existing tooling. We had to go down to the hardware itself.
If you're from a CUDA background, when metrics don't make sense, you write controlled programs to understand what they're actually measuring. So, we started by writing the simplest possible CUDA program: a single kernel running one thread on one SM.
The goal was to see exactly how each metric behaved with a known, minimal workload before adding any complexity.
We then increased complexity gradually: a compute-heavy workload, then a memory-heavy workload, then different GPU architectures & conditions. We found that each workload type produces a consistent pattern across the metrics.
A compute-bound workload shows high SM occupancy & high FP32 or tensor core activity, but memory bandwidth stays low. That means the cores are busy crunching numbers and rarely waiting on data.
A memory-bound workload flips that pattern: memory bandwidth saturates while SM occupancy stays moderate, because warps are constantly stalling, waiting for data to arrive from VRAM rather than doing useful math.
The tricky part is that nvidia-smi reports the same 95% utilization for both, since all it measures is whether any kernel was running. It can't tell you whether that kernel was actually doing productive work or just sitting in a memory stall.
Training, inference, & batch jobs each have their own signature. Our algorithm came together by learning to recognize those patterns across many signals rather than relying on any single number.
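The compute-bound and memory-bound signatures described above can be sketched as a small rule-based classifier. This is an illustration only, not Sedai's algorithm: the `GpuSample` fields, the `HIGH` and `LOW` thresholds, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example readable.
HIGH = 0.7
LOW = 0.3


@dataclass(frozen=True)
class GpuSample:
    sm_occupancy: float     # 0.0-1.0
    pipe_activity: float    # 0.0-1.0, FP32 or tensor core activity
    mem_bandwidth: float    # 0.0-1.0, share of peak memory bandwidth
    smi_utilization: float  # 0.0-1.0, what nvidia-smi reports


def classify(sample):
    """Label a sample by the pattern across signals, ignoring smi_utilization."""
    if (
        sample.sm_occupancy >= HIGH
        and sample.pipe_activity >= HIGH
        and sample.mem_bandwidth <= LOW
    ):
        return "compute-bound"
    if sample.mem_bandwidth >= HIGH and sample.sm_occupancy < HIGH:
        return "memory-bound"
    return "unclassified"


# Two made-up workloads that nvidia-smi would report identically.
matmul_heavy = GpuSample(0.85, 0.80, 0.20, smi_utilization=0.95)
gather_heavy = GpuSample(0.50, 0.15, 0.90, smi_utilization=0.95)

for name, sample in [("matmul_heavy", matmul_heavy), ("gather_heavy", gather_heavy)]:
    print(name, f"smi={sample.smi_utilization:.0%}", classify(sample))
```

Both samples carry the same 95% from nvidia-smi, yet the other three signals separate them cleanly. A production system would need many more signals and workload-specific thresholds than this two-rule sketch.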
The Validation Problem
Building the algorithm was hard. Validating it was harder, because there is no industry standard for GPU utilization, and NVIDIA doesn't publish one.
A metric threshold that looks healthy for an LLM training workload will look completely different on an inference workload or a batch job, because each workload type uses the GPU differently. There's no external benchmark to check your work against.
So we built our own. For each application type, we observed patterns: what healthy looks like, what inefficient looks like, & what the early warning signs are.
That per-application model is both how we trained the algorithm and how we verified it. When Sedai fixes an inefficiency, the outcome feeds back into the system so the model learns from every action it takes, and the same diagnostic cycle doesn't repeat.
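As a rough illustration of why one threshold can't serve every workload, here's a minimal sketch of per-workload-type baselines. The ranges in `HEALTHY_RANGES` are placeholders invented for the example, not values from Sedai's model or from NVIDIA.

```python
# Placeholder ranges per workload type: (low, high) bounds for each metric.
HEALTHY_RANGES = {
    "training": {"sm_occupancy": (0.6, 1.0), "mem_bandwidth": (0.3, 0.9)},
    "inference": {"sm_occupancy": (0.2, 0.7), "mem_bandwidth": (0.1, 0.6)},
}


def out_of_range(workload_type, observed):
    """Return the observed metrics that fall outside the baseline for this type."""
    flagged = {}
    for metric, (low, high) in HEALTHY_RANGES[workload_type].items():
        value = observed[metric]
        if not low <= value <= high:
            flagged[metric] = value
    return flagged


# The same hypothetical reading, judged against two different baselines.
observed = {"sm_occupancy": 0.35, "mem_bandwidth": 0.25}
as_training = out_of_range("training", observed)
as_inference = out_of_range("inference", observed)
print("as training: ", as_training)
print("as inference:", as_inference)
```

With these placeholder ranges, the reading is flagged on both metrics for a training job and on neither for an inference job.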

How We Right-Size GPU Allocation in Kubernetes
Having an accurate utilization picture only matters if you can act on it.
Kubernetes by default treats GPUs as indivisible units. A pod requests nvidia.com/gpu: 1 and gets an entire device regardless of actual need.

NVIDIA's Multi-Instance GPU (MIG) partitions a physical GPU into isolated slices with dedicated memory and a defined share of compute. An A100 can be split into configurations like 1g.5gb, 2g.10gb, or 4g.20gb.

The constraint is that you're choosing from a predefined menu; you can't just request 37% of a GPU. In practice, teams pick the closest slice size to what their workload actually needs and accept that it won't be a perfect fit.
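Picking from the menu can be sketched in a few lines. The example below assumes the MIG profiles NVIDIA documents for the A100 40GB (seven compute slices in total) and a hypothetical `pick_profile` helper that returns the smallest profile covering a workload's compute share and memory. It is a simplification, not how Sedai or the NVIDIA tooling selects profiles.

```python
from dataclasses import dataclass
from typing import Optional

TOTAL_COMPUTE_SLICES = 7  # compute slices on one A100


@dataclass(frozen=True)
class MigProfile:
    name: str
    compute_slices: int  # out of TOTAL_COMPUTE_SLICES
    memory_gb: int


# A100 40GB profiles, as assumed for this example.
A100_40GB_PROFILES = [
    MigProfile("1g.5gb", 1, 5),
    MigProfile("2g.10gb", 2, 10),
    MigProfile("3g.20gb", 3, 20),
    MigProfile("4g.20gb", 4, 20),
    MigProfile("7g.40gb", 7, 40),
]


def pick_profile(compute_fraction, memory_gb, profiles=A100_40GB_PROFILES) -> Optional[MigProfile]:
    """Smallest profile that covers both the compute share and the memory need."""
    for profile in sorted(profiles, key=lambda p: (p.compute_slices, p.memory_gb)):
        enough_compute = profile.compute_slices / TOTAL_COMPUTE_SLICES >= compute_fraction
        enough_memory = profile.memory_gb >= memory_gb
        if enough_compute and enough_memory:
            return profile
    return None  # nothing on this GPU fits


# A workload that needs 37% of the GPU's compute and 8 GB of memory.
choice = pick_profile(compute_fraction=0.37, memory_gb=8)
print(choice)
```

Here the 37% workload lands on a 3g.20gb slice, which is about 43% of the compute and 12 GB more memory than it asked for. That gap is the imperfect fit teams accept.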

Time slicing divides GPU time between workloads but provides no memory isolation, so a single compute-heavy workload can consume everything. That makes it unsuitable for production.
Dynamic Resource Allocation (DRA) is a Kubernetes construct that sits on top of MIG and handles scheduling — it decides which predefined slice a workload gets assigned to. It's promising but not fully production ready. We're in the early stages of adoption.
Here’s how Sedai configures MIG partitions on a GPU node and maps workloads to the resulting slices:
- Sedai creates a ResourceClaimTemplate. Defines the MIG profile Sedai selected for the workload class.

- Sedai patches each pod to claim a slice. Workload pods reference the template via resourceClaims instead of traditional resource limits.

- Resulting GPU layout. Sedai packs three inference workloads onto one A100. Four slices remain for future pods.
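The slice accounting behind a layout like that can be sketched as a simple first-fit count. This is an illustration under loud assumptions: the pod names are hypothetical, the three workloads are assumed to use 1g.5gb slices (one combination consistent with the counts above), and the sketch only counts compute slices. Real MIG placement also restricts which profiles can be combined and where they sit on the GPU.

```python
SLICES_PER_A100 = 7

# Compute slices consumed by each profile named earlier in this post.
PROFILE_SLICES = {"1g.5gb": 1, "2g.10gb": 2, "4g.20gb": 4}


def pack(requests, total_slices=SLICES_PER_A100):
    """First-fit by slice count: returns (placed pods, pending pods, free slices)."""
    placed, pending = [], []
    free = total_slices
    for pod, profile in requests:
        need = PROFILE_SLICES[profile]
        if need <= free:
            placed.append(pod)
            free -= need
        else:
            pending.append(pod)  # would need another GPU
    return placed, pending, free


# Hypothetical pods, each assumed to claim a 1g.5gb slice.
requests = [
    ("inference-a", "1g.5gb"),
    ("inference-b", "1g.5gb"),
    ("inference-c", "1g.5gb"),
]
placed, pending, free = pack(requests)
print(f"placed={placed} pending={pending} free_slices={free}")
```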

AWS fractional GPU instances (G6F) handle partitioning at the hypervisor level rather than requiring MIG configuration. Memory is fully isolated, but compute is time-sliced. It's a different model than MIG but solves a similar problem. Sedai supports both.
What's Still Unsolved in GPU Optimization
We figured out a lot when it comes to GPU optimization, but in these early stages of the GPU boom, there’s still more to solve.
MIG partitions are static. You define them in advance and they stay fixed, but traffic doesn't. A workload at noon looks different from the same workload at 2am.
The next problem is dynamic GPU partitioning, where MIG slices expand & contract automatically based on actual traffic, the same way CPU & memory resources already scale in Kubernetes.
The tooling to do this at the hardware level doesn't exist yet. DRA gets closer, but it's still not production ready. Until it is, partition configurations have to be defined upfront & managed manually when workload patterns shift.
Beyond partitioning, the broader challenge is that GPU infrastructure is still maturing fast. New hardware generations, new ML frameworks, & new Kubernetes primitives are all moving simultaneously. What works today may need to be rebuilt for the next generation of hardware.
The measurement problem we solved for A100s and H100s will need to be revalidated for whatever comes next.
If you want to learn more about how Sedai autonomously optimizes Kubernetes GPU workloads in production, Ethan's post on the GPU Optimization launch covers what we built and how it works.
Sedai's GPU optimization is available now. Stop guessing at GPU utilization; schedule a demo to see how.
