What are the most common mistakes when scaling GPUs with Karpenter on EKS?
The most common mistakes include relying on Auto Scaling Groups (ASGs) for GPUs, over-constraining instance families, trusting default resource reporting for new instance types, paying cold-start costs from image pulls, and ignoring the difference between voluntary and involuntary disruptions. Each of these can lead to inefficiency, wasted costs, or failed workloads. See the full guide below for actionable solutions to each mistake.
Why should you avoid using Auto Scaling Groups (ASGs) for GPU workloads in EKS?
ASGs require you to pre-define capacity, which can leave pods pending if the right instance type isn't available. Karpenter bypasses ASGs by provisioning capacity directly through the EC2 Fleet API, allowing it to launch the exact instance type your workload needs within seconds.
How can over-constraining instance families impact GPU scaling with Karpenter?
Pinning a NodePool to a single instance type (e.g., g5.xlarge) can lead to insufficient capacity errors, especially in the Spot market. Defining broader categories for instance selection increases your chances of finding available and cost-effective capacity.
What is the issue with default resource reporting for NVIDIA P5 and G6 instances?
Some NVIDIA P5 and G6 instances report a GPU count of 0 to the ec2:DescribeInstanceTypes API. This can cause Karpenter to mis-schedule non-GPU pods onto expensive nodes or fail to provision them for ML jobs. Until AWS patches this, explicitly exclude these families from general-purpose NodePools using the NotIn operator.
How can you reduce cold-start costs for GPU workloads in Kubernetes?
Cold starts are costly for GPU workloads due to large container images. To reduce costs, use peer-to-peer image distribution tools like Dragonfly or Spegel, leverage Data on EKS blueprints, and consider lazy loading with Seekable OCI or e-stargz to start containers before the full image downloads.
What are voluntary and involuntary disruptions in Karpenter, and why do they matter for GPU jobs?
Voluntary disruptions are initiated by Karpenter (e.g., consolidation, drift detection) and are typically handled gracefully. Involuntary disruptions are caused by Spot interruptions or hardware failures and are abrupt. For jobs that can't tolerate interruption, configure Karpenter to avoid voluntary disruptions for those workloads.
How does Spot-to-Spot consolidation work in Karpenter for GPU nodes?
Spot-to-Spot consolidation is an experimental feature in Karpenter that allows the system to swap your current Spot instance for a cheaper one as market prices change, further optimizing costs for GPU workloads.
How can you pre-fetch container images to reduce GPU node cold starts?
With Bottlerocket OS, you can pre-seed nodes with large ML images, snapshot the data volume to EBS, and reference the snapshot ID in your EC2NodeClass. This ensures new GPU nodes boot with images already present, reducing cold start times.
What is GPU time-slicing and when should you use it?
GPU time-slicing allows multiple workloads to share a single physical GPU, increasing utilization for inference and dev/test workloads. However, it does not provide memory isolation, so for production workloads requiring isolation, use Multi-Instance GPU (MIG) instead.
How can you protect long-running GPU jobs from voluntary disruptions in Karpenter?
Apply the annotation karpenter.sh/do-not-disrupt: "true" to your pod. This prevents Karpenter from touching the node through automated consolidation or drift until the pod completes or enters a terminal phase.
What is gang scheduling and why is it important for distributed training on GPUs?
Gang scheduling ensures that all pods for a distributed training job start at once. Without it, partial allocation can leave expensive GPUs idle. Tools like Kueue or NVIDIA KAI Scheduler integrate with Karpenter to prevent partial allocation waste.
Why should you pin AMI versions for GPU workloads in EKS?
Pinning AMI versions prevents drift events caused by dynamic AMI aliases (like @latest), which can trigger fleet-wide updates and break CUDA compatibility. Always specify a stable AMI version in your EC2NodeClass for GPU workloads.
How do you configure EFA to fix NCCL timeouts in distributed GPU training?
Ensure your EC2NodeClass references a security group that explicitly allows inbound traffic from itself. This is necessary for Elastic Fabric Adapter (EFA) to support NCCL’s low-latency operations on P4 or P5 instances.
How can you increase pod density on GPU nodes in EKS?
Enable prefix assignment mode in your VPC CNI. This allows each Elastic Network Interface (ENI) to handle more IP addresses, so pod density is limited by GPU power, not network constraints.
Why is Karpenter’s bin-packing model critical for GPU efficiency?
Karpenter’s bin-packing model prioritizes utilization by fitting compatible workloads together and reclaiming wasted capacity before scaling out. This is especially important for GPUs, which are expensive and often underutilized in node-centric models.
How does Sedai help optimize GPU costs and efficiency for AI infrastructure?
Sedai autonomously optimizes GPU and cloud resources, cutting costs and improving efficiency without manual intervention. It leverages machine learning to rightsize workloads, eliminate waste, and proactively resolve issues, ensuring safe and cost-effective AI infrastructure management. Learn more.
What are the key benefits of using Sedai for cloud and GPU optimization?
Sedai reduces cloud costs by up to 50%, improves performance by reducing latency by up to 75%, and enhances reliability by proactively resolving issues. It automates routine tasks, delivers up to 6X productivity gains, and supports AWS, Azure, GCP, and Kubernetes environments. Source.
How quickly can Sedai be implemented for cloud or GPU optimization?
Sedai’s setup process takes just 5 minutes for general use cases and up to 15 minutes for specific scenarios like AWS Lambda. It features plug-and-play implementation with agentless integration, making onboarding fast and simple. Source.
What integrations does Sedai support for cloud and GPU management?
Sedai supports AWS, Azure, GCP, and Kubernetes environments with agentless integration, and it connects to Infrastructure as Code (IaC), IT Service Management (ITSM), and compliance workflows.
What is Sedai’s autonomous cloud optimization platform?
Sedai’s autonomous cloud optimization platform uses machine learning to optimize cloud resources for cost, performance, and availability without manual intervention. It covers compute, storage, and data across AWS, Azure, GCP, and Kubernetes. Source.
How does Sedai’s proactive issue resolution work?
Sedai detects and resolves performance and availability issues before they impact users, reducing failed customer interactions by up to 50% and ensuring seamless operations. Source.
What is Sedai’s Release Intelligence feature?
Release Intelligence tracks changes in cost, latency, and errors for each deployment, improving release quality and minimizing risks during deployments. Source.
What modes of operation does Sedai offer?
Sedai offers Datapilot (observability), Copilot (one-click optimizations), and Autopilot (fully autonomous execution) modes to match different operational needs. Source.
How does Sedai ensure safe and auditable changes in cloud environments?
Sedai integrates with Infrastructure as Code (IaC), IT Service Management (ITSM), and compliance workflows to ensure all changes are safe, validated, and auditable. Source.
Use Cases & Business Impact
What business impact can customers expect from using Sedai?
Customers can achieve up to 50% cloud cost savings, 75% latency reduction, 6X productivity gains, and 50% fewer failed customer interactions. For example, Palo Alto Networks saved $3.5 million, and KnowBe4 achieved 50% cost savings in production. Source.
Who are some of Sedai’s customers?
Notable customers include Palo Alto Networks, HP, Experian, KnowBe4, Expedia, CapitalOne Bank, GSK, and Avis. These companies use Sedai to optimize cloud environments and improve operational efficiency. Source.
What industries does Sedai serve?
Sedai serves industries such as cybersecurity, IT, financial services, security awareness training, travel and hospitality, healthcare, car rental services, retail and e-commerce, SaaS, and digital commerce. Source.
Who is the target audience for Sedai?
Sedai is designed for platform engineers, IT/cloud ops, technology leaders, site reliability engineers (SREs), and FinOps professionals in organizations with significant cloud operations across multiple industries. Source.
Competition & Differentiation
How does Sedai differ from other cloud optimization tools?
Sedai offers 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack cloud coverage, unique release intelligence, and a plug-and-play setup. These features set it apart from competitors that rely on manual adjustments, static rules, or limited scope. Source.
What are Sedai’s unique features for GPU and cloud optimization?
Unique features include 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack coverage, release intelligence, and quick plug-and-play implementation. These address specific use cases and provide a competitive edge. Source.
Security, Compliance & Support
Is Sedai SOC 2 certified?
Yes, Sedai is SOC 2 certified, demonstrating adherence to stringent security and compliance standards. Source.
Where can I find technical documentation for Sedai?
Technical documentation is available at docs.sedai.io/get-started, including setup guides, feature explanations, and troubleshooting resources. Source.
What support options are available for Sedai customers?
Sedai provides personalized onboarding, a dedicated Customer Success Manager for enterprise customers, detailed documentation, a community Slack channel, and email/phone support. A 30-day free trial is also available. Source.
The Ultimate Guide to GPU Scaling With Karpenter
Nikhil Gopinath
Content Writer
March 17, 2026
The era of "GPU at any cost" has officially ended. As AI and ML move from the research lab to production at scale, the focus is shifting from simply acquiring compute to orchestrating it with precision.
In the Kubernetes world, Karpenter has emerged as the superior tool for this shift. Unlike the legacy Cluster Autoscaler, which treats nodes like rigid blocks in a spreadsheet, Karpenter treats them like fluid resources that can be binned, packed, and swapped in seconds.
If you are running GPU workloads on Amazon EKS, here is your definitive guide to the scaling mistakes to stop making and what to do instead.
5 GPU Scaling Mistakes To Stop Making
1. Stop Using ASGs for GPUs
The Cluster Autoscaler relies on pre-defined Auto Scaling Groups (ASGs), forcing you to guess capacity in advance. If your ML training job needs a p4d, and you only have an ASG for g4dn, your pod stays pending forever.
Karpenter bypasses ASGs entirely by provisioning capacity directly through the EC2 Fleet API, allowing it to launch the exact instance type a workload requests within seconds.
2. Stop Over-Constraining Instance Families
A common habit carried over from the Cluster Autoscaler is pinning a Karpenter NodePool to a single instance type like g5.xlarge. In the Spot market, this is a recipe for insufficient capacity errors. Stop being picky.
Instead, define broad categories in your NodePool. For example:
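A minimal sketch of what broader requirements could look like in a Karpenter v1 NodePool (the pool and EC2NodeClass names here are illustrative, not from the original article):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-flexible            # illustrative name
spec:
  template:
    spec:
      requirements:
        # Accept whole GPU instance categories instead of one exact type
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]
        - key: karpenter.k8s.aws/instance-gpu-manufacturer
          operator: In
          values: ["nvidia"]
        # Allow both Spot and On-Demand so provisioning can fall back
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu               # assumes an EC2NodeClass named "gpu" exists
```

Broad requirements give the EC2 Fleet API far more instance types to choose from, which matters most when you are competing for Spot capacity.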
3. Stop Trusting Default Resource Reporting for New Instance Types
There is a known architectural "gotcha" with the latest hardware: some NVIDIA P5 and G6 instances currently report a GPU count of 0 to the ec2:DescribeInstanceTypes API.
If you don't account for this, Karpenter might mistakenly schedule non-GPU pods onto these expensive nodes or fail to provision them for your actual ML jobs. Until this is patched, explicitly exclude these families from your general-purpose NodePools using the NotIn operator.
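One way that exclusion could look in the requirements of a general-purpose NodePool (a sketch; the pool name is illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose         # illustrative name
spec:
  template:
    spec:
      requirements:
        # Keep the families that mis-report GPU counts out of this pool
        - key: karpenter.k8s.aws/instance-family
          operator: NotIn
          values: ["p5", "g6"]
```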
4. Stop Paying Cold-Start Costs From Image Pulls
Cold starts are especially costly for GPU workloads, and AI container images are notoriously bloated, often exceeding 10GB. If a pod spends 5 minutes in ImagePullBackOff while a GPU node sits idle, you burn expensive capacity before any work begins.
Stop relying on default image pulls during cold starts. Explore the Data on EKS (DoEKS) initiative for optimized blueprints, and consider peer-to-peer distribution tools like Dragonfly or Spegel to pull images from neighboring nodes instead of the registry.
5. Stop Ignoring the Difference Between Voluntary and Involuntary Disruptions
Karpenter is aggressive about efficiency, and without a clear distinction between voluntary and involuntary disruptions, teams can lose work unnecessarily or assume Karpenter is unsafe when it is actually behaving as designed.
Voluntary disruptions: Initiated by Karpenter, such as consolidation, drift detection, or expiration. These are typically handled gracefully with advance notice.
Involuntary disruptions: Spot interruptions or hardware failures. These are abrupt and outside of Karpenter’s control.
If a job cannot tolerate stopping mid-run, configure Karpenter to avoid voluntary disruption for that workload.
Sedai Optimizes GPUs For You.
See how Sedai autonomously cuts costs for your AI infrastructure. Safely.
8 Best Practices for GPU Scaling
1. Use Spot-to-Spot Consolidation (With Caution)
Most people know Karpenter can swap On-Demand capacity for Spot instances. But the real pro move is using the SpotToSpotConsolidation feature. Note that this is still an experimental setting in current versions.
When enabled, Karpenter will continuously monitor the Spot market and swap your current Spot instance for a different Spot instance if it becomes cheaper.
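Assuming Karpenter is installed with its Helm chart, the feature gate is typically switched on through the chart's settings; a sketch of the values fragment (verify the exact key against your Karpenter version):

```yaml
# values.yaml fragment for the Karpenter Helm chart
settings:
  featureGates:
    spotToSpotConsolidation: true   # experimental: allows Spot-to-Spot swaps during consolidation
```

Consolidation itself still has to be enabled on the NodePool (for example, a consolidationPolicy that covers underutilized nodes) for the swap to actually happen.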
2. Pre-Fetch Images With EBS Snapshots
If you use Bottlerocket, AWS’s minimal, container-optimized operating system for Kubernetes, you can dramatically reduce GPU cold starts by using its dual-volume architecture.
Bottlerocket stores container images on a separate data volume, allowing you to pre-seed nodes with large ML images, snapshot that volume to EBS, and reference the snapshot ID in your EC2NodeClass so new GPU nodes boot with images already present.
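A sketch of how that might be wired into an EC2NodeClass, assuming a pre-built snapshot already exists (the snapshot ID, sizes, and device names are placeholders; check Bottlerocket's documented volume layout for your AMI):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket        # illustrative name
spec:
  # ...role, subnetSelectorTerms, and securityGroupSelectorTerms omitted for brevity...
  amiSelectorTerms:
    - alias: bottlerocket@latest   # pin this in production; see the AMI-pinning practice below
  blockDeviceMappings:
    # Bottlerocket OS root volume
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
    # Data volume restored from a snapshot that already contains the large ML images
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        snapshotID: snap-0123456789abcdef0   # placeholder snapshot ID
```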
3. Use Time-Slicing, But Know Its Limits
Not all GPU workloads require exclusive device access. While large training jobs may need full A100 or H100 instances, many inference and dev/test workloads use only a fraction of available GPU capacity.
GPU time-slicing allows multiple workloads to share a single physical GPU, increasing density and utilization for these lighter use cases. Recent Karpenter releases added native support for multi-resource requests and scheduling capabilities, simplifying GPU slice placement.
Time-slicing provides no memory isolation. One pod can exhaust VRAM and impact all others. For production workloads requiring isolation, use MIG instead.
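Time-slicing is configured in the NVIDIA device plugin (or GPU Operator), not in Karpenter itself. A minimal sketch of the sharing config the plugin consumes (the replica count is just an example value):

```yaml
# Config consumed by the NVIDIA k8s-device-plugin (typically mounted via a ConfigMap)
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU is advertised as 4 schedulable nvidia.com/gpu units
```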
4. Use the do-not-disrupt Annotation
To protect those 48-hour training jobs from voluntary disruptions (like consolidation), use the karpenter.sh/do-not-disrupt: "true" annotation.
When you apply this to your pod, Karpenter will not touch the node through automated consolidation or drift until the pod completes or enters a terminal phase.
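A minimal sketch of the annotation on a GPU pod (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-48h               # illustrative name
  annotations:
    karpenter.sh/do-not-disrupt: "true"   # blocks voluntary disruption until the pod finishes
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: my-training-image:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```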
5. Use Gang Scheduling for Distributed Training
Distributed training requires all pods to start at once. If Karpenter provisions 7 out of 8 nodes and the 8th fails, the 7 active GPUs sit idle, wasting expensive cycles.
Start using a gang scheduler like Kueue or the NVIDIA KAI Scheduler. These tools ensure Karpenter only provisions nodes if the entire group can be satisfied, preventing partial allocation waste.
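With Kueue, for example, a distributed training Job is submitted suspended and pointed at a queue, and Kueue only releases it once the whole group can be admitted. A sketch, assuming a LocalQueue named gpu-queue already exists:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-train        # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue   # assumes this LocalQueue exists
spec:
  parallelism: 8
  completions: 8
  suspend: true                  # Kueue unsuspends the Job only when all 8 pods fit at once
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-training-image:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```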
6. Pin AMIs To Prevent Drift
GPU workloads are extremely sensitive to NVIDIA driver versions.
If you use a dynamic AMI alias like @latest, a new EKS AMI release could trigger a drift event, causing Karpenter to recycle your entire GPU fleet to update the drivers — potentially breaking CUDA compatibility.
To maintain a stable environment, always pin your AMI version (e.g., al2023@v20240807) in the EC2NodeClass.
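In a v1 EC2NodeClass the pin would look roughly like this (the version string mirrors the article's example; substitute the release you have actually validated against your CUDA stack):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu                     # illustrative name
spec:
  # ...role, subnetSelectorTerms, and securityGroupSelectorTerms omitted for brevity...
  amiSelectorTerms:
    - alias: al2023@v20240807   # pinned AMI release; avoid @latest for GPU fleets
```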
7. Configure EFA to Fix NCCL Timeouts
Running distributed training on P4 or P5 instances requires Elastic Fabric Adapter (EFA) for NCCL’s low-latency collective operations.
A common cause of NCCL timeouts in EFA-enabled clusters is a misconfigured security group. If the EC2NodeClass does not include a self-referential rule allowing all ports and protocols, NCCL traffic can be blocked.
To avoid this, ensure the EC2NodeClass references a security group that explicitly allows inbound traffic from itself.
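The Karpenter side of this is only the selector; the self-referencing rule itself lives on the security group in AWS. A sketch, assuming the selected group already allows all inbound traffic from itself (the discovery tag is a placeholder):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-efa                 # illustrative name
spec:
  # ...role, amiSelectorTerms, and subnetSelectorTerms omitted for brevity...
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder tag; the selected SG must allow all traffic from itself
```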
8. Increase Pod Density With Prefix Assignment Mode
GPU instances often have a low limit on the number of pods they can support due to Elastic Network Interfaces (ENI) constraints. If you are time-slicing and want to run 20 pods on one node, you'll hit the IP limit fast.
Enable prefix assignment mode in your VPC CNI. This allows each ENI to handle more IP addresses, ensuring pod density is limited by the GPU's power, not the network's plumbing.
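Prefix assignment is a VPC CNI setting rather than a Karpenter one; it is usually enabled with an environment variable on the aws-node DaemonSet. A fragment of that container spec (verify against your CNI version):

```yaml
# Fragment of the aws-node DaemonSet container spec (kube-system namespace)
env:
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"   # ENIs hand out /28 prefixes instead of individual IPs
```

In practice this is often applied with kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true; you may also need to raise the node's max-pods limit (for example via the NodePool's kubelet configuration) so the scheduler actually uses the extra addresses.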
Why Karpenter’s Bin-Packing Model Is Critical for GPU Efficiency
Adopting Karpenter shifts capacity management from a node-centric model to a declarative, bin-packing approach. Instead of immediately provisioning new nodes, Karpenter evaluates existing capacity and rearranges workloads to pack resources more efficiently.
This mindset is especially important for GPUs, where capacity is expensive and often underutilized. Applying bin packing to GPUs prioritizes utilization, fitting compatible workloads together and reclaiming wasted capacity before scaling out.
In a world where GPU capacity is both scarce and expensive, scaling has gone from adding more nodes to orchestrating the right capacity at the right time. Karpenter’s bin-packing model turns GPU scaling on Amazon EKS from a guessing game into a precision system built for performance, efficiency, and cost control.