
Running Kubernetes on AWS Spot Instances: A Complete Guide for 2025


Sedai

Content Writer

February 7, 2025


Learn how to run Kubernetes clusters on AWS spot instances to achieve up to 90% cost savings while maintaining reliability, covering manual, automated, and autonomous approaches to instance selection and management.

Introduction

Running Kubernetes clusters on spot instances offers substantial cost savings while introducing unique operational considerations. This guide provides a comprehensive approach to implementing and managing Kubernetes workloads on AWS spot instances, combining strategic planning with practical implementation details.

Understanding Spot Instance Economics

AWS spot instances offer discounts of up to 90% compared to on-demand pricing by utilizing excess AWS capacity. This significant cost reduction comes with the understanding that instances can be reclaimed with just two minutes' notice when AWS needs the capacity back.

Cost Analysis Example

Typical cluster configuration:
- Region: us-east-1
- Instance types: m5.large, m5a.large, m4.large
- Base cluster: 10 nodes
- Average monthly utilization: 80%

Monthly Cost Comparison:
- On-Demand m5.large (10 nodes): $1,226.40
- Spot Instance (mixed): $245.28 - $367.92
- Potential monthly savings: $858.48 - $981.12

Note: Prices as of February 2025
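The arithmetic behind these figures is easy to reproduce. A minimal sketch, assuming the per-node hourly rate implied by the table (about $0.168/hour over a 730-hour month — derived from the table's on-demand total, not from the AWS price list, which varies by region and over time) and a spot discount of 70-80%:

```python
# Reproduce the monthly cost comparison above.
HOURS_PER_MONTH = 730
NODES = 10
ON_DEMAND_HOURLY = 0.168  # implied by $1,226.40 / (730 h * 10 nodes)

on_demand_monthly = ON_DEMAND_HOURLY * HOURS_PER_MONTH * NODES

# Spot capacity for these instance families typically trades at
# roughly 20-30% of on-demand (a 70-80% discount).
spot_low = on_demand_monthly * 0.20
spot_high = on_demand_monthly * 0.30

savings_low = on_demand_monthly - spot_high
savings_high = on_demand_monthly - spot_low

print(f"On-demand:  ${on_demand_monthly:,.2f}")
print(f"Spot range: ${spot_low:,.2f} - ${spot_high:,.2f}")
print(f"Savings:    ${savings_low:,.2f} - ${savings_high:,.2f}")
```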

Evaluating Workload Suitability

Not all Kubernetes workloads are suitable for spot instances. While the cost savings are attractive, the potential for instance interruption means you need to carefully evaluate which workloads can handle occasional disruption. Understanding these characteristics is crucial for successful spot instance implementation, as placing the wrong workload on spot instances can lead to service disruptions and reliability issues.

Characteristics of Spot-Suitable Workloads

1. Interruption Tolerance
   - Can handle occasional restarts
   - Maintain state externally or can recover state
   - Have reasonable startup times
   - Can be rescheduled to different nodes

2. Flexible Timing
   - Not strictly time-critical
   - Can retry failed operations
   - Have built-in fault tolerance

Recommended Types

Development and Testing
- CI/CD pipelines
- Development environments
- Load testing environments
- QA clusters
- Integration testing environments

Production Workloads
- Stateless web applications
- Background job processors
- Batch processing systems
- Data analysis workloads
- Machine learning training jobs
- Horizontally scalable services

Not Recommended for Spot
- Critical databases
- Payment processing systems
- Authentication services
- Session management services
- Single-instance stateful applications
- Single-replica deployments
- Applications with startup times exceeding 2 minutes
- Applications requiring guaranteed graceful shutdown
- Workloads with strict timing requirements
- Services that cannot tolerate occasional restarts

Implementation Guide

Successfully running Kubernetes on spot instances requires careful configuration of multiple components. Each component plays a crucial role in managing spot instances effectively, from handling interruptions to ensuring proper scaling. The following sections detail the essential configurations needed, walking through the setup of cluster components, node management tools, monitoring, and volume handling. Follow these implementations in order, as later configurations often depend on earlier ones.

1. Cluster Configuration

Basic EKS Configuration with Spot Support
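One way to express this is an eksctl cluster spec. A minimal sketch, with placeholder cluster and node group names, that pairs a small on-demand group for system components with a diversified spot group using the instance types from the cost example:

```yaml
# cluster.yaml -- create with: eksctl create cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: spot-demo            # placeholder name
  region: us-east-1
  version: "1.31"

managedNodeGroups:
  # Small on-demand group for critical system components
  - name: on-demand-system
    instanceType: m5.large
    desiredCapacity: 2
    labels:
      capacity-type: on-demand

  # Spot group with diversified instance types so a price spike or
  # capacity shortage in one pool does not drain the whole cluster
  - name: spot-workers
    instanceTypes: ["m5.large", "m5a.large", "m4.large"]
    spot: true
    minSize: 3
    maxSize: 10
    desiredCapacity: 10
    labels:
      capacity-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
```

The taint on the spot group means only workloads that explicitly tolerate interruption land there; everything else stays on the on-demand nodes.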

2. Karpenter Setup

Karpenter provides advanced node provisioning capabilities that are particularly valuable for spot instance management. Unlike traditional auto-scaling, Karpenter can rapidly respond to spot instance interruptions and maintain workload availability through intelligent node provisioning. The following configurations demonstrate how to set up Karpenter to effectively manage spot instances while maximizing cost savings and maintaining reliability.

NodePool Configuration
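A minimal NodePool sketch using the Karpenter v1 API (pool and node class names are placeholders). It restricts provisioning to spot capacity across the instance types discussed earlier and enables consolidation to keep spend down:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool                  # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # placeholder; must match your EC2NodeClass
      requirements:
        # Restrict this pool to spot capacity; Karpenter only falls back
        # to on-demand if "on-demand" is also listed here
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5a.large", "m4.large"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  disruption:
    # Replace underutilized or empty nodes with cheaper capacity
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: "100"                     # cap total pool size
```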

EC2NodeClass Configuration
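The NodePool references an EC2NodeClass that tells Karpenter which AMI, subnets, and security groups to use. A sketch, assuming discovery tags and an IAM role name that are placeholders for your own:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                         # referenced by the NodePool
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # EKS-optimized Amazon Linux 2023 AMI
  role: "KarpenterNodeRole-spot-demo"   # placeholder IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: spot-demo   # placeholder cluster tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: spot-demo
  tags:
    capacity-type: spot
```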

3. Node Termination Handler

The AWS Node Termination Handler is a critical component for managing spot instance lifecycles. It monitors for spot instance interruption notices and ensures workloads are gracefully drained before an instance is terminated. This handler is essential for maintaining application availability during spot instance reclamation. The following configuration sets up robust interruption handling with monitoring and notification capabilities.
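A sketch of Helm values for the aws-node-termination-handler chart (the webhook URL is a placeholder), enabling draining for spot interruptions, rebalance recommendations, and scheduled maintenance events:

```yaml
# values.yaml for the aws-node-termination-handler Helm chart.
# Install with a command along these lines (release name is a placeholder):
#   helm upgrade --install aws-nth eks/aws-node-termination-handler \
#     -n kube-system -f values.yaml
enableSpotInterruptionDraining: true
enableRebalanceMonitoring: true
enableRebalanceDraining: true
enableScheduledEventDraining: true

# Give pods time to shut down within the 2-minute interruption window
podTerminationGracePeriod: 90
nodeTerminationGracePeriod: 120

# Optional: post interruption events to a webhook (placeholder URL)
webhookURL: "https://hooks.example.com/spot-events"

# Emit Kubernetes events and Prometheus metrics for observability
emitKubernetesEvents: true
enablePrometheusServer: true
```

Note that if you run Karpenter with its native interruption queue, it handles spot interruptions itself; the handler remains useful for node groups Karpenter does not manage.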

4. Monitoring Setup

Effective monitoring is essential when running Kubernetes on spot instances. Without proper monitoring, you might miss critical events like imminent instance terminations or capacity constraints. The following monitoring configuration provides early warning systems for spot instance interruptions, price spikes, and capacity issues, allowing you to take proactive measures before problems affect your applications.
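One way to wire this up is a PrometheusRule for the Prometheus Operator. A sketch, assuming kube-state-metrics is installed (the alert names and thresholds are illustrative starting points, not tuned values):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-capacity-alerts       # placeholder name
  namespace: monitoring
spec:
  groups:
    - name: spot.rules
      rules:
        # Pending pods usually mean spot capacity was lost faster
        # than it could be replaced
        - alert: PodsPendingTooLong
          expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods pending for over 10 minutes; check spot capacity"
        # Ready node count dropped more than 30% in the last hour
        - alert: SpotCapacityDrop
          expr: |
            count(kube_node_status_condition{condition="Ready",status="true"})
              < 0.7 * count(kube_node_status_condition{condition="Ready",status="true"} offset 1h)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Ready node count dropped more than 30% in the last hour"
```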

5. Volume Management

Managing persistent storage with spot instances presents unique challenges. When instances are interrupted, you need to ensure that your persistent volumes can be properly detached and reattached to new nodes, potentially in different availability zones. The following configurations demonstrate how to set up resilient storage solutions that can handle spot instance interruptions while maintaining data accessibility.

Zone-Aware StatefulSet
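A sketch of a zone-aware StatefulSet (names and the database image are examples, not a recommendation to run critical databases on spot). The topology spread constraint keeps replicas in separate availability zones so a zonal spot reclamation never takes out every replica at once:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zone-aware-db              # placeholder name
spec:
  serviceName: zone-aware-db
  replicas: 3
  selector:
    matchLabels:
      app: zone-aware-db
  template:
    metadata:
      labels:
        app: zone-aware-db
    spec:
      # Spread replicas across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: zone-aware-db
      containers:
        - name: db
          image: postgres:16        # example workload
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: multi-az-ebs   # placeholder StorageClass name
        resources:
          requests:
            storage: 20Gi
```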

Multi-AZ StorageClass
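Because EBS volumes are zonal, the key setting is `WaitForFirstConsumer`: volume creation is delayed until a pod is scheduled, so each volume lands in the same zone as its replacement node. A sketch using the EBS CSI driver (the class name is a placeholder):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: multi-az-ebs               # placeholder name
provisioner: ebs.csi.aws.com
# Delay volume creation until a pod is scheduled, so the volume
# is provisioned in whichever AZ the replacement node lands in
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
parameters:
  type: gp3
  encrypted: "true"
```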

Testing and Validation

Testing your spot instance configuration is crucial for ensuring your applications can handle interruptions gracefully. Simply setting up the components isn't enough; you need to verify that your system responds correctly to spot instance reclamation. The following section provides tools and procedures for testing your spot instance setup, including simulated interruptions and monitoring practices.

1. AWS Fault Injection Simulator Configuration
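A sketch of an FIS experiment template that sends a real spot interruption notice to one randomly chosen spot instance (the IAM role ARN and tag key are placeholders for your own):

```json
{
  "description": "Interrupt a random spot worker to test resilience",
  "targets": {
    "SpotInstances": {
      "resourceType": "aws:ec2:spot-instance",
      "resourceTags": { "capacity-type": "spot" },
      "filters": [
        { "path": "State.Name", "values": ["running"] }
      ],
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "interruptSpotInstance": {
      "actionId": "aws:ec2:send-spot-instance-interruptions",
      "parameters": { "durationBeforeInterruption": "PT2M" },
      "targets": { "SpotInstances": "SpotInstances" }
    }
  },
  "stopConditions": [ { "source": "none" } ],
  "roleArn": "arn:aws:iam::123456789012:role/FisSpotRole"
}
```

Create it with `aws fis create-experiment-template --cli-input-json file://template.json`, then start an experiment and watch whether pods drain and reschedule within the two-minute window.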

2. Monitoring Commands
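A few commands worth keeping at hand during and after an interruption test (the Karpenter label key is an example; use whatever capacity label your provisioner applies):

```bash
# List nodes with their capacity type
kubectl get nodes -L karpenter.sh/capacity-type

# Check for pods stuck in Pending after an interruption
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Watch cluster events for cordons, drains, and evictions
kubectl get events --all-namespaces --sort-by=.lastTimestamp -w

# Review current spot prices for the instance types in use
aws ec2 describe-spot-price-history \
  --instance-types m5.large m5a.large m4.large \
  --product-descriptions "Linux/UNIX" \
  --query 'SpotPriceHistory[].[InstanceType,AvailabilityZone,SpotPrice]' \
  --output table
```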

Real-World Case Studies

Understanding how organizations successfully implement Kubernetes on spot instances provides valuable insights for your own implementation. Here we examine two major companies that have achieved significant success with spot instances.

Case Study 1: Delivery Hero's Global Scale Implementation

Delivery Hero, one of the world's largest food delivery networks, successfully transitioned their entire Kubernetes infrastructure to spot instances, demonstrating that spot instances can work at massive scale.

Implementation Approach

- Complete transition to spot instances within 6 months
- 90% of Kubernetes workloads running on Amazon EKS
- Focus on application resilience and high availability

Technical Strategy

1. Resilience Improvements

   - Multiple instance redundancy
   - Graceful termination scripts
   - Production-ready checklists
   - Termination notice handlers
   - De-scheduler implementation

2. Results

- 70% reduction in infrastructure costs
- Successfully handling 4x-5x traffic spikes
- Managing 390 different applications across 43 countries
- Improved focus on business innovation

Case Study 2: ITV's Broadcast Platform Transformation

ITV, the UK's largest commercial broadcaster, implemented spot instances to handle growing viewership while optimizing costs during the pandemic.

Implementation Highlights

1. Migration Strategy

   - 18-month phased migration
   - 75% workload migration to EKS
   - Incremental spot instance adoption

2. Results

- 60% cost reduction compared to on-demand
- $150,000 annual compute savings
- Deployment time reduced from 40 to 4 minutes
- Increased spot usage from 9% to 24%

Key Lessons from Both Organizations

1. Implementation Strategy

   - Start with non-critical workloads
   - Create comprehensive checklists
   - Focus on application resilience
   - Implement proper monitoring

2. Technical Considerations

   - Use mixed instance types
   - Implement robust auto-scaling
   - Focus on graceful termination
   - Maintain redundancy

3. Operational Best Practices

   - Regular review of spot usage
   - Continuous optimization
   - Strong monitoring practices
   - Clear incident response procedures

4. Success Factors

   - Clear migration strategy
   - Focus on application resilience
   - Proper auto-scaling implementation
   - Comprehensive monitoring
   - Regular optimization reviews

These case studies demonstrate that with proper planning and implementation, spot instances can successfully support large-scale production workloads while delivering significant cost savings, regardless of industry or scale.

Operational Best Practices

Successfully running Kubernetes on spot instances takes more than initial setup; it requires ongoing operational excellence. These best practices have been gathered from real-world experience running production workloads on spot instances. Following these guidelines will help you maintain reliability while maximizing cost savings.

1. Instance Type Selection

Choosing the right mix of instance types is a critical success factor for spot instance implementations. There are three main approaches to instance type selection, each offering different levels of operational efficiency:

Manual Selection

Traditional hands-on approach using AWS tools to analyze historical pricing and availability data across different instance types and availability zones. This method requires regular monitoring and manual adjustments but provides full control over instance selection:
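A sketch of the kind of AWS CLI queries this approach relies on. Spot placement scores rate how likely a request is to succeed; the price history shows recent volatility per pool:

```bash
# Rate the likelihood of successfully launching 10 instances of these
# types in us-east-1 (higher score = better availability outlook)
aws ec2 get-spot-placement-scores \
  --instance-types m5.large m5a.large m4.large \
  --target-capacity 10 \
  --region-names us-east-1

# Inspect recent price movements per instance type and AZ
aws ec2 describe-spot-price-history \
  --instance-types m5.large m5a.large m4.large \
  --product-descriptions "Linux/UNIX" \
  --output table
```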

Automated Selection

Tools like Karpenter and AWS Auto Scaling can automatically select instance types based on predefined rules and policies. This approach reduces manual intervention while maintaining control through configuration:
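With Karpenter, for example, rather than pinning exact instance types you can describe an acceptable range and let the provisioner pick the cheapest available spot capacity. A sketch of the requirements section of a NodePool (the categories and sizes are illustrative):

```yaml
# Excerpt from a Karpenter NodePool spec.template.spec
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]   # spot preferred, on-demand as fallback
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["m", "c", "r"]         # general, compute, memory optimized
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values: ["2", "4", "8"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["3"]                   # avoid older, less available generations
```

The wider the range, the more spot pools Karpenter can draw from, which directly reduces interruption risk.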

Autonomous Selection

Autonomous cloud optimization tools like Sedai can evaluate the full array of AWS instance types and their costs and recommend (in Copilot mode) or implement (in Autopilot mode) optimal instance selections. These systems use machine learning to:

  • Analyze historical usage patterns
  • Predict future capacity needs
  • Consider multiple factors including cost, availability, and performance
  • Automatically adjust instance selection based on real-time conditions
  • Provide proactive recommendations for optimization
  • Implement changes automatically while maintaining safety guardrails

2. Capacity Monitoring and Management

Similar to instance selection, capacity monitoring and management can be approached at different levels of automation:

Manual Monitoring

Basic Prometheus rules for alerting on capacity issues, requiring manual intervention when problems are detected:
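A sketch of a plain Prometheus rules file for this approach, assuming kube-state-metrics is installed (thresholds are examples sized for the 10-node cluster from the cost analysis):

```yaml
# prometheus-rules.yaml -- alerts only; a human responds to each page
groups:
  - name: spot-capacity-manual
    rules:
      - alert: PendingPodsAfterInterruption
        expr: sum(kube_pod_status_phase{phase="Pending"}) > 5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5 pods pending; spot capacity may be exhausted"
      - alert: NodeCountLow
        expr: count(kube_node_info) < 8   # example threshold for a 10-node baseline
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Cluster below expected node count; check spot availability"
```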

Automated Monitoring and Management

Cluster Autoscaler or Karpenter configurations that automatically respond to capacity changes:
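For the Cluster Autoscaler, the relevant flags live on the deployment's command line. A sketch (the cluster name in the auto-discovery tag is a placeholder):

```yaml
# Excerpt from the cluster-autoscaler Deployment: replaces lost
# spot capacity without human intervention
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/spot-demo
  - --balance-similar-node-groups      # spread across equivalent spot groups
  - --expander=least-waste             # pick the best-fitting group on scale-up
  - --skip-nodes-with-system-pods=false
  - --scale-down-unneeded-time=5m
```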

Autonomous Monitoring and Management

Advanced autonomous platforms can:

  • Predictively scale capacity based on historical patterns
  • Automatically balance workloads across instance types
  • Proactively migrate workloads before capacity issues occur
  • Optimize capacity allocation across multiple dimensions (cost, performance, reliability)
  • Self-tune monitoring thresholds based on application behavior
  • Automatically implement corrective actions while maintaining service levels

EKS Auto Mode Considerations

EKS Auto Mode simplifies Kubernetes node management but requires careful consideration when using spot instances. While Auto Mode supports spot instances, it requires separate node groups for spot and on-demand workloads - you cannot mix instance types within the same node group. This means you'll need to explicitly define which workloads run on spot instances by creating dedicated spot node groups, as shown in the following configuration example:

1. Configuration Example
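A sketch of a custom NodePool for an EKS Auto Mode cluster. Auto Mode ships built-in "general-purpose" and "system" pools that use on-demand capacity, so spot workloads get a dedicated pool like this one (the pool name and taint are placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads              # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default               # Auto Mode's built-in NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: eks.amazonaws.com/instance-category
          operator: In
          values: ["m", "c"]
      taints:
        # Only pods that tolerate this taint are scheduled onto spot
        - key: spot
          value: "true"
          effect: NoSchedule
  limits:
    cpu: "64"
```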

Troubleshooting Guide

Even with proper configuration and monitoring, issues can arise when running Kubernetes on spot instances. The following section covers common problems you might encounter and provides step-by-step resolution procedures. Understanding these troubleshooting patterns will help you maintain system reliability and quickly resolve issues when they occur.

Common Issues and Solutions

1. Node Termination Handler Not Working
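A few checks worth running, assuming the handler is deployed as a DaemonSet in kube-system (the instance ID below is a placeholder). A common cause in IMDS mode is an instance metadata hop limit of 1, which blocks pods from reaching the metadata service:

```bash
# 1. Confirm the handler is running on every node
kubectl get daemonset aws-node-termination-handler -n kube-system
kubectl get pods -n kube-system \
  -l app.kubernetes.io/name=aws-node-termination-handler

# 2. Check the logs for IMDS or permission errors
kubectl logs -n kube-system \
  -l app.kubernetes.io/name=aws-node-termination-handler --tail=50

# 3. Verify the IMDS hop limit; in IMDS mode the handler polls the
# metadata service from inside a pod, so a hop limit of 1 blocks it
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions.HttpPutResponseHopLimit'
```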

2. Pod Scheduling Problems
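When pods stay Pending after an interruption, the scheduler's own events usually explain why. A short checklist of commands (`<pod-name>` is a placeholder):

```bash
# 1. Read the scheduler's explanation for the unschedulable pod
kubectl describe pod <pod-name> | grep -A5 Events

# 2. Check whether spot nodes carry taints the pod does not tolerate
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# 3. Verify nodeSelector/affinity terms match existing node labels
kubectl get nodes --show-labels | grep capacity-type

# 4. Check whether a PodDisruptionBudget is blocking eviction
kubectl get pdb --all-namespaces
```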

Conclusion

Running Kubernetes on AWS spot instances requires careful planning and robust operational practices. By following this guide's configurations and monitoring recommendations, organizations can achieve significant cost savings while maintaining reliability. Remember to:

1. Start with non-critical workloads
2. Implement comprehensive monitoring
3. Use pod disruption budgets
4. Maintain instance type diversity
5. Regularly test failover scenarios

Additional Resources

- AWS Spot Instance Advisor

- EKS Workshop Spot Guide

- Karpenter Documentation

- AWS Node Termination Handler