Running Kubernetes clusters on spot instances offers substantial cost savings while introducing unique operational considerations. This guide provides a comprehensive approach to implementing and managing Kubernetes workloads on AWS spot instances, combining strategic planning with practical implementation details.
AWS spot instances offer discounts of up to 90% compared to on-demand pricing by utilizing excess AWS capacity. This significant cost reduction comes with the understanding that instances can be reclaimed with just two minutes' notice when AWS needs the capacity back.
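That two-minute interruption notice is delivered through the EC2 instance metadata service (and as an EventBridge event). A quick way to see the mechanism from a node is to poll the spot instance-action endpoint; this is a minimal sketch using IMDSv2, and the endpoint returns a 404 until an interruption is actually scheduled:
# Request an IMDSv2 token, then query the spot interruption notice endpoint.
# Returns 404 until an interruption is pending; otherwise returns the action and time.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action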
Typical cluster configuration:
- Region: us-east-1
- Instance types: m5.large, m5a.large, m4.large
- Base cluster: 10 nodes
- Average monthly utilization: 80%
Monthly Cost Comparison:
On-Demand m5.large (10 nodes): $1,226.40
Spot Instance (mixed): $245.28 - $367.92
Potential monthly savings: $858.48 - $981.12
Note: Prices as of February 2025
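For reference, the on-demand figure above is simply node count times an hourly rate times roughly 730 hours in a month. The quick check below uses the blended per-node rate implied by the table (about $0.168/hour), which is illustrative rather than an official list price:
# Rough sanity check: nodes x hourly rate x ~730 hours/month
echo "10 * 0.168 * 730" | bc   # ≈ 1226.40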
Not all Kubernetes workloads are suitable for spot instances. While the cost savings are attractive, the potential for instance interruption means you need to carefully evaluate which workloads can handle occasional disruption. Understanding these characteristics is crucial for successful spot instance implementation, as placing the wrong workload on spot instances can lead to service disruptions and reliability issues.
1. Interruption Tolerance
- Can handle occasional restarts
- Maintain state externally or can recover state
- Have reasonable startup times
- Can be rescheduled to different nodes
2. Flexible Timing
- Not strictly time-critical
- Can retry failed operations
- Have built-in fault tolerance
Development and Testing
- CI/CD pipelines
- Development environments
- Load testing environments
- QA clusters
- Integration testing environments
Production Workloads
- Stateless web applications
- Background job processors
- Batch processing systems
- Data analysis workloads
- Machine learning training jobs
- Horizontally scalable services
Not Recommended for Spot
- Critical databases
- Payment processing systems
- Authentication services
- Session management services
- Single-instance stateful applications
- Single-replica deployments
- Applications with startup times exceeding 2 minutes
- Applications requiring guaranteed graceful shutdown
- Workloads with strict timing requirements
- Services that cannot tolerate occasional restarts
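To make these criteria concrete, here is a minimal sketch of a spot-friendly stateless Deployment. The name, labels, and image are illustrative, and the toleration matches the spot-instance taint applied to the spot node group later in this guide:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend            # illustrative name
spec:
  replicas: 3                   # multiple replicas so one interruption never takes the service down
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      terminationGracePeriodSeconds: 90   # finish in-flight work well inside the 2-minute notice
      tolerations:
        - key: spot-instance
          value: "true"
          effect: PreferNoSchedule
      containers:
        - name: web
          image: nginx:stable   # placeholder image
          ports:
            - containerPort: 80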
Successfully running Kubernetes on spot instances requires careful configuration of multiple components. Each component plays a crucial role in managing spot instances effectively, from handling interruptions to ensuring proper scaling. The following sections detail the essential configurations needed, walking through the setup of cluster components, node management tools, monitoring, and volume handling. Follow these implementations in order, as later configurations often depend on earlier ones.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spot-cluster
  region: us-east-1
  version: "1.27"
managedNodeGroups:
  # On-demand node group for critical workloads
  - name: critical-workload-od
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    availabilityZones: ["us-east-1a", "us-east-1b"]
    iam:
      withAddonPolicies:
        autoScaler: true
        spotTermination: true
    labels:
      workload-type: critical
  # Spot instance node group
  - name: spot-workers
    instanceTypes: ["m5.large", "m5a.large", "m5d.large", "m4.large"]
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
    spot: true
    availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
    iam:
      withAddonPolicies:
        autoScaler: true
        spotTermination: true
    labels:
      workload-type: spot-eligible
    taints:
      - key: spot-instance
        value: "true"
        effect: PreferNoSchedule
    spotAllocationStrategy: capacity-optimized
    capacityRebalance: true
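With the file above saved as spot-cluster.yaml (the filename is arbitrary), creating the cluster and both node groups is a single eksctl command:
# Create the cluster, the on-demand group, and the spot group from the config file
eksctl create cluster -f spot-cluster.yaml
# Add or recreate an individual node group later from the same file
eksctl create nodegroup -f spot-cluster.yaml --include=spot-workers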
Karpenter provides advanced node provisioning capabilities that are particularly valuable for spot instance management. Unlike traditional auto-scaling, Karpenter can rapidly respond to spot instance interruptions and maintain workload availability through intelligent node provisioning. The following configurations demonstrate how to set up Karpenter to effectively manage spot instances while maximizing cost savings and maintaining reliability.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-node-pool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  tags:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
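Once Karpenter is managing capacity, workloads opt into spot nodes through the karpenter.sh/capacity-type node label. The following sketch is illustrative; the Deployment name, labels, and image are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker              # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # schedule only onto spot nodes provisioned by Karpenter
      containers:
        - name: worker
          image: busybox:stable            # placeholder image
          command: ["sh", "-c", "sleep 3600"]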
The AWS Node Termination Handler is a critical component for managing spot instance lifecycles. It monitors for spot instance interruption notices and ensures workloads are gracefully drained before an instance is terminated. This handler is essential for maintaining application availability during spot instance reclamation. The following configuration sets up robust interruption handling with monitoring and notification capabilities.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  interval: 5m   # reconciliation interval required by Flux
  chart:
    spec:
      chart: aws-node-termination-handler
      sourceRef:
        kind: HelmRepository
        name: eks-charts
      version: "0.21.0"
  values:
    enableSpotInterruptionDraining: true
    enableRebalanceMonitoring: true
    enableScheduledEventDraining: true
    prometheusMetrics:
      enabled: true
    webhookURL: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    nodeSelector:
      kubernetes.io/os: linux
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
    tolerations:
      - operator: "Exists"
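Draining is only graceful if evictions are throttled, so pair the termination handler with PodDisruptionBudgets for services running on spot nodes. A minimal sketch, assuming the illustrative web-frontend labels used earlier:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 2              # keep at least 2 replicas running during node drains
  selector:
    matchLabels:
      app: web-frontend        # illustrative label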
Effective monitoring is essential when running Kubernetes on spot instances. Without proper monitoring, you might miss critical events like imminent instance terminations or capacity constraints. The following monitoring configuration provides early warning systems for spot instance interruptions, price spikes, and capacity issues, allowing you to take proactive measures before problems affect your applications.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-monitoring
spec:
  groups:
    - name: spot.rules
      rules:
        - alert: SpotTerminationImmediate
          expr: aws_spot_termination_notice < 120
          labels:
            severity: critical
          annotations:
            description: "Spot instance {{ $labels.instance }} will be terminated in {{ $value }} seconds"
        - alert: SpotPriceSpike
          expr: avg_over_time(aws_spot_price[1h]) > avg_over_time(aws_spot_price[24h]) * 1.5
          labels:
            severity: warning
          annotations:
            description: "Spot price spike detected for instance type {{ $labels.instance_type }}"
Managing persistent storage with spot instances presents unique challenges. When instances are interrupted, you need to ensure that your persistent volumes can be properly detached and reattached to new nodes, potentially in different availability zones. The following configurations demonstrate how to set up resilient storage solutions that can handle spot instance interruptions while maintaining data accessibility.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zone-aware-stateful
spec:
  serviceName: zone-aware-stateful   # a matching headless Service is assumed to exist
  replicas: 3
  selector:
    matchLabels:
      app: zone-aware-stateful
  template:
    metadata:
      labels:
        app: zone-aware-stateful
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a
      containers:
        - name: app                        # placeholder workload container
          image: busybox:stable
          command: ["sh", "-c", "sleep infinity"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ebs-claim-us-east-1a
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc-multi-az
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: ext4
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a
          - us-east-1b
          - us-east-1c
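Because the StorageClass uses WaitForFirstConsumer, a claim against it is not bound to a zone until the first pod that uses it is scheduled, which keeps the volume and its replacement node in the same zone after an interruption. A minimal claim sketch (name and size are illustrative):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim              # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc-multi-az
  resources:
    requests:
      storage: 20Gi            # illustrative size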
Testing your spot instance configuration is crucial for ensuring your applications can handle interruptions gracefully. Simply setting up the components isn't enough; you need to verify that your system responds correctly to spot instance reclamation. The following section provides tools and procedures for testing your spot instance setup, including simulated interruptions and monitoring practices.
{
  "experimentTemplate": {
    "description": "Spot Instance Interruption Test",
    "targets": {
      "SpotInstances": {
        "resourceType": "aws:ec2:spot-instance",
        "selectionMode": "ALL",
        "resourceTags": {
          "kubernetes.io/cluster": "your-cluster-name"
        }
      }
    },
    "actions": {
      "SpotInterrupt": {
        "actionId": "aws:ec2:send-spot-instance-interruptions",
        "parameters": {
          "durationBeforeInterruption": "PT2M"
        },
        "targets": {
          "SpotInstances": "SpotInstances"
        }
      }
    },
    "stopConditions": [{
      "source": "none"
    }]
  }
}
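To run this test with AWS Fault Injection Service, register the template and then start an experiment from the CLI. Note that create-experiment-template expects the template fields (plus an IAM roleArn, not shown above) at the top level rather than wrapped in an experimentTemplate object, so adjust the JSON before registering it; the commands below are a sketch:
# Register the experiment template (JSON adjusted as noted above)
aws fis create-experiment-template --cli-input-json file://spot-interruption-test.json
# Start the experiment using the template ID returned by the previous command
aws fis start-experiment --experiment-template-id EXT123456789012example
While the experiment runs, the commands below show how the cluster reacts.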
# Watch node status
kubectl get nodes -l karpenter.sh/capacity-type=spot --watch
# Monitor pod migrations
kubectl get pods -o wide --watch
# Check termination handler logs
kubectl logs -n kube-system -l app=aws-node-termination-handler
Understanding how organizations successfully implement Kubernetes on spot instances provides valuable insights for your own implementation. Here we examine two major companies that have achieved significant success with spot instances.
Delivery Hero, one of the world's largest food delivery networks, successfully transitioned their entire Kubernetes infrastructure to spot instances, demonstrating that spot instances can work at massive scale.
Implementation approach:
- Complete transition to spot instances within 6 months
- 90% of Kubernetes workloads running on Amazon EKS
- Focus on application resilience and high availability
- Multiple instance redundancy
- Graceful termination scripts
- Production-ready checklists
- Termination notice handlers
- De-scheduler implementation
Results:
- 70% reduction in infrastructure costs
- Successfully handling 4x-5x traffic spikes
- Managing 390 different applications across 43 countries
- Improved focus on business innovation
ITV, the UK's largest commercial broadcaster, implemented spot instances to handle growing viewership while optimizing costs during the pandemic.
Implementation approach:
- 18-month phased migration
- 75% workload migration to EKS
- Incremental spot instance adoption
Results:
- 60% cost reduction compared to on-demand
- $150,000 annual compute savings
- Deployment time reduced from 40 to 4 minutes
- Increased spot usage from 9% to 24%
Key lessons from these implementations:
- Start with non-critical workloads
- Create comprehensive checklists
- Focus on application resilience
- Implement proper monitoring
- Use mixed instance types
- Implement robust auto-scaling
- Focus on graceful termination
- Maintain redundancy
- Regular review of spot usage
- Continuous optimization
- Strong monitoring practices
- Clear incident response procedures
- Clear migration strategy
- Focus on application resilience
- Proper auto-scaling implementation
- Comprehensive monitoring
- Regular optimization reviews
These case studies demonstrate that with proper planning and implementation, spot instances can successfully support large-scale production workloads while delivering significant cost savings, regardless of industry or scale.
Successfully running Kubernetes on spot instances requires more than just initial setup - it requires ongoing operational excellence. These best practices have been gathered from real-world experience running production workloads on spot instances. Following these guidelines will help you maintain reliability while maximizing cost savings.
Choosing the right mix of instance types is a critical success factor for spot instance implementations. There are three main approaches to instance type selection, each offering different levels of operational efficiency:
Traditional hands-on approach using AWS tools to analyze historical pricing and availability data across different instance types and availability zones. This method requires regular monitoring and manual adjustments but provides full control over instance selection:
# Check spot pricing history
aws ec2 describe-spot-price-history \
  --instance-types m4.xlarge m5.xlarge m6a.xlarge \
  --product-descriptions "Linux/UNIX" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,InstanceType]' \
  --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
  --output table
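Price history is backward-looking; Spot placement scores add a forward-looking signal of how likely a capacity request is to be fulfilled, which helps when comparing candidate instance families. A sketch using the EC2 get-spot-placement-scores API:
# Rate the likelihood (scores 1-10) of getting 10 spot instances for these types in us-east-1
aws ec2 get-spot-placement-scores \
  --instance-types m5.large m5a.large m4.large \
  --target-capacity 10 \
  --region-names us-east-1 \
  --output table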
Tools like Karpenter and AWS Auto Scaling can automatically select instance types based on predefined rules and policies. This approach reduces manual intervention while maintaining control through configuration:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m6g", "c5", "c5a", "r5", "r5a"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
      nodeClassRef:
        name: default   # references the EC2NodeClass defined earlier
Autonomous cloud optimization tools like Sedai can evaluate the full array of AWS instance types and their costs and either recommend (in Copilot mode) or implement (in Autopilot mode) optimal instance selections. These systems use machine learning to:
- Optimize compute, storage and data
- Choose copilot or autopilot execution
- Continuously improve with reinforcement learning
Similar to instance selection, capacity monitoring and management can be approached at different levels of automation:
Basic Prometheus rules for alerting on capacity issues, requiring manual intervention when problems are detected:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-capacity
spec:
  groups:
    - name: spot.capacity
      rules:
        - alert: SpotCapacityLow
          expr: karpenter_nodes_capacity{capacity_type="spot"} / karpenter_nodes_desired{capacity_type="spot"} < 0.7
          for: 15m
          labels:
            severity: warning
          annotations:
            description: "Spot capacity is below 70% of desired capacity"
Cluster Autoscaler or Karpenter configurations that automatically respond to capacity changes:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000
    memory: 1000Gi
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        name: default   # references the EC2NodeClass defined earlier
Advanced autonomous platforms can take this a step further, managing spot capacity end to end and adjusting the spot and on-demand mix automatically as availability and pricing change.
EKS Auto Mode simplifies Kubernetes node management but requires careful consideration when using spot instances. While Auto Mode supports spot instances, it requires separate node groups for spot and on-demand workloads; you cannot mix spot and on-demand capacity within the same node group. This means you'll need to explicitly define which workloads run on spot instances by creating dedicated spot node groups, as shown in the following configuration example:
# Separate node groups required for spot and on-demand
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeGroup
metadata:
  name: spot-workloads
spec:
  clusterName: my-cluster
  nodeRole: arn:aws:iam::111122223333:role/eks-node-group-role
  capacityType: SPOT
  autoScaling:
    minSize: 1
    maxSize: 10
  instanceTypes: AUTO
---
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeGroup
metadata:
  name: ondemand-workloads
spec:
  clusterName: my-cluster
  nodeRole: arn:aws:iam::111122223333:role/eks-node-group-role
  capacityType: ON_DEMAND
  autoScaling:
    minSize: 1
    maxSize: 5
  instanceTypes: AUTO
Even with proper configuration and monitoring, issues can arise when running Kubernetes on spot instances. The following section covers common problems you might encounter and provides step-by-step resolution procedures. Understanding these troubleshooting patterns will help you maintain system reliability and quickly resolve issues when they occur.
If spot interruptions aren't being drained gracefully, verify that the Node Termination Handler is healthy and has the permissions it needs:
# Check NTH pods
kubectl get pods -n kube-system | grep termination-handler
# Verify IAM roles
aws iam get-role --role-name NodeTerminationHandlerRole
# Check NTH logs
kubectl logs -n kube-system -l app=aws-node-termination-handler
If pods fail to reschedule after an interruption, inspect pod events, node capacity, and affinity rules:
# Check pod events
kubectl describe pod POD_NAME
# Verify node capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.cpu,ALLOCATABLE:.status.allocatable.cpu
# Check node affinity rules
kubectl get pod POD_NAME -o yaml | grep -A10 affinity
Running Kubernetes on AWS spot instances requires careful planning and robust operational practices. By following this guide's configurations and monitoring recommendations, organizations can achieve significant cost savings while maintaining reliability. Remember to:
1. Start with non-critical workloads
2. Implement comprehensive monitoring
3. Use pod disruption budgets
4. Maintain instance type diversity
5. Regularly test failover scenarios