Running Kubernetes on AWS Spot Instances: A Complete Guide for 2025

Last updated

February 18, 2025

Introduction

Running Kubernetes clusters on spot instances offers substantial cost savings while introducing unique operational considerations. This guide provides a comprehensive approach to implementing and managing Kubernetes workloads on AWS spot instances, combining strategic planning with practical implementation details.

Understanding Spot Instance Economics

AWS spot instances offer discounts of up to 90% compared to on-demand pricing by utilizing excess AWS capacity. This significant cost reduction comes with the understanding that instances can be reclaimed with just two minutes' notice when AWS needs the capacity back.
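On the instance itself, a pending interruption can be detected through the EC2 instance metadata service: the spot/instance-action endpoint returns 404 until a notice is issued, then a small JSON document describing the action and time. A minimal sketch using IMDSv2 (the parsing helper is illustrative, not part of any AWS tooling):

```shell
# Query the EC2 instance metadata service (IMDSv2) for a spot interruption notice.
IMDS="http://169.254.169.254/latest"

check_spot_interruption() {
  # Returns 404 until AWS issues a notice, then JSON such as:
  # {"action": "terminate", "time": "2025-02-18T10:00:00Z"}
  local token
  token=$(curl -s -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/spot/instance-action"
}

# Helper: extract the termination timestamp from a notice payload.
parse_action_time() {
  sed -n 's/.*"time" *: *"\([^"]*\)".*/\1/p'
}
```

In practice the AWS Node Termination Handler (covered later in this guide) performs this polling for you; the snippet simply shows where the two-minute warning comes from.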

Cost Analysis Example

Typical cluster configuration:
- Region: us-east-1
- Instance types: m5.large, m5a.large, m4.large
- Base cluster: 10 nodes
- Average monthly utilization: 80%

Monthly Cost Comparison:
On-Demand m5.large (10 nodes): $1,226.40
Spot Instance (mixed): $245.28 - $367.92
Potential monthly savings: $858.48 - $981.12

Note: Prices as of February 2025
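The savings range is simply the difference between the on-demand total and the spot band; a quick awk check (figures copied from the table above) confirms the arithmetic:

```shell
# Verify the monthly savings range from the cost comparison above.
awk -v od=1226.40 -v spot_lo=245.28 -v spot_hi=367.92 'BEGIN {
  printf "Savings: $%.2f - $%.2f\n", od - spot_hi, od - spot_lo
}'
# Prints: Savings: $858.48 - $981.12
```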

Evaluating Workload Suitability

Not all Kubernetes workloads are suitable for spot instances. While the cost savings are attractive, the potential for instance interruption means you need to carefully evaluate which workloads can handle occasional disruption. Understanding these characteristics is crucial for successful spot instance implementation, as placing the wrong workload on spot instances can lead to service disruptions and reliability issues.

Characteristics of Spot-Suitable Workloads

1. Interruption Tolerance
  - Can handle occasional restarts
  - Maintain state externally or can recover state
  - Have reasonable startup times
  - Can be rescheduled to different nodes

2. Flexible Timing
  - Not strictly time-critical
  - Can retry failed operations
  - Have built-in fault tolerance

Recommended Types

Development and Testing
- CI/CD pipelines
- Development environments
- Load testing environments
- QA clusters
- Integration testing environments

Production Workloads
- Stateless web applications
- Background job processors
- Batch processing systems
- Data analysis workloads
- Machine learning training jobs
- Horizontally scalable services

Not Recommended for Spot
- Critical databases
- Payment processing systems
- Authentication services
- Session management services
- Single-instance stateful applications
- Single-replica deployments
- Applications with startup times exceeding 2 minutes
- Applications requiring guaranteed graceful shutdown
- Workloads with strict timing requirements
- Services that cannot tolerate occasional restarts

Implementation Guide

Successfully running Kubernetes on spot instances requires careful configuration of multiple components. Each component plays a crucial role in managing spot instances effectively, from handling interruptions to ensuring proper scaling. The following sections detail the essential configurations needed, walking through the setup of cluster components, node management tools, monitoring, and volume handling. Follow these implementations in order, as later configurations often depend on earlier ones.

1. Cluster Configuration

Basic EKS Configuration with Spot Support

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spot-cluster
  region: us-east-1
  version: "1.27"

managedNodeGroups:
  # On-demand node group for critical workloads
  - name: critical-workload-od
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    availabilityZones: ["us-east-1a", "us-east-1b"]
    iam:
      withAddonPolicies:
        autoScaler: true
        spotTermination: true
    labels:
      workload-type: critical
    
  # Spot instance node group
  - name: spot-workers
    instanceTypes: ["m5.large", "m5a.large", "m5d.large", "m4.large"]
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
    spot: true
    availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
    iam:
      withAddonPolicies:
        autoScaler: true
        spotTermination: true
    labels:
      workload-type: spot-eligible
    taints:
      - key: spot-instance
        value: "true"
        effect: PreferNoSchedule
    spotAllocationStrategy: capacity-optimized
    capacityRebalance: true
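Because the spot node group above carries a spot-instance taint (PreferNoSchedule is a soft taint, so untolerated pods may still land there under pressure), workloads should opt in explicitly. A minimal Deployment targeting the spot pool might look like this (the app name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-tolerant-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spot-tolerant-app
  template:
    metadata:
      labels:
        app: spot-tolerant-app
    spec:
      # Match the label and taint defined on the spot-workers node group
      nodeSelector:
        workload-type: spot-eligible
      tolerations:
        - key: spot-instance
          operator: Equal
          value: "true"
          effect: PreferNoSchedule
      containers:
        - name: app
          image: nginx:1.25  # placeholder workload
```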


2. Karpenter Setup

Karpenter provides advanced node provisioning capabilities that are particularly valuable for spot instance management. Unlike traditional auto-scaling, Karpenter can rapidly respond to spot instance interruptions and maintain workload availability through intelligent node provisioning. The following configurations demonstrate how to set up Karpenter to effectively manage spot instances while maximizing cost savings and maintaining reliability.

NodePool Configuration

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-node-pool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h
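Workloads are steered onto Karpenter-provisioned spot capacity through the well-known capacity-type node label. A sketch of a batch pod that requests spot nodes (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  # Karpenter labels its nodes with karpenter.sh/capacity-type
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo processing; sleep 3600"]
```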

EC2NodeClass Configuration

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"  # node IAM role created during Karpenter installation
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  tags:
    karpenter.sh/discovery: "${CLUSTER_NAME}"

3. Node Termination Handler

The AWS Node Termination Handler is a critical component for managing spot instance lifecycles. It monitors for spot instance interruption notices and ensures workloads are gracefully drained before an instance is terminated. This handler is essential for maintaining application availability during spot instance reclamation. The following configuration sets up robust interruption handling with monitoring and notification capabilities.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  chart:
    spec:
      chart: aws-node-termination-handler
      sourceRef:
        kind: HelmRepository
        name: eks-charts
      version: "0.21.0"
  values:
    enableSpotInterruptionDraining: true
    enableRebalanceMonitoring: true
    enableScheduledEventDraining: true
    enablePrometheusServer: true
    webhookURL: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    nodeSelector:
      kubernetes.io/os: linux
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
    tolerations:
      - operator: "Exists"
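Draining only helps if pods shut down cleanly inside the two-minute window. Keep the termination grace period of spot-hosted containers comfortably under that window and handle SIGTERM in the application; a sketch (names and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker
spec:
  terminationGracePeriodSeconds: 90   # leave headroom inside the 120s notice
  containers:
    - name: worker
      image: my-app:latest            # placeholder image
      lifecycle:
        preStop:
          exec:
            # Pause briefly so in-flight requests can finish before SIGTERM
            command: ["sh", "-c", "sleep 10"]
```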

4. Monitoring Setup

Effective monitoring is essential when running Kubernetes on spot instances. Without proper monitoring, you might miss critical events like imminent instance terminations or capacity constraints. The following monitoring configuration provides early warning systems for spot instance interruptions, price spikes, and capacity issues, allowing you to take proactive measures before problems affect your applications.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-monitoring
spec:
  groups:
  - name: spot.rules
    rules:
    - alert: SpotTerminationImmediate
      expr: aws_spot_termination_notice < 120
      labels:
        severity: critical
      annotations:
        description: "Spot instance {{ $labels.instance }} will be terminated in {{ $value }} seconds"
    
    - alert: SpotPriceSpike
      expr: avg_over_time(aws_spot_price[1h]) > 
            avg_over_time(aws_spot_price[24h]) * 1.5
      labels:
        severity: warning
      annotations:
        description: "Spot price spike detected for instance type {{ $labels.instance_type }}"

5. Volume Management

Managing persistent storage with spot instances presents unique challenges. When instances are interrupted, you need to ensure that your persistent volumes can be properly detached and reattached to new nodes, potentially in different availability zones. The following configurations demonstrate how to set up resilient storage solutions that can handle spot instance interruptions while maintaining data accessibility.

Zone-Aware StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zone-aware-stateful
spec:
  serviceName: zone-aware-stateful
  replicas: 3
  selector:
    matchLabels:
      app: zone-aware-stateful
  template:
    metadata:
      labels:
        app: zone-aware-stateful
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
      containers:
      - name: app
        image: nginx:1.25  # placeholder workload
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ebs-claim-us-east-1a


Multi-AZ StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc-multi-az
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: ext4
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
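A claim against this StorageClass defers volume creation until a pod is scheduled, so the EBS volume is provisioned in whichever allowed zone the node lands in:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: multi-az-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc-multi-az
  resources:
    requests:
      storage: 20Gi
```

Note that an EBS volume remains zonal once created; WaitForFirstConsumer only controls where it is first provisioned, so replacement nodes must still come up in the volume's zone.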

Testing and Validation

Testing your spot instance configuration is crucial for ensuring your applications can handle interruptions gracefully. Simply setting up the components isn't enough; you need to verify that your system responds correctly to spot instance reclamation. The following section provides tools and procedures for testing your spot instance setup, including simulated interruptions and monitoring practices.

1. AWS Fault Injection Simulator Configuration

{
    "description": "Spot Instance Interruption Test",
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
    "targets": {
        "SpotInstances": {
            "resourceType": "aws:ec2:spot-instance",
            "selectionMode": "ALL",
            "resourceTags": {
                "kubernetes.io/cluster/your-cluster-name": "owned"
            }
        }
    },
    "actions": {
        "SpotInterrupt": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "parameters": {
                "durationBeforeInterruption": "PT2M"
            },
            "targets": {
                "SpotInstances": "SpotInstances"
            }
        }
    },
    "stopConditions": [{
        "source": "none"
    }]
}

Register this JSON with aws fis create-experiment-template --cli-input-json, then trigger it with aws fis start-experiment.

2. Monitoring Commands

# Watch node status
kubectl get nodes -l karpenter.sh/capacity-type=spot --watch

# Monitor pod migrations
kubectl get pods -o wide --watch

# Check termination handler logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-node-termination-handler

Real-World Case Studies

Understanding how organizations successfully implement Kubernetes on spot instances provides valuable insights for your own implementation. Here we examine two major companies that have achieved significant success with spot instances.

Case Study 1: Delivery Hero's Global Scale Implementation

Delivery Hero, one of the world's largest food delivery networks, successfully transitioned their entire Kubernetes infrastructure to spot instances, demonstrating that spot instances can work at massive scale.

Implementation Approach

- Complete transition to spot instances within 6 months
- 90% of Kubernetes workloads running on Amazon EKS
- Focus on application resilience and high availability

Technical Strategy

1. Resilience Improvements

  - Multiple instance redundancy
  - Graceful termination scripts
  - Production-ready checklists
  - Termination notice handlers
  - De-scheduler implementation

2. Results

- 70% reduction in infrastructure costs
- Successfully handling 4x-5x traffic spikes
- Managing 390 different applications across 43 countries
- Improved focus on business innovation

Case Study 2: ITV's Broadcast Platform Transformation

ITV, the UK's largest commercial broadcaster, implemented spot instances to handle growing viewership while optimizing costs during the pandemic.

Implementation Highlights

1. Migration Strategy

  - 18-month phased migration
  - 75% workload migration to EKS
  - Incremental spot instance adoption

2. Results

- 60% cost reduction compared to on-demand
- $150,000 annual compute savings
- Deployment time reduced from 40 to 4 minutes
- Increased spot usage from 9% to 24%

Key Lessons from Both Organizations

1. Implementation Strategy

  - Start with non-critical workloads
  - Create comprehensive checklists
  - Focus on application resilience
  - Implement proper monitoring

2. Technical Considerations

  - Use mixed instance types
  - Implement robust auto-scaling
  - Focus on graceful termination
  - Maintain redundancy

3. Operational Best Practices

  - Regular review of spot usage
  - Continuous optimization
  - Strong monitoring practices
  - Clear incident response procedures

4. Success Factors

  - Clear migration strategy
  - Focus on application resilience
  - Proper auto-scaling implementation
  - Comprehensive monitoring
  - Regular optimization reviews

These case studies demonstrate that with proper planning and implementation, spot instances can successfully support large-scale production workloads while delivering significant cost savings, regardless of industry or scale.

Operational Best Practices

Successfully running Kubernetes on spot instances requires more than initial setup; it demands ongoing operational excellence. The following best practices come from real-world experience running production workloads on spot instances, and will help you maintain reliability while maximizing cost savings.

1. Instance Type Selection

Choosing the right mix of instance types is a critical success factor for spot instance implementations. There are three main approaches to instance type selection, each offering different levels of operational efficiency:

Manual Selection

Traditional hands-on approach using AWS tools to analyze historical pricing and availability data across different instance types and availability zones. This method requires regular monitoring and manual adjustments but provides full control over instance selection:

# Check spot pricing history
aws ec2 describe-spot-price-history \
    --instance-types m4.xlarge m5.xlarge m6a.xlarge \
    --product-description "Linux/UNIX" \
    --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,InstanceType]' \
    --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
    --output table

Automated Selection

Tools like Karpenter and AWS Auto Scaling can automatically select instance types based on predefined rules and policies. This approach reduces manual intervention while maintaining control through configuration:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m6g", "c5", "c5a", "r5", "r5a"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]

Autonomous Selection

Autonomous cloud optimization tools like Sedai can evaluate the full range of AWS instance types and their costs, then recommend (in Copilot mode) or implement (in Autopilot mode) optimal instance selections. These systems use machine learning to:

  • Analyze historical usage patterns
  • Predict future capacity needs
  • Consider multiple factors including cost, availability, and performance
  • Automatically adjust instance selection based on real-time conditions
  • Provide proactive recommendations for optimization
  • Implement changes automatically while maintaining safety guardrails

2. Capacity Monitoring and Management

Similar to instance selection, capacity monitoring and management can be approached at different levels of automation:

Manual Monitoring

Basic Prometheus rules for alerting on capacity issues, requiring manual intervention when problems are detected:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-capacity
spec:
  groups:
  - name: spot.capacity
    rules:
    - alert: SpotCapacityLow
      expr: karpenter_nodes_capacity{capacity_type="spot"} / karpenter_nodes_desired{capacity_type="spot"} < 0.7
      for: 15m
      labels:
        severity: warning
      annotations:
        description: "Spot capacity is below 70% of desired capacity"

Automated Monitoring and Management

Cluster Autoscaler or Karpenter configurations that automatically respond to capacity changes:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000
    memory: 1000Gi
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

Autonomous Monitoring and Management

Advanced autonomous platforms can:

  • Predictively scale capacity based on historical patterns
  • Automatically balance workloads across instance types
  • Proactively migrate workloads before capacity issues occur
  • Optimize capacity allocation across multiple dimensions (cost, performance, reliability)
  • Self-tune monitoring thresholds based on application behavior
  • Automatically implement corrective actions while maintaining service levels

EKS Auto Mode Considerations

EKS Auto Mode simplifies Kubernetes node management, but spot usage still has to be configured explicitly. Auto Mode provisions capacity through Karpenter-style NodePools rather than traditional node groups, and its built-in pools run on-demand capacity. To run workloads on spot, you define a dedicated NodePool that requests spot capacity, keeping spot and on-demand workloads on separate pools so you decide explicitly which workloads are spot-eligible, as shown in the following configuration example:

1. Configuration Example

# Dedicated NodePools keep spot and on-demand capacity separate
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default

---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ondemand-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default

Troubleshooting Guide

Even with proper configuration and monitoring, issues can arise when running Kubernetes on spot instances. The following section covers common problems you might encounter and provides step-by-step resolution procedures. Understanding these troubleshooting patterns will help you maintain system reliability and quickly resolve issues when they occur.

Common Issues and Solutions

1. Node Termination Handler Not Working

# Check NTH pods
kubectl get pods -n kube-system | grep termination-handler

# Verify IAM roles
aws iam get-role --role-name NodeTerminationHandlerRole

# Check NTH logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-node-termination-handler


2. Pod Scheduling Problems

# Check pod events
kubectl describe pod POD_NAME

# Verify node capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.cpu,ALLOCATABLE:.status.allocatable.cpu

# Check node affinity rules
kubectl get pod POD_NAME -o yaml | grep -A10 affinity

Conclusion

Running Kubernetes on AWS spot instances requires careful planning and robust operational practices. By following this guide's configurations and monitoring recommendations, organizations can achieve significant cost savings while maintaining reliability. Remember to:

1. Start with non-critical workloads
2. Implement comprehensive monitoring
3. Use pod disruption budgets
4. Maintain instance type diversity
5. Test failover scenarios regularly
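A pod disruption budget (point 3) limits how many replicas can be evicted at once while a spot node drains; a minimal example for a hypothetical spot-hosted service:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app   # hypothetical service label
```

Keep in mind that a PDB governs voluntary evictions during the drain; it cannot delay the hard reclaim once the two-minute notice expires.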

Additional Resources

- AWS Spot Instance Advisor

- EKS Workshop Spot Guide

- Karpenter Documentation

- AWS Node Termination Handler

Was this content helpful?

Thank you for submitting your feedback.
Oops! Something went wrong while submitting the form.

CONTENTS

Running Kubernetes on AWS Spot Instances: A Complete Guide for 2025

Published on
Last updated on

February 18, 2025

Max 3 min
Running Kubernetes on AWS Spot Instances: A Complete Guide for 2025

Introduction

Running Kubernetes clusters on spot instances offers substantial cost savings while introducing unique operational considerations. This guide provides a comprehensive approach to implementing and managing Kubernetes workloads on AWS spot instances, combining strategic planning with practical implementation details.

Understanding Spot Instance Economics

AWS spot instances offer discounts of up to 90% compared to on-demand pricing by utilizing excess AWS capacity. This significant cost reduction comes with the understanding that instances can be reclaimed with just two minutes' notice when AWS needs the capacity back.

Cost Analysis Example

Typical cluster configuration:
- Region: us-east-1
- Instance types: m5.large, m5a.large, m4.large
- Base cluster: 10 nodes
- Average monthly utilization: 80%

Monthly Cost Comparison:
On-Demand m5.large (10 nodes): $1,226.40
Spot Instance (mixed): $245.28 - $367.92
Potential monthly savings: $858.48 - $981.12

Note: Prices as of February 2025

Evaluating Workload Suitability

Not all Kubernetes workloads are suitable for spot instances. While the cost savings are attractive, the potential for instance interruption means you need to carefully evaluate which workloads can handle occasional disruption. Understanding these characteristics is crucial for successful spot instance implementation, as placing the wrong workload on spot instances can lead to service disruptions and reliability issues.

Characteristics of Spot-Suitable Workloads

1. Interruption Tolerance
  - Can handle occasional restarts
  - Maintain state externally or can recover state
  - Have reasonable startup times
  - Can be rescheduled to different nodes

2. Flexible Timing
  - Not strictly time-critical
  - Can retry failed operations
  - Have built-in fault tolerance

Recommended Types

Development and Testing
- CI/CD pipelines
- Development environments
- Load testing environments
- QA clusters
- Integration testing environments

Production Workloads
- Stateless web applications
- Background job processors
- Batch processing systems
- Data analysis workloads
- Machine learning training jobs
- Horizontally scalable services

Not Recommended for Spot
- Critical databases
- Payment processing systems
- Authentication services
- Session management services
- Single-instance stateful applications
- Single-replica deployments
- Applications with startup times exceeding 2 minutes
- Applications requiring guaranteed graceful shutdown
- Workloads with strict timing requirements
- Services that cannot tolerate occasional restarts

Implementation Guide

Successfully running Kubernetes on spot instances requires careful configuration of multiple components. Each component plays a crucial role in managing spot instances effectively, from handling interruptions to ensuring proper scaling. The following sections detail the essential configurations needed, walking through the setup of cluster components, node management tools, monitoring, and volume handling. Follow these implementations in order, as later configurations often depend on earlier ones.

1. Cluster Configuration

Basic EKS Configuration with Spot Support

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spot-cluster
  region: us-east-1
  version: "1.27"

managedNodeGroups:
  # On-demand node group for critical workloads
  - name: critical-workload-od
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    availabilityZones: ["us-east-1a", "us-east-1b"]
    iam:
      withAddonPolicies:
        autoScaler: true
        spotTermination: true
    labels:
      workload-type: critical
    
  # Spot instance node group
  - name: spot-workers
    instanceTypes: ["m5.large", "m5a.large", "m5d.large", "m4.large"]
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
    spot: true
    availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
    iam:
      withAddonPolicies:
        autoScaler: true
        spotTermination: true
    labels:
      workload-type: spot-eligible
    taints:
      - key: spot-instance
        value: "true"
        effect: PreferNoSchedule
    spotAllocationStrategy: capacity-optimized
    capacityRebalance: true


2. Karpenter Setup

Karpenter provides advanced node provisioning capabilities that are particularly valuable for spot instance management. Unlike traditional auto-scaling, Karpenter can rapidly respond to spot instance interruptions and maintain workload availability through intelligent node provisioning. The following configurations demonstrate how to set up Karpenter to effectively manage spot instances while maximizing cost savings and maintaining reliability.

NodePool Configuration

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-node-pool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h

EC2NodeClass Configuration

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  subnetSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  tags:
    karpenter.sh/discovery: "${CLUSTER_NAME}"

3. Node Termination Handler

The AWS Node Termination Handler is a critical component for managing spot instance lifecycles. It monitors for spot instance interruption notices and ensures workloads are gracefully drained before an instance is terminated. This handler is essential for maintaining application availability during spot instance reclamation. The following configuration sets up robust interruption handling with monitoring and notification capabilities.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  chart:
    spec:
      chart: aws-node-termination-handler
      sourceRef:
        kind: HelmRepository
        name: eks-charts
      version: "0.21.0"
  values:
    enableSpotInterruptionDraining: true
    enableRebalanceMonitoring: true
    enableScheduledEventDraining: true
    prometheusMetrics:
      enabled: true
    webhookURL: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    nodeSelector:
      kubernetes.io/os: linux
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
    tolerations:
      - operator: "Exists"

4. Monitoring Setup

Effective monitoring is essential when running Kubernetes on spot instances. Without proper monitoring, you might miss critical events like imminent instance terminations or capacity constraints. The following monitoring configuration provides early warning systems for spot instance interruptions, price spikes, and capacity issues, allowing you to take proactive measures before problems affect your applications.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-monitoring
spec:
  groups:
  - name: spot.rules
    rules:
    - alert: SpotTerminationImmediate
      expr: aws_spot_termination_notice < 120
      labels:
        severity: critical
      annotations:
        description: "Spot instance {{ $labels.instance }} will be terminated in {{ $value }} seconds"
    
    - alert: SpotPriceSpike
      expr: avg_over_time(aws_spot_price[1h]) > 
            avg_over_time(aws_spot_price[24h]) * 1.5
      labels:
        severity: warning
      annotations:
        description: "Spot price spike detected for instance type {{ $labels.instance_type }}"

5. Volume Management

Managing persistent storage with spot instances presents unique challenges. When instances are interrupted, you need to ensure that your persistent volumes can be properly detached and reattached to new nodes, potentially in different availability zones. The following configurations demonstrate how to set up resilient storage solutions that can handle spot instance interruptions while maintaining data accessibility.

Zone-Aware StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zone-aware-stateful
spec:
  replicas: 3
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ebs-claim-us-east-1a


Multi-AZ StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc-multi-az
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: ext4
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c

Testing and Validation

Testing your spot instance configuration is crucial for ensuring your applications can handle interruptions gracefully. Simply setting up the components isn't enough; you need to verify that your system responds correctly to spot instance reclamation. The following section provides tools and procedures for testing your spot instance setup, including simulated interruptions and monitoring practices.

1. AWS Fault Injection Simulator Configuration

{
    "description": "Spot Instance Interruption Test",
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
    "targets": {
        "SpotInstances": {
            "resourceType": "aws:ec2:spot-instance",
            "selectionMode": "ALL",
            "resourceTags": {
                "kubernetes.io/cluster/your-cluster-name": "owned"
            }
        }
    },
    "actions": {
        "SpotInterrupt": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "parameters": {
                "durationBeforeInterruption": "PT2M"
            },
            "targets": {
                "SpotInstances": "SpotInstances"
            }
        }
    },
    "stopConditions": [{
        "source": "none"
    }]
}

Save this as spot-test.json, create the template with aws fis create-experiment-template --cli-input-json file://spot-test.json, then launch it with aws fis start-experiment. The roleArn and account number are placeholders; substitute an IAM role that grants FIS permission to send spot interruptions.

2. Monitoring Commands

# Watch node status
kubectl get nodes -l karpenter.sh/capacity-type=spot --watch

# Monitor pod migrations
kubectl get pods -o wide --watch

# Check termination handler logs
kubectl logs -n kube-system -l app=aws-node-termination-handler

Real-World Case Studies

Understanding how organizations successfully implement Kubernetes on spot instances provides valuable insights for your own implementation. Here we examine two major companies that have achieved significant success with spot instances.

Case Study 1: Delivery Hero's Global Scale Implementation

Delivery Hero, one of the world's largest food delivery networks, successfully transitioned their entire Kubernetes infrastructure to spot instances, demonstrating that spot instances can work at massive scale.

Implementation Approach

- Complete transition to spot instances within 6 months
- 90% of Kubernetes workloads running on Amazon EKS
- Focus on application resilience and high availability

Technical Strategy

1. Resilience Improvements

  - Multiple instance redundancy
  - Graceful termination scripts
  - Production-ready checklists
  - Termination notice handlers
  - De-scheduler implementation

2. Results

- 70% reduction in infrastructure costs
- Successfully handling 4x-5x traffic spikes
- Managing 390 different applications across 43 countries
- Improved focus on business innovation
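The termination notice handlers mentioned above follow a common pattern: watch the EC2 instance metadata endpoint for a spot interruption notice, then drain the node before the two-minute deadline. A minimal sketch of the parsing logic is shown below; the metadata URL is the standard EC2 endpoint, but the function names are illustrative, and in production the AWS Node Termination Handler covers this for you.

```python
import json
from datetime import datetime, timezone

# Standard EC2 instance metadata path for spot interruption notices.
# Returns 404 until AWS schedules a reclaim, then a small JSON document.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_instance_action(body: str):
    """Parse a spot instance-action document and return (action, deadline).

    The document looks like:
      {"action": "terminate", "time": "2025-02-18T12:00:00Z"}
    """
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return doc["action"], deadline


def seconds_until(deadline, now):
    """Seconds remaining before AWS reclaims the instance (never negative)."""
    return max(0.0, (deadline - now).total_seconds())
```

A real handler would poll METADATA_URL every few seconds and, on a 200 response, cordon and drain the node so pods reschedule within the notice window.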

Case Study 2: ITV's Broadcast Platform Transformation

ITV, the UK's largest commercial broadcaster, implemented spot instances to handle growing viewership while optimizing costs during the pandemic.

Implementation Highlights

1. Migration Strategy

  - 18-month phased migration
  - 75% workload migration to EKS
  - Incremental spot instance adoption

2. Results

- 60% cost reduction compared to on-demand
- $150,000 annual compute savings
- Deployment time reduced from 40 to 4 minutes
- Increased spot usage from 9% to 24%

Key Lessons from Both Organizations

1. Implementation Strategy

  - Start with non-critical workloads
  - Create comprehensive checklists
  - Focus on application resilience
  - Implement proper monitoring

2. Technical Considerations

  - Use mixed instance types
  - Implement robust auto-scaling
  - Focus on graceful termination
  - Maintain redundancy

3. Operational Best Practices

  - Regular review of spot usage
  - Continuous optimization
  - Strong monitoring practices
  - Clear incident response procedures

4. Success Factors

  - Clear migration strategy
  - Focus on application resilience
  - Proper auto-scaling implementation
  - Comprehensive monitoring
  - Regular optimization reviews
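Several of the technical considerations above, graceful termination and maintained redundancy in particular, can be enforced declaratively with a PodDisruptionBudget, which caps how many replicas a node drain may evict at once. A minimal sketch, where the app label my-app is illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas serving during node drains
  selector:
    matchLabels:
      app: my-app
```

With this in place, a spot interruption that drains a node will evict pods only as fast as the budget allows, assuming the deployment runs enough replicas to satisfy minAvailable.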

These case studies demonstrate that with proper planning and implementation, spot instances can successfully support large-scale production workloads while delivering significant cost savings, regardless of industry or scale.

Operational Best Practices

Successfully running Kubernetes on spot instances requires more than initial setup; it demands ongoing operational excellence. These best practices are drawn from real-world experience running production workloads on spot instances. Following them will help you maintain reliability while maximizing cost savings.

1. Instance Type Selection

Choosing the right mix of instance types is a critical success factor for spot instance implementations. There are three main approaches to instance type selection, each offering different levels of operational efficiency:

Manual Selection

Traditional hands-on approach using AWS tools to analyze historical pricing and availability data across different instance types and availability zones. This method requires regular monitoring and manual adjustments but provides full control over instance selection:

# Check spot pricing history
aws ec2 describe-spot-price-history \
    --instance-types m4.xlarge m5.xlarge m6a.xlarge \
    --product-descriptions "Linux/UNIX" \
    --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,InstanceType]' \
    --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
    --output table

Automated Selection

Tools like Karpenter and AWS Auto Scaling can automatically select instance types based on predefined rules and policies. This approach reduces manual intervention while maintaining control through configuration:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m6g", "c5", "c5a", "r5", "r5a"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]

Autonomous Selection

Autonomous cloud optimization tools like Sedai can evaluate the full range of AWS instance types and their costs, then either recommend (in Copilot mode) or implement (in Autopilot mode) optimal instance selections. These systems use machine learning to:

  • Analyze historical usage patterns
  • Predict future capacity needs
  • Consider multiple factors including cost, availability, and performance
  • Automatically adjust instance selection based on real-time conditions
  • Provide proactive recommendations for optimization
  • Implement changes automatically while maintaining safety guardrails

2. Capacity Monitoring and Management

Similar to instance selection, capacity monitoring and management can be approached at different levels of automation:

Manual Monitoring

Basic Prometheus rules for alerting on capacity issues, requiring manual intervention when problems are detected:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-capacity
spec:
  groups:
  - name: spot.capacity
    rules:
    - alert: SpotCapacityLow
      expr: karpenter_nodes_capacity{capacity_type="spot"} / karpenter_nodes_desired{capacity_type="spot"} < 0.7
      for: 15m
      labels:
        severity: warning
      annotations:
        description: "Spot capacity is below 70% of desired capacity"

Automated Monitoring and Management

Cluster Autoscaler or Karpenter configurations that automatically respond to capacity changes:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000
    memory: 1000Gi
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

Autonomous Monitoring and Management

Advanced autonomous platforms can:

  • Predictively scale capacity based on historical patterns
  • Automatically balance workloads across instance types
  • Proactively migrate workloads before capacity issues occur
  • Optimize capacity allocation across multiple dimensions (cost, performance, reliability)
  • Self-tune monitoring thresholds based on application behavior
  • Automatically implement corrective actions while maintaining service levels
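To make "predictively scale capacity based on historical patterns" concrete, here is a deliberately simple, toy illustration (not any vendor's actual algorithm): an exponentially weighted moving average over recent hourly node counts, plus a headroom factor, yields a desired capacity target. All function names and parameters are illustrative.

```python
import math


def ewma_forecast(history, alpha=0.3):
    """Return an exponentially weighted moving average over `history`,
    usable as a naive forecast of the next value."""
    if not history:
        raise ValueError("history must be non-empty")
    forecast = float(history[0])
    for observed in history[1:]:
        # Newer observations get weight alpha; older state decays.
        forecast = alpha * observed + (1 - alpha) * forecast
    return forecast


def desired_capacity(history, headroom=1.2):
    """Forecast plus a safety margin, rounded up to whole nodes."""
    return math.ceil(ewma_forecast(history) * headroom)
```

Production systems weigh far more signals (spot interruption rates, pod pending times, pricing), but the shape is the same: forecast demand, add headroom, reconcile the node count toward the target.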

EKS Auto Mode Considerations

EKS Auto Mode simplifies Kubernetes node management but requires careful consideration when using spot instances. While Auto Mode supports spot instances, it requires separate node groups for spot and on-demand workloads; you cannot mix capacity types within the same node group. This means you'll need to explicitly define which workloads run on spot instances by creating dedicated spot node groups, as shown in the following configuration example:

1. Configuration Example

# Separate node groups required for spot and on-demand
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeGroup
metadata:
  name: spot-workloads
spec:
  clusterName: my-cluster
  nodeRole: arn:aws:iam::111122223333:role/eks-node-group-role
  capacityType: SPOT
  autoScaling:
    minSize: 1
    maxSize: 10
  instanceTypes: AUTO

---
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeGroup
metadata:
  name: ondemand-workloads
spec:
  clusterName: my-cluster
  nodeRole: arn:aws:iam::111122223333:role/eks-node-group-role
  capacityType: ON_DEMAND
  autoScaling:
    minSize: 1
    maxSize: 5
  instanceTypes: AUTO
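With the node groups separated, workloads opt into spot capacity through ordinary Kubernetes scheduling. A sketch of a deployment pinned to spot nodes via the eks.amazonaws.com/capacityType label that EKS applies to managed nodes (the deployment name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker         # illustrative interruption-tolerant workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT   # schedule only onto spot nodes
      containers:
      - name: worker
        image: public.ecr.aws/docker/library/busybox:latest
        command: ["sleep", "3600"]
```

Critical services would use the same pattern with ON_DEMAND, keeping the two classes of workload cleanly separated.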

Troubleshooting Guide

Even with proper configuration and monitoring, issues can arise when running Kubernetes on spot instances. The following section covers common problems you might encounter and provides step-by-step resolution procedures. Understanding these troubleshooting patterns will help you maintain system reliability and quickly resolve issues when they occur.

Common Issues and Solutions

1. Node Termination Handler Not Working

# Check NTH pods
kubectl get pods -n kube-system | grep termination-handler

# Verify IAM roles
aws iam get-role --role-name NodeTerminationHandlerRole

# Check NTH logs
kubectl logs -n kube-system -l app=aws-node-termination-handler


2. Pod Scheduling Problems

# Check pod events
kubectl describe pod POD_NAME

# Verify node capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.cpu,ALLOCATABLE:.status.allocatable.cpu

# Check node affinity rules
kubectl get pod POD_NAME -o yaml | grep -A10 affinity

Conclusion

Running Kubernetes on AWS spot instances requires careful planning and robust operational practices. By following this guide's configurations and monitoring recommendations, organizations can achieve significant cost savings while maintaining reliability. Remember to:

1. Start with non-critical workloads
2. Implement comprehensive monitoring
3. Use pod disruption budgets
4. Maintain instance type diversity
5. Test failover scenarios regularly

Additional Resources

- AWS Spot Instance Advisor

- EKS Workshop Spot Guide

- Karpenter Documentation

- AWS Node Termination Handler
