November 25, 2024
April 14, 2024
Optimize compute, storage and data
Choose copilot or autopilot execution
Continuously improve with reinforcement learning
In this post, we will cover how to master Amazon ECS optimization using autonomous techniques. The autonomous approach can help with both ECS cost optimization and performance optimization. I'm indebted to two amazing technologists for helping create the technical content below. First, S. Meenakshi, a staff engineer leading the ECS track at Sedai. I'd also like to thank Nate Singletary, a senior site reliability engineer on the platform team at KnowBe4. We'll go through an overview of Amazon ECS and cover some of the optimization challenges. We'll then walk through the autonomous solution, and summarize KnowBe4's autonomous journey.
Let's start with Amazon compute models. Amazon operates a shared responsibility model: as the cloud service provider, AWS handles much of the management for you, and expects you, the user, to manage certain elements. Exactly which elements varies by compute model. Let's look at four Amazon compute models, ordered from most to least user effort:
In the table below you can see what AWS manages and what you are expected to manage in each of the four models:
So if you look across all these compute models, they all expect development teams to still manage a number of tasks. Questions you are faced with can include:
So there are a lot of challenges that application teams face when they manage their applications.
ECS has three layers as shown below:
The unit of compute in Amazon ECS is the task. Typically, you define a task for each application. Each task can have one or more containers: usually an app container, often accompanied by a monitoring or logging agent, though you might also run a single-container application.
Tasks are deployed as services that can be horizontally scaled. If a service is EC2-based, you deploy it on a cluster so it has sufficient EC2 capacity; with Fargate it's simpler, and you just use Fargate resources.
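To make this concrete, here is a minimal sketch of registering a task definition with two containers using boto3. This is an illustration only: the family name, image URIs, account ID, and role ARN are all hypothetical.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical Fargate task: an app container plus a log-routing sidecar.
response = ecs.register_task_definition(
    family="orders-api",  # hypothetical service name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",      # CPU units for the whole task (0.5 vCPU)
    memory="1024",  # MiB for the whole task
    # Fargate needs an execution role to pull images and write logs (hypothetical ARN).
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:latest",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        },
        {
            # Optional logging agent running alongside the app container.
            "name": "log-router",
            "image": "amazon/aws-for-fluent-bit:stable",
            "essential": False,
        },
    ],
)
print(response["taskDefinition"]["taskDefinitionArn"])
```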
ECS is widely used. If you look at the numbers previously shared by Amazon at re:Invent:
You may also be an end user of Amazon ECS as a consumer - many video streaming providers, and Amazon.com's ecommerce site itself, run on ECS.
Let's jump into some of the challenges that ECS users face.
Every year Datadog publishes a report on container usage. The one we'll focus on here is from November 2023. Datadog found that 65% of containers use less than half of their allocated CPU and memory; in other words, roughly half of all container resources go to waste. So if two billion containers are spun up each week, the report implies nearly a billion containers' worth of compute resources are wasted. We have seen similar findings from other providers, which we have discussed in other Sedai presentations. This is a staggering number.
So the industry has a major problem - everybody overprovisions their compute.
Why do teams overprovision? In the larger picture, they don't know what to provision for. And they simply overestimate their needs to account for that uncertainty. The logic is “Let me put double or triple the capacity I think I need so that things are OK when I hit my seasonality peaks and my special use cases”.
The three most common situations we see driving overprovisioning are:
There are four levers that Amazon provides to optimize Amazon ECS for cost optimization and performance optimization:
Even though you have these levers, or controls, how do you know what to adjust? There are several factors to consider.
We need to think about our goals or objective function, which we can define as minimizing cost, subject to meeting performance and availability needs.
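One way to write that objective function down (our own shorthand, not a formal Sedai notation):

$$
\min_{c \in \mathcal{C}} \ \mathrm{Cost}(c)
\quad \text{subject to} \quad
\mathrm{Latency}_{p99}(c) \le L_{\mathrm{SLO}}, \quad
\mathrm{Availability}(c) \ge A_{\mathrm{target}}
$$

where $c$ ranges over the candidate configurations $\mathcal{C}$ (CPU, memory, task count, and scaling settings), $L_{\mathrm{SLO}}$ is the latency objective, and $A_{\mathrm{target}}$ is the required availability.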
We also need to look at how various settings on these controls perform with different application inputs, which include:
To then relate these inputs and controls to our goals, you need to look at a series of metrics including:
We now have all these complex combinations to look at, even to optimize only one application for the best resource utilization. Many organizations have thousands of applications to manage, so the complexity grows significantly. Managing 100 microservices that are released weekly involves analyzing 7,200 combinations of metrics, as shown below: 100 services, six performance metrics, three traffic patterns, and four monthly releases (100 × 6 × 3 × 4 = 7,200). It is a complex task that requires careful analysis and monitoring to ensure smooth operation.
Now imagine optimizing a fleet of 2,000 microservices rather than 100. The number of combinations is mind-boggling, and continuously optimizing a system of that size day after day is practically impossible for any human team. This is the core challenge of ongoing cost and performance optimization.
Before we dive deeper into ECS optimization, let us consider the benefits of having an autonomous system. An autonomous system is like having your own operations copilot or autopilot that helps you manage your production environments, and does so in a safe and hassle-free manner.
Let's compare an autonomous system to the traditional automated systems that are widely deployed today. Autonomous systems can undertake the following steps on their own, guided by the goals given by the user:
Automation comes with challenges because it involves a lot of manual configuration: you have to set the thresholds yourself and decide which metrics to monitor. An autonomous system, by contrast, continuously studies the behavior of the application and adapts accordingly. In the table below we contrast how the key activities involved in optimizing ECS for cost, performance, and availability differ between the automated and autonomous approaches.
When optimizing a service, what we really care about is our goals: are we looking to improve performance, reduce cost, or both?
The advantage of autonomous systems is that we just need to let the system know our goals, and then sit back and let the system work its magic.
So let’s now look at the various ways we can optimize services.
Performing these actions in the right order is key to running these services as optimally as possible from both a cost and a performance perspective.
We’re now going to look at how Sedai optimizes Amazon ECS cost & performance. Let’s review again the four levers for ECS optimization:
We want to look at all these options and come up with the ideal combination to give you the maximum savings. Let’s look at each and see how Sedai’s autonomous optimization approach progressively provides potential cost gains.
Let's take a look at rightsizing your ECS services. We want to identify the best possible configuration for your CPU, memory, and number of tasks, so we have to consider factors like application releases and the traffic the service encounters:
Sedai looks at metrics including CPU, memory, and traffic over a long period of time and comes up with the ideal configuration. Sedai is aware of the application behavior and learns the traffic patterns. And through reinforcement learning, Sedai keeps fine-tuning this configuration, validating and modifying it after every release.
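To give a feel for the core idea (Sedai's actual models use reinforcement learning over far more signals than this), here is a toy sketch of sizing to a high percentile of observed usage plus headroom. The sizes, headroom factor, and sample data are all illustrative:

```python
# Common Fargate task CPU sizes, in CPU units (1024 = 1 vCPU).
FARGATE_CPU_UNITS = [256, 512, 1024, 2048, 4096]

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = round(p / 100 * (len(ordered) - 1))
    return ordered[rank]

def recommend_cpu(cpu_samples, headroom=1.3):
    """Smallest task CPU size covering the 95th-percentile usage plus headroom."""
    target = percentile(cpu_samples, 95) * headroom
    for size in FARGATE_CPU_UNITS:
        if size >= target:
            return size
    return FARGATE_CPU_UNITS[-1]

# Hypothetical observed CPU usage in CPU units over a sampling window.
samples = [310, 350, 290, 420, 380, 330, 400, 360]
print(recommend_cpu(samples))  # p95 = 420, * 1.3 = 546 -> recommends 1024
```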
In the example below, we can see that we have reduced the CPU, increased the memory, and decreased the task count as well. So with just this rightsizing we were able to achieve cost savings of 43%, and the service also ran 25% faster:
With rightsizing, keep in mind that we size with respect to typical traffic. To respond to the peaks and valleys of application traffic, we'll need ECS service auto scaling, one particular type of auto scaling.
Auto scaling in ECS operates at two levels: the service level and the cluster level. Service auto scaling dynamically adjusts the task count; cluster auto scaling dynamically adjusts the container instance count.
Adding autoscalers to a service increases its availability and enables it to handle requests even during peak traffic. The benefits also show up in cost and performance: you don't have to provision for peak traffic, which saves money, and scaling out keeps performance steady during traffic peaks.
When configuring autoscalers, we need to choose the metric to scale against and the threshold associated with it. Sedai determines this by analyzing your application, such as whether it is CPU bound or memory bound. The metrics that can be used include CPU, memory, request count, or even custom application metrics like queue size.
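For reference, this is roughly how a target-tracking scaling policy on average CPU utilization is attached to an ECS service through AWS's Application Auto Scaling API (boto3). The cluster and service names, capacities, target value, and cooldowns are illustrative:

```python
import boto3

aas = boto3.client("application-autoscaling")

resource_id = "service/my-cluster/my-service"  # hypothetical cluster/service

# Register the service's desired task count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,
    MaxCapacity=8,
)

# Target tracking: add or remove tasks to keep average CPU near 50%.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # seconds to wait between scale-outs
        "ScaleInCooldown": 120,   # scale in more conservatively
    },
)
```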
In the example below, we see a service running on c5a.xlarge instances, with eight instances provisioned for peak traffic. After optimization, with an autoscaler in place, the number of instances needed for typical traffic drops: the desired count is 4 and the max count is 8, so the service can still scale up to 8 instances when needed to handle requests.
For workloads with predictable variation in traffic, scheduled scaling will scale capacity at predefined times.
Another ECS cost-saving opportunity is development and pre-production clusters. These don't need to run around the clock because they aren't used around the clock; they can be scaled down when not in use. To do this we can adopt schedules, and Sedai can then manage the shutdown and startup autonomously.
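One direct way to implement such schedules is with Application Auto Scaling scheduled actions. A sketch, assuming the scalable target is already registered, with hypothetical resource names and hours:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical dev service: zero tasks overnight, back up each weekday morning.
schedules = [
    ("scale-down-evening", "cron(0 19 ? * MON-FRI *)", 0, 0),
    ("scale-up-morning",   "cron(0 7 ? * MON-FRI *)",  2, 4),
]
for name, cron, min_cap, max_cap in schedules:
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ScheduledActionName=name,
        ResourceId="service/dev-cluster/dev-service",
        ScalableDimension="ecs:service:DesiredCount",
        Schedule=cron,
        ScalableTargetAction={"MinCapacity": min_cap, "MaxCapacity": max_cap},
    )
```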
Another option is to use Spot capacity. By taking into account factors such as startup time and the nature of the application (whether it's stateful or stateless), Sedai identifies whether a service can run on Spot, letting you capture the discounts AWS offers. Spot is best suited to fault-tolerant, non-critical, stateless workloads.
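On Fargate, running a service on Spot can be expressed as a capacity provider strategy. A sketch with hypothetical names that keeps one task on on-demand Fargate as a baseline and places the rest on FARGATE_SPOT:

```python
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",            # hypothetical
    serviceName="worker-service",    # hypothetical stateless worker
    taskDefinition="worker-task",
    desiredCount=4,
    capacityProviderStrategy=[
        # Keep at least one task on regular Fargate for stability.
        {"capacityProvider": "FARGATE", "base": 1, "weight": 1},
        # Place the remaining tasks on Spot for the discount.
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
            "assignPublicIp": "DISABLED",
        }
    },
)
```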
Let’s take a look at an example, shown below. So in this example, we can see that we have decreased the memory from 10 GB to 4 GB and the CPU from 4 to 2. And just by this rightsizing, we were able to achieve a cost saving of about 52%.
And if we move this service onto Spot as well, we can save an additional 28% on top of that.
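As a sanity check on the rightsizing figure, using representative on-demand Fargate rates (roughly $0.04048 per vCPU-hour and $0.004445 per GB-hour in us-east-1; actual prices vary by region):

$$
\begin{aligned}
\text{cost}_{\text{before}} &= 4 \times 0.04048 + 10 \times 0.004445 \approx \$0.206/\text{hour} \\
\text{cost}_{\text{after}} &= 2 \times 0.04048 + 4 \times 0.004445 \approx \$0.099/\text{hour} \\
\text{saving} &= 1 - 0.099 / 0.206 \approx 52\%
\end{aligned}
$$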
So by rightsizing your services, adding autoscalers at the service and cluster level, and considering Spot, you can ensure that your ECS services always run at the maximum possible savings.
After performing all these steps, Sedai doesn't just leave ECS workloads at that initial configuration. It continuously optimizes the service to ensure that it's always aligned with the goal that the user has set.
KnowBe4 is the leading provider of security-awareness training and simulated phishing platforms used by over 34,000 organizations globally. KnowBe4 faced an optimization challenge with their Amazon ECS services, leading them to adopt Sedai's autonomous optimization to reduce toil for engineers and improve efficiency. KnowBe4 implemented a three-part approach (Crawl, Walk, Run) to gradually adopt autonomous optimization using Sedai, resulting in significant cost savings and performance gains. KnowBe4 now has 98% of their services running autonomously, with a 27% cost reduction and over 1,100 autonomous actions taken by Sedai per quarter.
For more detail, check out the companion article covering how KnowBe4 implemented autonomous optimization for Amazon ECS.