November 25, 2024
April 14, 2024
Optimize compute, storage and data
Choose copilot or autopilot execution
Continuously improve with reinforcement learning
In this post, we will cover how to master Amazon ECS optimization using autonomous techniques. The autonomous approach can help with both ECS cost optimization and performance optimization. I'm indebted to two amazing technologists for helping create the technical content below. First, S. Meenakshi, a staff engineer leading the ECS track at Sedai. I'd also like to thank Nate Singletary, a senior site reliability engineer on the platform team at KnowBe4. We'll go through an overview of Amazon ECS and cover some of the optimization challenges. We'll then walk through the autonomous solution, and summarize KnowBe4's autonomous journey.
Let's start with Amazon compute models. Amazon operates a shared responsibility model: as the cloud service provider, AWS handles much of the management for you, and expects you, the user, to manage certain elements. Exactly which elements varies by compute model. Let's look at four Amazon compute models, ordered from most to least user effort:
In the table below you can see what AWS manages and what you are expected to manage in each of the four models:
So if you look across all these compute models, they all expect development teams to still manage a number of tasks. Questions you are faced with can include:
So there are a lot of challenges that application teams face when they manage their applications.
ECS has three layers as shown below:
The unit of compute in Amazon ECS is the task. Typically, you define a task for each application. Each task can have one or more containers: usually an app container, often accompanied by a monitoring or logging agent, though you might also run a single-container application.
Tasks are deployed as services that can be horizontally scaled. If a service is EC2-based, you deploy it on a cluster so it has sufficient EC2 capacity; with Fargate it's simpler, and you just use Fargate resources.
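To make this concrete, here is a minimal sketch of registering a task definition with two containers using boto3. This is an illustration only: the family name, image URIs, account ID, and role ARN are all hypothetical.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical Fargate task: an app container plus a log-routing sidecar.
response = ecs.register_task_definition(
    family="orders-api",  # hypothetical service name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",      # CPU units for the whole task (0.5 vCPU)
    memory="1024",  # MiB for the whole task
    # Fargate needs an execution role to pull images and write logs (hypothetical ARN).
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:latest",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        },
        {
            # Optional logging agent running alongside the app container.
            "name": "log-router",
            "image": "amazon/aws-for-fluent-bit:stable",
            "essential": False,
        },
    ],
)
print(response["taskDefinition"]["taskDefinitionArn"])
```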
ECS is widely used. If you look at the numbers previously shared by Amazon at re:Invent:
You may also be an end user of Amazon ECS as a consumer - many video streaming providers, and Amazon.com's ecommerce site itself, run on ECS.
Let's jump into some of the challenges that ECS users face.
Every year Datadog publishes a report on container usage. The one we'll focus on here is from November 2023. Datadog found that 65% of containers use less than half of their allocated CPU and memory; in other words, roughly half of all container resources go to waste. So if two billion containers are spun up each week, the report implies nearly a billion containers' worth of compute resources are wasted. We have seen similar findings from other providers, which we have discussed in other Sedai presentations. This is a staggering number.
So the industry has a major problem - everybody overprovisions their compute.
Why do teams overprovision? In the larger picture, they don't know what to provision for. And they simply overestimate their needs to account for that uncertainty. The logic is “Let me put double or triple the capacity I think I need so that things are OK when I hit my seasonality peaks and my special use cases”.
The three most common situations we see driving overprovisioning are:
There are four levers that Amazon provides to optimize Amazon ECS for cost optimization and performance optimization:
Even though you have these levers, or controls, how do you know what to adjust? There are several factors to consider.
We need to think about our goals or objective function, which we can define as minimizing cost, subject to meeting performance and availability needs.
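One way to write that objective function down (our own shorthand, not a formal Sedai notation):

$$
\min_{c \in \mathcal{C}} \ \mathrm{Cost}(c)
\quad \text{subject to} \quad
\mathrm{Latency}_{p99}(c) \le L_{\mathrm{SLO}}, \quad
\mathrm{Availability}(c) \ge A_{\mathrm{target}}
$$

where $c$ ranges over the candidate configurations $\mathcal{C}$ (CPU, memory, task count, and scaling settings), $L_{\mathrm{SLO}}$ is the latency objective, and $A_{\mathrm{target}}$ is the required availability.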
We also need to look at how various settings on these controls perform with different application inputs, which include:
To then relate these inputs and controls to our goals, you need to look at a series of metrics including:
We now have all these complex combinations to look at, even to optimize only one application for the best resource utilization. Many organizations have thousands of applications to manage, so the complexity grows significantly. Managing 100 microservices that are released weekly involves analyzing 7,200 combinations of metrics, as shown below: 100 services, six performance metrics, three traffic patterns, and four monthly releases (100 × 6 × 3 × 4 = 7,200). It is a complex task that requires careful analysis and monitoring to ensure smooth operation.
Now imagine optimizing a fleet of 2,000 microservices rather than 100. The number of combinations is mind-boggling, and continuously optimizing a system of that size day after day is practically impossible for any human team. This is the core challenge of ongoing cost and performance optimization.
Before we dive deeper into ECS optimization, let us consider the benefits of having an autonomous system. An autonomous system is like having your own operations copilot or autopilot that helps you manage your production environments, and does so in a safe and hassle-free manner.
Let's compare an autonomous system to the traditional automated systems that are widely deployed today. Autonomous systems can undertake the following steps on their own, guided by the goals given by the user:
Automation comes with challenges because it involves a lot of manual configuration: you have to set the thresholds yourself and decide which metrics to monitor. An autonomous system, by contrast, continuously studies the behavior of the application and adapts accordingly. In the table below we contrast how the key activities involved in optimizing ECS for cost, performance, and availability differ between the automated and autonomous approaches.
When optimizing a service, what we really care about is our goals: are we looking to improve performance, reduce cost, or both?
The advantage of autonomous systems is that we just need to let the system know our goals, and then sit back and let the system work its magic.
So let’s now look at the various ways we can optimize services.
Performing these actions in the right order is key to running these services as optimally as possible from both a cost and a performance perspective.
We’re now going to look at how Sedai optimizes Amazon ECS cost & performance. Let’s review again the four levers for ECS optimization:
We want to look at all these options and come up with the ideal combination to give you the maximum savings. Let’s look at each and see how Sedai’s autonomous optimization approach progressively provides potential cost gains.
Let's take a look at rightsizing your ECS services. We want to identify the best possible configuration for your CPU, memory, and number of tasks, so we have to consider factors like application releases and the traffic the service encounters:
Sedai looks at metrics including CPU, memory, and traffic over a long period of time and comes up with the ideal configuration. Sedai is aware of the application behavior and learns the traffic patterns. And through reinforcement learning, Sedai keeps fine-tuning this configuration, validating and modifying it after every release.
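To give a feel for the core idea (Sedai's actual models use reinforcement learning over far more signals than this), here is a toy sketch of sizing to a high percentile of observed usage plus headroom. The sizes, headroom factor, and sample data are all illustrative:

```python
# Common Fargate task CPU sizes, in CPU units (1024 = 1 vCPU).
FARGATE_CPU_UNITS = [256, 512, 1024, 2048, 4096]

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = round(p / 100 * (len(ordered) - 1))
    return ordered[rank]

def recommend_cpu(cpu_samples, headroom=1.3):
    """Smallest task CPU size covering the 95th-percentile usage plus headroom."""
    target = percentile(cpu_samples, 95) * headroom
    for size in FARGATE_CPU_UNITS:
        if size >= target:
            return size
    return FARGATE_CPU_UNITS[-1]

# Hypothetical observed CPU usage in CPU units over a sampling window.
samples = [310, 350, 290, 420, 380, 330, 400, 360]
print(recommend_cpu(samples))  # p95 = 420, * 1.3 = 546 -> recommends 1024
```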
In the example below, we can see that we have reduced the CPU, increased the memory, and decreased the task count as well. So with just this rightsizing we were able to achieve cost savings of 43%, and the service also ran 25% faster:
With rightsizing, keep in mind that we size with respect to typical traffic. To respond to the peaks and valleys of application traffic, we'll need ECS service auto scaling, one particular type of auto scaling.
Auto scaling in ECS operates at two levels: the service level and the cluster level. Service auto scaling dynamically adjusts the task count; cluster auto scaling dynamically adjusts the container instance count.
Adding autoscalers to a service increases its availability and enables it to handle requests even during peak traffic. The benefits also show up in cost and performance: you don't have to provision for peak traffic, which saves money, and scaling out keeps performance steady during traffic peaks.
When configuring autoscalers, we need to choose the metric to scale against and the threshold associated with it. Sedai determines this by analyzing your application, such as whether it is CPU bound or memory bound. The metrics that can be used include CPU, memory, request count, or even custom application metrics like queue size.
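For reference, this is roughly how a target-tracking scaling policy on average CPU utilization is attached to an ECS service through AWS's Application Auto Scaling API (boto3). The cluster and service names, capacities, target value, and cooldowns are illustrative:

```python
import boto3

aas = boto3.client("application-autoscaling")

resource_id = "service/my-cluster/my-service"  # hypothetical cluster/service

# Register the service's desired task count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,
    MaxCapacity=8,
)

# Target tracking: add or remove tasks to keep average CPU near 50%.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # seconds to wait between scale-outs
        "ScaleInCooldown": 120,   # scale in more conservatively
    },
)
```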
In the example below, we see a service running on c5a.xlarge instances, with eight instances provisioned for peak traffic. After optimization, with an autoscaler in place, the number of instances needed for typical traffic drops: the desired count is 4 and the max count is 8, so the service can still scale up to 8 instances when needed to handle requests.
For workloads with predictable variation in traffic, scheduled scaling will scale capacity at predefined times.
Another ECS cost-saving opportunity is development and pre-production clusters. These don't need to run around the clock because they aren't used around the clock; they can be scaled down when not in use. To do this we can adopt schedules, and Sedai can then manage the shutdown and startup autonomously.
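One direct way to implement such schedules is with Application Auto Scaling scheduled actions. A sketch, assuming the scalable target is already registered, with hypothetical resource names and hours:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical dev service: zero tasks overnight, back up each weekday morning.
schedules = [
    ("scale-down-evening", "cron(0 19 ? * MON-FRI *)", 0, 0),
    ("scale-up-morning",   "cron(0 7 ? * MON-FRI *)",  2, 4),
]
for name, cron, min_cap, max_cap in schedules:
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ScheduledActionName=name,
        ResourceId="service/dev-cluster/dev-service",
        ScalableDimension="ecs:service:DesiredCount",
        Schedule=cron,
        ScalableTargetAction={"MinCapacity": min_cap, "MaxCapacity": max_cap},
    )
```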
Another option is to use Spot capacity. By taking into account factors such as startup time and the nature of the application (whether it's stateful or stateless), Sedai identifies whether a service can run on Spot, letting you capture the discounts AWS offers. Spot is best suited to fault-tolerant, non-critical, stateless workloads.
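On Fargate, running a service on Spot can be expressed as a capacity provider strategy. A sketch with hypothetical names that keeps one task on on-demand Fargate as a baseline and places the rest on FARGATE_SPOT:

```python
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",            # hypothetical
    serviceName="worker-service",    # hypothetical stateless worker
    taskDefinition="worker-task",
    desiredCount=4,
    capacityProviderStrategy=[
        # Keep at least one task on regular Fargate for stability.
        {"capacityProvider": "FARGATE", "base": 1, "weight": 1},
        # Place the remaining tasks on Spot for the discount.
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
            "assignPublicIp": "DISABLED",
        }
    },
)
```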
Let’s take a look at an example, shown below. So in this example, we can see that we have decreased the memory from 10 GB to 4 GB and the CPU from 4 to 2. And just by this rightsizing, we were able to achieve a cost saving of about 52%.
And if we move this service onto Spot as well, we can save an additional 28% on top of that.
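As a sanity check on the rightsizing figure, using representative on-demand Fargate rates (roughly $0.04048 per vCPU-hour and $0.004445 per GB-hour in us-east-1; actual prices vary by region):

$$
\begin{aligned}
\text{cost}_{\text{before}} &= 4 \times 0.04048 + 10 \times 0.004445 \approx \$0.206/\text{hour} \\
\text{cost}_{\text{after}} &= 2 \times 0.04048 + 4 \times 0.004445 \approx \$0.099/\text{hour} \\
\text{saving} &= 1 - 0.099 / 0.206 \approx 52\%
\end{aligned}
$$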
So by rightsizing your services, adding autoscalers at the service and cluster level, and considering Spot, you can ensure that your ECS services always run at the maximum possible savings.
After performing all these steps, Sedai doesn't just leave ECS workloads at that initial configuration. It continuously optimizes the service to ensure that it's always aligned with the goal that the user has set.
KnowBe4 is the leading provider of security-awareness training and simulated phishing platforms used by over 34,000 organizations globally. KnowBe4 faced an optimization challenge with their Amazon ECS services, leading them to adopt Sedai's autonomous optimization to reduce toil for engineers and improve efficiency. KnowBe4 implemented a three-part approach (Crawl, Walk, Run) to gradually adopt autonomous optimization using Sedai, resulting in significant cost savings and performance gains. KnowBe4 now has 98% of their services running autonomously, with a 27% cost reduction and over 1,100 autonomous actions taken by Sedai per quarter.
For more detail, check out the companion article covering how KnowBe4 implemented autonomous optimization for Amazon ECS.