Every year Datadog publishes a report on container usage; we'll focus on the November 2023 edition. Datadog found that 65% of containers waste at least 50% of their requested CPU and memory. In rough terms, half of all provisioned container resources are wasted. So if two billion containers are spun up each week, almost a billion containers' worth of compute is wasted, according to the report.
Similar reports from other providers corroborate this staggering level of overprovisioning; most also find waste of at least 50% of CPU and memory. So the industry has a major problem: everybody overprovisions their compute.
Why do teams overprovision? Most simply don't know what to provision for, so they overestimate. The logic runs: “Let me provision double or triple the capacity I think I need so that things are OK when I hit my seasonality peaks and my special use cases.”
The three most common situations we see driving overprovisioning are:
Below are the top priorities of FinOps professionals in the 2024 State of FinOps survey. Reducing waste from overprovisioning is the #1 priority, closely followed by managing commitment-based discounts.
Successful discount management can deliver double-digit reductions in spend, but it is hard to get right: you have to avoid overcommitting, both to an overall spend amount and to specific resource types.
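To see why overcommitting is risky, here's a simplified sketch of the commitment arithmetic. The 30% discount rate and the dollar figures are illustrative assumptions, not actual AWS pricing:

```python
# Simplified commitment-discount arithmetic. You commit to pay a fixed
# $/hour, which covers commit / (1 - discount) dollars of on-demand-
# equivalent usage; unused commitment is still paid for.
def hourly_cost(usage_od, commit, discount=0.30):
    covered = commit / (1 - discount)        # on-demand $ the commitment covers
    overflow = max(0.0, usage_od - covered)  # usage beyond the commitment
    return commit + overflow                 # the commitment is paid regardless

for usage in (100, 80, 60):  # on-demand-equivalent $/hour
    cost = hourly_cost(usage, commit=70)
    print(f"usage ${usage}/hr -> pay ${cost:.0f}/hr "
          f"({(usage - cost) / usage:+.1%} vs on-demand)")
```

At full utilization the $70/hour commitment saves 30%; once usage drops to $60/hour of on-demand equivalent, the same commitment costs more than simply paying on-demand.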
Overprovisioning and discount management matter not only to engineering budgets but to company-wide financial performance. Below is a simplified Profit & Loss (P&L) statement for a public security SaaS company. For every $100 million in revenue, approximately 8.6% was spent on cloud costs, which is at the high end. If the company could cut 20-30% of that spend, bringing cloud costs down to roughly 6% of revenue, its bottom-line profit would increase by about 17%.
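Here's a minimal sketch of that arithmetic. The 15% baseline operating margin is an assumed figure for illustration; the actual margin would come from the P&L statement:

```python
# Sketch of the P&L arithmetic above. The 15% baseline operating margin
# is an assumption for illustration; substitute the real figure from
# the company's P&L.
revenue = 100_000_000             # $100M annual revenue
cloud_cost = 0.086 * revenue      # 8.6% of revenue spent on cloud
baseline_profit = 0.15 * revenue  # assumed 15% operating margin

savings = 0.30 * cloud_cost       # cut 30% of cloud spend
print(f"cloud cost: {(cloud_cost - savings) / revenue:.1%} of revenue")  # ~6.0%
print(f"profit lift: {savings / baseline_profit:.0%}")                   # ~17%
```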
Beyond cost, ECS optimization can also affect revenue through the performance and availability of applications, whenever ECS services play an important role in the end-customer experience. The need to avoid outages is widely understood, but performance slowdowns can hurt just as much as major outages. In an ECS context, latency is a silent killer: small delays across hundreds of microservices add up to a material impact on user experience.
The example below shows that, for a business unit with $100M in annual revenue, a 100ms slowdown sustained over the course of a year has an impact equivalent to an extended 88-hour outage.
This example assumes that a 100ms slowdown translates to 1% lost revenue, an assumption based on an early Amazon.com finding. Below is that finding, along with a series of others:
The overall importance and the timeframe of the impact will vary by business (e.g., ecommerce sees immediate drops in revenue, while SaaS impacts emerge more slowly, tied to contract cycles).
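As a sanity check on the 88-hour figure, here's the underlying arithmetic, assuming revenue accrues evenly across the year:

```python
# A year-long 100ms slowdown expressed as an equivalent outage,
# assuming revenue accrues evenly across the year.
annual_revenue = 100_000_000           # $100M business unit
lost_revenue = 0.01 * annual_revenue   # 100ms slowdown ~ 1% of revenue

revenue_per_hour = annual_revenue / (365 * 24)   # ~$11,400/hour
print(f"equivalent outage: {lost_revenue / revenue_per_hour:.0f} hours")  # ~88
```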
Let's jump into some of the challenges that ECS users face:
Optimizing each service with respect to all three objectives (cost, performance, and availability) can be challenging. In this guide we will use SLOs to break the problem down: minimize cost, subject to meeting performance and availability targets. We'll also look at whether some workloads can tolerate lower availability thresholds, which opens the door to lower-cost spot capacity.
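One way to picture this framing: among candidate configurations, discard any that would violate the latency or availability SLOs, then take the cheapest of what remains. A minimal sketch, where the Candidate fields and the numbers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Hypothetical fields: a configuration and its predicted behavior.
    name: str
    monthly_cost: float    # $
    p99_latency_ms: float  # predicted p99 latency
    availability: float    # predicted availability, e.g. 0.999

def cheapest_within_slo(candidates, latency_slo_ms, availability_slo):
    """Minimize cost subject to performance and availability SLOs."""
    feasible = [c for c in candidates
                if c.p99_latency_ms <= latency_slo_ms
                and c.availability >= availability_slo]
    if not feasible:
        raise ValueError("no configuration meets the SLOs; revisit targets")
    return min(feasible, key=lambda c: c.monthly_cost)

candidates = [
    Candidate("2 vCPU / 4 GB, on-demand", 310.0, 120.0, 0.9995),
    Candidate("1 vCPU / 2 GB, on-demand", 160.0, 180.0, 0.9995),
    Candidate("1 vCPU / 2 GB, spot",       60.0, 180.0, 0.9970),
]
print(cheapest_within_slo(candidates, latency_slo_ms=200, availability_slo=0.999))
```

Note how the spot candidate is by far the cheapest but only becomes eligible if the service can live with a lower availability SLO, which is exactly the trade-off described above.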
Amazon provides multiple controls to optimize Amazon ECS for cost and performance:
Configuring all these controls to meet our goals, and continually adjusting them, can be challenging. Later in the guide we'll walk through solutions to this challenge, including manual, automated, and autonomous approaches.
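To make one of these controls concrete, here's a sketch of right-sizing a task's CPU and memory with boto3. The family name, image, and role ARN are placeholders, and task size is only one of the controls involved:

```python
import boto3

ecs = boto3.client("ecs")

# Registering a new task definition revision is how task size changes;
# the values below right-size a task down to 0.5 vCPU / 1 GiB.
ecs.register_task_definition(
    family="checkout-service",  # placeholder service name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",      # CPU units: 512 = 0.5 vCPU
    memory="1024",  # MiB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "app",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/checkout:latest",
        "essential": True,
    }],
)
# The service must then be pointed at the new revision, e.g. with
# ecs.update_service(cluster=..., service=..., taskDefinition=...).
```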
We also need to look at how we're performing: how the various settings on these controls behave under different application inputs, which include:
Later in the guide we'll look at how adaptation is possible under manual, automated, and autonomous approaches.
To relate these inputs and controls to our goals, we then need to look at a series of metrics, including:
We now have all these complex combinations to consider just to optimize a single application for the best resource utilization, and many organizations have thousands of applications to manage, so the complexity grows rapidly. Managing 100 microservices released weekly means analyzing 7,200 combinations of metrics, as shown below: 100 services × 6 performance metrics × 3 traffic patterns × 4 releases per month. Optimizing each service is a complex task that requires careful analysis and monitoring to ensure smooth operation.
Now imagine optimizing a fleet of 2,000 microservices instead of 100: the same arithmetic yields 144,000 combinations. Continuously optimizing a system of that size, day after day, is practically impossible for any human team, which highlights the complexity of ongoing cost and performance optimization.
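As a quick sanity check on those counts (three traffic patterns is inferred here, since that's what makes the stated 7,200 total work out):

```python
# Monthly metric combinations per the example above. Three traffic
# patterns is inferred from the stated 7,200 total.
def monthly_combinations(services, metrics=6, traffic_patterns=3, releases=4):
    return services * metrics * traffic_patterns * releases

print(monthly_combinations(100))    # 7,200
print(monthly_combinations(2_000))  # 144,000
```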