Frequently Asked Questions

Amazon ECS Optimization Challenges & Cost Management

Why is overprovisioning such a significant issue in Amazon ECS environments?

Overprovisioning is a major problem in Amazon ECS because about 65% of containers waste at least 50% of their CPU and memory resources, according to Datadog's November 2023 container report. This means that if two billion containers are spun up each week, nearly a billion containers' worth of compute resources are wasted, leading to unnecessary cloud costs and inefficiency.

What are the main reasons teams overprovision resources in ECS?

Teams often overprovision ECS resources due to uncertainty about actual requirements. Developers and application owners prioritize availability and performance over cost, leading them to add extra CPU, memory, and replicas to reduce the risk of deployment failure. Organizations may also use standardized compute profiles across diverse applications, further contributing to overprovisioning.

How does overprovisioning in ECS impact a company's financial performance?

Overprovisioning increases cloud costs, which can significantly affect a company's bottom line. For example, if a company spends 8.6% of its revenue on cloud costs, reducing this by 20-30% through optimization could increase bottom-line profit by 17%. Efficient ECS resource management directly contributes to improved profitability.

What are the top priorities for FinOps professionals managing ECS costs?

According to the 2024 State of FinOps survey, the top priorities for FinOps professionals are reducing waste or unused resources (50.3%), managing commitment-based discounts (43.3%), and accurate forecasting of spend (41.3%). Overprovisioning and discount management are especially critical for controlling ECS costs.

How can managing discounts help reduce ECS spend?

Effective management of discounts, such as AWS Savings Plans and Reserved Instances, can deliver double-digit reductions in ECS spend. However, it requires careful planning to avoid overcommitting to overall spend amounts and specific resource types, making it a complex but valuable cost-saving strategy.

What is the impact of ECS performance and availability on business revenue?

ECS performance and availability directly affect business revenue. For example, a 100ms increase in latency can result in a 1% revenue loss, which, for a $100M business unit, is equivalent to an 88-hour outage. Performance slowdowns can have the same financial impact as major outages, especially in customer-facing applications.

What are the main challenges in optimizing Amazon ECS for cost and performance?

The main challenges include balancing multiple goals (cost, performance, availability), managing a variety of optimization controls (service and instance rightsizing, purchasing commitments, spot instances), and adapting to constantly changing inputs such as traffic patterns and application releases. The complexity increases with the number of microservices and frequent deployments.

What optimization controls does Amazon ECS provide?

Amazon ECS provides several optimization controls, including service rightsizing (adjusting CPU and memory for tasks/services), instance rightsizing (managing EC2-backed cluster instances and auto scaling groups), purchasing commitments (Savings Plans, Reserved Instances), and spot instances (for fault-tolerant or non-production workloads).

How do traffic and application releases affect ECS optimization?

Traffic patterns and application releases are constantly changing inputs that impact ECS optimization. Fluctuations in traffic (e.g., daily or weekly seasonality) and new application versions can alter resource requirements, cost, and performance, requiring continuous adjustment of ECS settings to maintain optimal operation.

Why is it difficult to optimize a large fleet of ECS microservices?

Optimizing a large fleet of ECS microservices is difficult due to the exponential growth in complexity. For example, managing 100 microservices with weekly releases involves analyzing thousands of metric combinations (CPU, memory, latency, traffic, etc.). Scaling this to thousands of services makes manual optimization nearly impossible, highlighting the need for automated or autonomous solutions.

What metrics are most important for ECS optimization?

The most important metrics for ECS optimization include CPU utilization, memory utilization, performance (latency), and availability (e.g., failed customer interaction rate). These metrics help teams balance cost, performance, and reliability when tuning ECS services.

How does latency impact user experience and business outcomes in ECS applications?

Latency has a direct impact on user experience and business outcomes. For example, a 100ms increase in latency can lead to a 1% revenue loss (Amazon.com finding), and a 500ms latency increase can decrease traffic by 20% (Google finding). Even small delays across microservices can add up, affecting conversion rates and overall revenue.

What are some industry findings on the impact of latency on traffic and revenue?

Industry findings include: Google found a 500ms latency increase decreases traffic by 20%; Amazon found a 100ms latency change drives a 1% revenue gain; Zalando found a 100ms improvement drives a 0.7% revenue gain; Akamai found a 100ms delay reduces conversion rates by 7%; Booking.com found a 30% latency increase costs more than 0.5% in conversion rate.

How do SLOs help in ECS optimization?

Service Level Objectives (SLOs) help break down the ECS optimization problem by allowing teams to minimize cost while ensuring performance and availability needs are met. SLOs provide clear targets for latency and availability, making it easier to balance trade-offs and optimize resources effectively.

What are the challenges of using spot instances in ECS optimization?

Spot instances offer cost savings but are best suited for fault-tolerant or non-production workloads, as they can be interrupted by AWS. Using spot instances in production requires applications to be stateless or able to handle interruptions without impacting user experience.

Why is manual ECS optimization difficult at scale?

Manual ECS optimization is difficult at scale because it requires continuous monitoring and adjustment of multiple metrics and controls across hundreds or thousands of microservices. The complexity and volume of data make it nearly impossible for humans to optimize efficiently without automation or autonomous solutions.

What are the most common ECS optimization goals?

The most common ECS optimization goals are minimizing cost, ensuring performance (meeting latency requirements), and maintaining availability (serving all requests reliably). Balancing these goals requires careful tuning of resources and continuous monitoring.

How does ECS optimization complexity increase with microservice count?

As the number of microservices increases, the number of metric combinations and optimization scenarios grows exponentially. For example, 100 microservices with weekly releases can result in analyzing over 7,200 metric combinations. Managing thousands of services makes manual optimization unmanageable, necessitating automated or autonomous approaches.

What is the relationship between ECS cost optimization and business profitability?

Effective ECS cost optimization can directly improve business profitability. For instance, reducing cloud costs from 8.6% to 6% of revenue can increase a company's bottom-line profit by 17%, demonstrating the financial importance of efficient ECS management.

How do standard compute profiles contribute to ECS overprovisioning?

Organizations often use standard or "t-shirt" compute profiles (e.g., 4 CPU & 8 GB memory) across many services, regardless of actual needs. This one-size-fits-all approach leads to overprovisioning and wasted resources, as different applications may have varying requirements.

Autonomous Cloud Optimization & Sedai Platform

What is Sedai and how does it help with ECS and cloud optimization?

Sedai is an autonomous cloud management platform that optimizes cloud resources for cost, performance, and availability using machine learning. It eliminates manual intervention, reduces cloud costs by up to 50%, improves performance by reducing latency by up to 75%, and proactively resolves issues before they impact users. Sedai supports AWS, Azure, GCP, and Kubernetes environments, making it suitable for ECS optimization challenges.

What are the key features of Sedai's autonomous cloud optimization platform?

Sedai's platform offers autonomous optimization, proactive issue resolution, full-stack cloud coverage, release intelligence, plug-and-play implementation, and enterprise-grade governance. It supports modes like Datapilot (observability), Copilot (one-click optimizations), and Autopilot (fully autonomous execution).

How does Sedai reduce cloud costs for ECS users?

Sedai reduces cloud costs by up to 50% through autonomous optimization, rightsizing workloads, and eliminating waste. It continuously analyzes resource usage and makes adjustments without manual intervention, ensuring efficient ECS operations.

What business impact can ECS users expect from using Sedai?

Businesses using Sedai can expect up to 50% cost savings, 75% latency reduction, 6X productivity gains, and up to 50% fewer failed customer interactions. For example, Palo Alto Networks saved $3.5 million and KnowBe4 achieved 50% cost savings in production.

How quickly can Sedai be implemented for ECS optimization?

Sedai offers a plug-and-play implementation that takes just 5 minutes for general use cases and up to 15 minutes for specific scenarios like AWS Lambda. The platform connects securely to cloud accounts using IAM, with no need for complex installations.

What integrations does Sedai support for ECS and cloud environments?

Sedai integrates with monitoring tools (CloudWatch, Prometheus, Datadog, Azure Monitor), Kubernetes autoscalers (HPA/VPA, Karpenter), IaC and CI/CD tools (GitLab, GitHub, Bitbucket, Terraform), ITSM (ServiceNow, Jira), notification tools (Slack, Microsoft Teams), and runbook automation platforms.

How does Sedai ensure security and compliance for ECS optimization?

Sedai is SOC 2 certified, demonstrating adherence to stringent security and compliance standards. This ensures that all optimizations and integrations are performed securely and meet industry requirements.

Who can benefit most from using Sedai for ECS optimization?

Sedai is designed for platform engineers, IT/cloud operations, technology leaders, site reliability engineers (SREs), and FinOps professionals in organizations with significant cloud operations. It is especially valuable for teams managing multi-cloud environments and seeking to optimize cost, performance, and reliability.

What customer success stories demonstrate Sedai's impact on ECS and cloud optimization?

Notable success stories include KnowBe4 achieving 50% cost savings and $1.2 million AWS bill reduction, Palo Alto Networks saving $3.5 million and reducing Kubernetes costs by 46%, and Belcorp reducing AWS Lambda latency by 77%.

How does Sedai compare to other ECS optimization solutions?

Sedai differentiates itself with 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack cloud coverage, and unique features like release intelligence and plug-and-play implementation. Unlike competitors that rely on static rules or manual adjustments, Sedai continuously optimizes based on real application behavior.

What technical documentation is available for Sedai users?

Sedai provides detailed technical documentation covering platform features, setup, and usage. Resources include datasheets, case studies, and strategic guides, accessible at docs.sedai.io/get-started and sedai.io/resources.

What industries have benefited from Sedai's cloud optimization platform?

Sedai's case studies span industries such as cybersecurity (Palo Alto Networks), IT (HP), financial services (Experian, CapitalOne Bank), security awareness training (KnowBe4), travel (Expedia), healthcare (GSK), car rental (Avis), retail/e-commerce (Belcorp), SaaS (Freshworks), and digital commerce (Campspot).

What support and onboarding resources does Sedai offer for ECS optimization?

Sedai provides personalized onboarding sessions, a dedicated Customer Success Manager for enterprise customers, detailed documentation, a community Slack channel, and email/phone support. A 30-day free trial is also available for risk-free evaluation.


Amazon ECS Optimization Challenges


Benjamin Thomas

CTO

May 10, 2024


Summary

  • Overprovisioning is a significant issue in ECS, with about 65% of containers wasting at least 50% of their CPU and memory resources, leading to nearly a billion containers' worth of compute resources being wasted weekly.
  • Teams often overprovision due to uncertainty about their needs, aiming to ensure availability and performance during peak times or special cases, despite this leading to unnecessary costs.
  • Effective management of ECS and cloud costs overall can significantly affect a company's financial performance. For example, improving cloud cost efficiency by reducing overprovisioning could potentially enhance a company’s bottom-line profit by 17%.
  • ECS optimization also impacts revenue through application performance and availability. A mere 100ms increase in latency can equate to a significant revenue loss, highlighting the critical nature of performance in ECS services.
  • Amazon ECS offers various controls for optimization, including service rightsizing and spot instance usage. These, along with continuous adjustments to respond to changing traffic and application demands, present ongoing challenges and require complex solutions involving both manual and automated strategies.

Cost Challenges

ECS Overprovisioning is a Major Problem

Every year Datadog publishes a report on container usage; we'll focus on their report from November 2023. Datadog found that 65% of containers waste at least 50% of their provisioned CPU and memory, meaning roughly half of all container resources go unused. So if two billion containers are spun up each week, almost a billion containers' worth of compute resources are wasted, according to the report.

Similar reports from other providers support a staggering level of overprovisioning. Most report waste of at least 50% of CPU and memory.

So the industry has a major problem - everybody overprovisions their compute. 

Reasons for Overprovisioning in Amazon ECS

Why do teams overprovision? Most simply don't know what to provision for. Given that uncertainty, they overestimate their needs. The logic is “Let me put in double or triple the capacity I think I need so that things are OK when I hit my seasonality peaks and my special use cases”.

The three most common situations we see driving overprovisioning are:

  1. Developers solve for availability and performance, not cost, when setting configuration, because they want to reduce the risk of deployment failure. They may add extra CPU, memory, and replicas. Engineering teams focus on releasing features, not on what runs in production, so they play it safe by building in a standard way and overprovisioning.
  2. Application owners also default to overprovisioning to reduce performance and availability risks for the services under their control.
  3. Organizations may use “standard” or “t-shirt” compute capacity sizes for applications that have different needs, e.g., using a compute profile of 4 CPU & 8 GB memory across 100 services, while another profile of 2 CPU & 32 GB memory is used across another 50 services.
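The waste created by a t-shirt profile is easy to quantify. The sketch below is purely illustrative: the profile matches the 4 CPU / 8 GB example above, but the observed peak usage figures are hypothetical.

```python
# Illustrative only: quantify waste when a standard "t-shirt" profile is
# applied to a service with smaller actual needs (usage figures are
# hypothetical, not measured from any real workload).

def waste_fraction(provisioned: float, used: float) -> float:
    """Fraction of a provisioned resource that goes unused."""
    return (provisioned - used) / provisioned

# Standard profile from the example above: 4 vCPU, 8 GB memory.
profile_cpu, profile_mem = 4.0, 8.0

# Hypothetical observed peak usage for one service on that profile.
peak_cpu, peak_mem = 1.2, 3.0

cpu_waste = waste_fraction(profile_cpu, peak_cpu)   # 70% of CPU unused
mem_waste = waste_fraction(profile_mem, peak_mem)   # 62.5% of memory unused

print(f"CPU waste: {cpu_waste:.0%}, memory waste: {mem_waste:.1%}")
```

With these assumed numbers, the service wastes well over half of both resources, which is consistent with the Datadog finding cited earlier.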

Managing discounts 

Below are the top priorities of FinOps professionals in the 2024 State of FinOps survey.  While overprovisioning is the #1 priority, it is closely followed by managing discounts.

Successful management of discounts can deliver double digit reductions in spend, but can be difficult to manage due to the need to avoid over committing to overall spend amounts and specific resource types.

Impact of Cloud Costs on Company Financial Performance

Overprovisioning and discount management are not only important to engineering budgets but can be important to company-wide financial performance. Below is a simplified Profit & Loss (P&L) statement for a public security SaaS company. For every $100 million in revenue, approximately 8.6% was spent on cloud costs, which is at the high end. If the company could save 20-30% of that spend, bringing cloud costs down to 6% of revenue, the company's bottom-line profit would increase by 17%.
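The arithmetic behind that claim can be checked in a few lines. Note one assumption: the baseline profit margin of 15% is hypothetical, chosen to be consistent with the article's 17% result; the cloud-cost percentages come from the text.

```python
# Back-of-the-envelope check of the profit uplift from cloud savings.
# Assumption: ~15% baseline bottom-line margin (hypothetical, consistent
# with the article's 17% figure). Cloud-cost percentages are from the text.

revenue = 100_000_000
cloud_before = 0.086 * revenue        # 8.6% of revenue on cloud
cloud_after = 0.060 * revenue         # ~30% reduction brings it to 6.0%
savings = cloud_before - cloud_after  # $2.6M saved

baseline_profit = 0.15 * revenue      # assumed $15M bottom line
uplift = savings / baseline_profit    # roughly 17% profit increase

print(f"Savings: ${savings:,.0f}, profit uplift: {uplift:.0%}")
```

Every dollar of cloud savings falls straight to the bottom line, which is why a modest percentage of revenue translates to a double-digit profit uplift.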


Revenue Impact of Performance and Availability

In addition to cost impacts, the effectiveness of ECS optimization can also affect revenue via the performance and availability of applications if the ECS services play an important role in end customer experience.  The need to avoid outages is widely understood, but performance slowdowns can have the same impact as major outages.  In an ECS context, latency can be a silent killer if small delays across hundreds of microservices add up to a material impact on user experience.

The example below shows that for a business unit with $100M annual revenue, a 100ms slowdown running across the course of a year has an impact equivalent to an extended 88-hour outage.
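The 88-hour equivalence follows from simple arithmetic on the figures in the example:

```python
# Arithmetic behind the 88-hour equivalence (figures from the example).
annual_revenue = 100_000_000                   # $100M business unit
hourly_revenue = annual_revenue / (365 * 24)   # ~$11,400 revenue per hour

slowdown_loss = 0.01 * annual_revenue          # 1% revenue loss = $1M
equivalent_outage_hours = slowdown_loss / hourly_revenue

print(f"~{equivalent_outage_hours:.0f} hour outage equivalent")  # ~88 hours
```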

This example assumes a 100ms slowdown for users translates to 1% lost revenue.  This assumption is based on an early Amazon.com finding.  Below is that finding and a series of others:

The overall importance and timeframe of impact will vary by business (e.g., immediate drops in revenue can occur in ecommerce, SaaS impacts would be slower and tied to contract cycles).

ECS Optimization Challenges

Let's jump into some of the challenges that users face when optimizing ECS:

  • Multiple goals to be addressed
  • Many controls to be optimized
  • Constant change in inputs 

Multiple Goals

ECS optimization must balance three goals:

  • Cost: ensuring that ECS services deliver their functionality at the required availability and performance for the lowest cost. Cost efficiency is driven by both engineering optimization (e.g., rightsizing instances) and financial optimization (e.g., savings plans).
  • Performance: ensuring that the application meets latency requirements, e.g., for a web application, page load times stay under a given threshold so that end users do not experience delays.
  • Availability: ensuring that requests to ECS services can be served by the application. Historically, time-based metrics (uptime) were used, but in a microservice environment request-based metrics such as FCI rate (Failed Customer Interaction rate) can be more effective.

Optimizing each service with respect to all three objectives can be challenging. In this guide we will look at the use of SLOs to break down this problem: SLOs let us approach it as minimizing cost, subject to meeting performance and availability needs. We'll also look at whether workloads can tolerate lower availability thresholds, which allows the use of lower-cost spot instances.
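The "minimize cost subject to SLOs" framing can be sketched as a simple guardrail: only consider a cheaper configuration when both SLOs hold with room to spare. The thresholds and headroom factor below are illustrative assumptions, not recommendations.

```python
# A minimal sketch of "minimize cost subject to SLOs". The SLO targets
# (200ms p99 latency, 0.1% FCI rate) and 80% headroom are assumptions
# for illustration; real targets are service-specific.

def meets_slos(p99_latency_ms: float, fci_rate: float,
               latency_slo_ms: float = 200.0,
               fci_slo: float = 0.001) -> bool:
    """True if the service is inside both its latency and availability SLOs."""
    return p99_latency_ms <= latency_slo_ms and fci_rate <= fci_slo

def can_downsize(p99_latency_ms: float, fci_rate: float,
                 headroom: float = 0.8) -> bool:
    """Only downsize when SLOs are met using at most 80% of their budgets."""
    return meets_slos(p99_latency_ms / headroom, fci_rate / headroom)

print(can_downsize(120.0, 0.0002))  # well inside both SLOs
print(can_downsize(190.0, 0.0002))  # too close to the latency SLO
```

The headroom term matters: a service sitting right at its SLO boundary has no budget left to absorb the performance cost of a smaller task size.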

Multiple Controls for Amazon ECS Optimization

Amazon provides multiple controls to optimize Amazon ECS for cost and performance:

  • Service Rightsizing: You can rightsize your ECS services, either vertically (increasing or decreasing the amount of CPU and Memory for a given task or service), or horizontally by modifying the number of tasks running. 
  • Instance Rightsizing: You can adjust your cluster by managing your container instances if you use an EC2-backed cluster. Key controls include the number of instances, instance types and auto scaling groups (ASGs).
  • Purchasing Commitments (excluding spot): You can use the discounted purchasing commitments that AWS offers, including Savings Plans and Reserved Instances. 
  • Spot Instances: Spot is one of several pricing models for Amazon ECS offered by Amazon. They are a good fit for fault-tolerant use cases. Most commonly this is the case in development or pre-production environments where you may be able to use Spot because even if the environment is down for a few minutes, the impact is minimal. Even in production, if the application is fault-tolerant, you may be able to use Spot for stateless services. Spot options include EC2 and Fargate Spot.

It can be challenging to configure all these controls to meet our goals and continually adjust them.   Later in the guide we’ll go through solutions to this challenge including manual, automated and autonomous approaches.

Constantly Changing Inputs 

We also need to look at how we're performing, and how various settings on these controls behave under different application inputs, which include: 

  • Traffic: We need to look at the amount of inbound traffic and its seasonality (e.g., across the course of the day, on different days of the week, etc.). 
  • Releases: New versions can change application cost and performance. For example, a release adds a more advanced recommendation function that slows performance unless more compute is added.
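Traffic seasonality is usually summarized before it can drive scaling decisions. A toy illustration, using entirely synthetic request counts:

```python
# Toy illustration of traffic seasonality: average requests per hour of
# day, the kind of profile a scaling policy must track. Data is synthetic.
from collections import defaultdict

# (hour_of_day, request_count) samples; invented values for illustration.
samples = [(9, 1200), (9, 1400), (14, 2000), (14, 2200), (2, 300), (2, 100)]

totals, counts = defaultdict(float), defaultdict(int)
for hour, requests in samples:
    totals[hour] += requests
    counts[hour] += 1

hourly_avg = {h: totals[h] / counts[h] for h in totals}
print(hourly_avg)  # midday traffic is ~10x the overnight trough
```

A capacity plan built for the 2 a.m. trough fails at the 2 p.m. peak, and one built for the peak wastes money overnight; this is exactly why changing inputs force continual adjustment.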

Later in the guide we'll look at how adaptation is possible under manual, automated and autonomous approaches.

Large Set of Metrics Needed

To relate these inputs and controls to our goals, you need to look at a series of metrics, including:

  • CPU utilization and memory utilization (which drive cost)
  • Performance or latency
  • Availability (e.g., Failed Customer Interaction rate)

We now have all these complex combinations to look at, even to optimize only one application to achieve the best resource utilization. Many organizations have thousands of applications to manage, so the complexity grows significantly. Managing 100 microservices that are released weekly involves analyzing 7,200 combinations of metrics as shown below. This includes six performance metrics, various traffic patterns, and four monthly releases. Optimizing each service is a complex task that requires careful analysis and monitoring to ensure smooth operation.
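The 7,200 figure can be reproduced from the factors listed, with one assumption made explicit: the text says "various traffic patterns", and three patterns is the count that makes the numbers work out.

```python
# Reproducing the 7,200 metric-combination figure. The three traffic
# patterns are an assumption (the text only says "various"); the other
# factors are stated in the article.
services = 100
performance_metrics = 6   # six performance metrics per service
traffic_patterns = 3      # assumed: e.g., weekday, weekend, peak
releases_per_month = 4    # weekly releases

combinations = services * performance_metrics * traffic_patterns * releases_per_month
print(combinations)  # 7200
```

Scaling `services` to 2,000 pushes the count to 144,000 combinations per month, which is the fleet-size problem discussed next.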


Now imagine the challenge of optimizing a fleet of 2,000 microservices rather than 100. Constantly optimizing a system of that size on a daily basis is nearly impossible for any human, which highlights the complexity of ongoing cost and performance optimization.

| Rank | Priority | Percent of Respondents |
| --- | --- | --- |
| 1 | Reducing waste or unused resources | 50.3% |
| 2 | Managing commitment-based discounts | 43.3% |
| 3 | Accurate forecasting of spend | 41.3% |

FinOps Professionals Priorities

| Strategy | Outage | Slowdown |
| --- | --- | --- |
| Outage / Performance Issue | 88 hour outage | 100ms slowdown |
| Impact | $11,000 lost revenue per hour | 1% revenue lost |
| Business Cost | $1M | $1M |

Relative Impacts of Outages and Slowdowns

| Company | Finding |
| --- | --- |
| Google | 500ms latency decreases traffic 20% |
| Amazon | 100ms latency change drives 1% revenue gain |
| Zalando | 100ms improvement drives 0.7% revenue gain |
| Akamai | 100ms delay reduces conversion rates by 7% |
| Booking.com | 30% latency increase costs more than 0.5% in conversion rate |

Findings on Latency Impact on Traffic & Revenue