Key Takeaways
- Cloud rightsizing matches compute, memory, & storage to observed workload demand, recovering oversized capacity without breaking latency or SLOs.
- Roughly 20–30% of cloud spend goes to oversized resources, making rightsizing one of the most direct FinOps actions an engineering team can take.
- One-time rightsizing audits decay within weeks because deployments & traffic shifts change workload behavior faster than quarterly reviews can track.
- EC2, Kubernetes, & Lambda each require a different playbook: instance-family selection, requests/limits/HPA tuning, & memory–latency curve tuning, respectively.
- Pair every rightsizing change with canary deployments & SLO-based rollback gates so a bad sizing decision never becomes a customer incident.
Your monitoring dashboard shows the EC2 batch cluster at 18% CPU overnight. The checkout pods have a 4 GiB memory limit, with actual usage at 900 MiB at peak. The Lambda image-resize function is configured at 512 MB, but peaks at 180 MB. You know the waste is there. You rightsize and save money for six weeks. Then a new feature ships, traffic patterns shift, and oversized capacity starts accumulating again.
The FinOps Foundation's State of FinOps survey (2024) ranks cloud waste management as the top practitioner priority, with 20–30% of cloud spend going to oversized resources across most infrastructure stacks. That waste comes from static allocations drifting away from actual workload demand. A single audit will not close the gap: the resources that are oversized today will be different from the ones that are oversized after the next deployment.
This playbook covers how to reduce cloud costs through rightsizing: seven steps for sizing EC2, Kubernetes, & Lambda workloads against observed peak behavior. The seventh step, continuous re-evaluation, is the one that preserves savings. The running example is an e-commerce team running EKS for checkout, Lambda for image resize, & EC2 for overnight batch ETL.
Summary
| Question | Short answer |
| --- | --- |
| What is cloud rightsizing? | Matching resource allocation to a workload's live demand. |
| Why does cloud rightsizing decay between audits? | Deployments & traffic shift faster than quarterly reviews can track. |
| Which metrics should engineers use to rightsize workloads? | p95 latency, CPU steady-state, memory headroom, & queue depth. |
| How does cloud rightsizing differ across EC2, Kubernetes, and Lambda? | Instance family choice / requests-limits-HPA / memory–latency curve. |
In This Article
- What Is Cloud Rightsizing?
- Why Does Rightsizing Decay After the First Audit?
- Step 1: Build a Workload Baseline from Production Metrics
- Step 2: Pick the Right Utilization Metrics for Each Workload
- Step 3: Match Instance Families to Workload Patterns
- Step 4: Rightsize Kubernetes Requests, Limits & HPA Targets
- Step 5: Tune Lambda Memory for the Cost-Latency Curve
- Step 6: Validate Every Change with Canaries & Rollback Gates
- Step 7: How Do You Make Rightsizing Continuous?
- How Does Sedai Rightsize Continuously?
- Customer Rightsizing Outcomes
- Where Do Rightsizing Teams Go from Here?
- FAQs About Cloud Rightsizing
- Sources
What Is Cloud Rightsizing?
Cloud rightsizing matches compute, memory, & storage allocations to observed workload demand across AWS, Azure, GCP, & Kubernetes. Engineering teams use it to recover the 20–30% of cloud spend that goes to oversized resources (FinOps Foundation, 2024) while protecting latency, throughput, & SLOs. EC2, Kubernetes, & Lambda each need different sizing models plus continuous re-evaluation.
Why Does Rightsizing Decay After the First Audit?
A rightsizing audit is only correct for the workload behavior observed during that audit. Workload behavior does not stay static.
Deployments change memory footprints. New features shift CPU saturation points. Traffic campaigns double load for a week, then drop. Each event can invalidate a rightsizing decision made three months ago, but a quarterly audit cycle does not move at deployment cadence.
The e-commerce team resized its batch workers in January based on a 90-day CPU average. In March, a real-time segment join doubled the memory each run required. By April, the "rightsized" instance was undersized, and the January savings had reversed.
The failure mode is operational: teams treat rightsizing as a project to complete rather than a property to maintain.
Step 1: Build a Workload Baseline from Production Metrics
Before changing resources, build a two-to-four-week baseline of actual utilization using AWS Compute Optimizer for EC2 & Lambda, CloudWatch for raw metrics, & Prometheus for Kubernetes. Cover at least one full weekly traffic cycle; extend to the full business cycle for seasonal workloads.
The e-commerce team needs separate baselines: CPU, memory, disk I/O, & network for the EC2 batch workers, including end-of-month spikes; CPU requests, memory working set, p95 latency, & pod restart frequency for the EKS checkout pods; duration, memory used, error rate, & concurrency for the Lambda image-resize function.
The key measurement is p95 or p99 utilization during peak load windows, not the average. An instance that looks 25% utilized on average may run at 90% CPU for 30 minutes every morning during batch ingestion. Sizing to the average creates a latency failure at the actual peak, which is why metric selection comes before instance selection.
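The gap between an average and a peak-window percentile is easy to see with a few lines of Python. This is a minimal sketch on made-up data: the sample values, the 5-minute interval, and the `percentile` helper are assumptions for illustration, not output from CloudWatch or Prometheus.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it (pct is an integer 1-100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical 5-minute CPU samples for one day (288 points): 20% all day,
# except a 30-minute ingestion burst at 90% inside the 08:00-09:00 hour.
day = [20.0] * 288
burst_start = 8 * 12  # index of the 08:00 sample
for i in range(burst_start, burst_start + 6):
    day[i] = 90.0

peak_window = day[burst_start:burst_start + 12]  # 08:00-09:00 only

daily_avg = mean(day)                     # hides the burst
daily_p95 = percentile(day, 95)           # a 30-minute burst is under 5% of the day
window_p95 = percentile(peak_window, 95)  # the signal to size against

print(f"daily avg={daily_avg:.1f}%  daily p95={daily_p95}%  "
      f"peak-window p95={window_p95}%")
```

In this toy series both the daily average and the daily p95 sit near 20%, while the p95 inside the peak window is 90%, which is why the percentile has to be taken over the peak load window rather than the whole day.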
Step 2: Pick the Right Utilization Metrics for Each Workload
Sizing every workload to the same metric fails. CPU steady-state misreads memory-bound work. Average memory makes latency-sensitive APIs look safer than they are.
Match the primary sizing signal to the workload's actual constraint:
- CPU-bound workloads (batch ETL, data transformation): size to p95 CPU utilization across full job runs, not averages.
- Memory-bound workloads (Java services, analytics workers, caches): size to peak working set plus a 20–25% headroom margin, never to average memory.
- Latency-sensitive services (APIs, checkout services): size to p99 CPU & memory at peak request rate; the SLO is the primary guardrail.
- I/O-bound workloads (database replicas, streaming consumers): size to throughput & queue depth, not CPU.
For the checkout service on EKS, the right metric is p99 CPU at peak checkout volume during flash sales, not the weekly average. A 40% CPU average during off-peak hours reflects headroom held for predictable traffic peaks, not requests set too high. Applying the wrong metric lets waste compound through the quarter. Once that signal is clear, the next choice is the resource shape that fits it.
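The mapping from workload class to sizing signal, and the headroom rule for memory-bound work, can be written down as data plus one small function. This is a sketch only: the dictionary keys and the `memory_target_mib` helper are hypothetical names, and the 900 MiB input simply reuses the checkout pods' peak from the introduction as an illustrative number.

```python
import math

# Primary sizing signal per workload class, following the list above.
SIZING_SIGNAL = {
    "cpu_bound": "p95 CPU across full job runs",
    "memory_bound": "peak working set plus 20-25% headroom",
    "latency_sensitive": "p99 CPU and memory at peak request rate",
    "io_bound": "throughput and queue depth",
}


def memory_target_mib(peak_working_set_mib, headroom_pct=25):
    """Peak working set plus a headroom margin, rounded up to a whole MiB."""
    if peak_working_set_mib <= 0:
        raise ValueError("peak working set must be positive")
    if not 0 <= headroom_pct <= 100:
        raise ValueError("headroom_pct must be between 0 and 100")
    return math.ceil(peak_working_set_mib * (100 + headroom_pct) / 100)


# Illustrative input: a 900 MiB peak working set.
low = memory_target_mib(900, headroom_pct=20)
high = memory_target_mib(900, headroom_pct=25)
print(f"memory target between {low} MiB and {high} MiB")
```

The function deliberately takes the peak, not the average, so there is no way to call it with the wrong statistic by accident.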
Step 3: Match Instance Families to Workload Patterns
After you confirm the right utilization metric, match the instance family to the workload's resource ratio. Native tools can generate the initial shortlist: AWS Compute Optimizer for EC2, Azure Advisor for Azure VMs, & GCP Recommender for Compute Engine. These tools size from historical utilization over a fixed lookback window, without knowledge of application SLOs, so treat their output as a starting point, not a final recommendation.
For the e-commerce team's EC2 batch workers, compute-intensive overnight and idle between runs, a compute-optimized family (c-series on AWS) fits better than a general-purpose m-series. If the batch job is memory-intensive from joining large datasets, an r-series may produce a lower total cost per job run than a larger c-series.
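One way to make the family choice concrete is to compare the workload's memory-to-vCPU ratio at peak against the ratio each family offers. The sketch below assumes roughly 2, 4, and 8 GiB per vCPU for the c, m, and r families; treat those ratios as an approximation to verify against current instance specifications. `suggest_family` is a hypothetical helper, not part of any AWS tool.

```python
# Approximate memory-per-vCPU ratios in GiB (an assumption to verify
# against the published instance specifications before relying on it).
FAMILY_GIB_PER_VCPU = {"c": 2, "m": 4, "r": 8}


def suggest_family(peak_vcpus, peak_memory_gib):
    """Return the leanest family whose memory ratio covers the workload."""
    if peak_vcpus <= 0 or peak_memory_gib <= 0:
        raise ValueError("peak_vcpus and peak_memory_gib must be positive")
    needed_ratio = peak_memory_gib / peak_vcpus
    by_ratio = sorted(FAMILY_GIB_PER_VCPU.items(), key=lambda item: item[1])
    for family, ratio in by_ratio:
        if needed_ratio <= ratio:
            return family
    # Needs more memory per vCPU than any listed family offers: start from
    # the memory-optimized family and size up from there.
    return "r"


# Hypothetical batch profiles at peak.
print(suggest_family(16, 24))  # CPU-heavy ETL: 1.5 GiB per vCPU
print(suggest_family(8, 60))   # large joins: 7.5 GiB per vCPU
```

A ratio check like this only produces a shortlist. The cost per job run on each candidate still has to be measured.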
Instance-family mismatch grows over time as workload patterns shift. The drift applies equally to EC2, Azure VM, & GCP compute selection: the family choice made at launch becomes wrong as the workload evolves. Kubernetes has the same problem, but the knobs are pod requests, limits, & HPA targets.
Step 4: Rightsize Kubernetes Requests, Limits & HPA Targets
Kubernetes rightsizing has three dials:
- Resource request (what the scheduler reserves)
- Limit (the cap before throttling or OOMKill)
- HPA target utilization (the threshold at which the autoscaler adds pods).
The Kubernetes resource management documentation covers how these interact with the scheduler & kubelet. The critical failure mode is a CPU limit set too close to the request: under bursty traffic, the container hits its limit, the kernel throttles, & p99 latency spikes before HPA adds capacity.
For the checkout service:
- Request: p50 CPU & memory under normal load, which is what the scheduler reserves for bin-packing.
- Limit: p99 under peak load plus 15–20% headroom. Keep memory limits high enough to avoid OOMKills during traffic surges.
- HPA target: 60–70% CPU, not 80–90%. A target too close to the limit means HPA cannot respond before latency degrades.
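The three rules above reduce to a small calculation once the percentiles are known. The following is a sketch under stated assumptions: the percentile inputs are hypothetical values in millicores and MiB, `size_container` is an illustrative helper, and the returned dictionary only mimics the shape of a container `resources` stanza rather than reproducing any Kubernetes API object.

```python
import math


def size_container(p50_cpu_m, p99_peak_cpu_m, p50_mem_mib, p99_peak_mem_mib,
                   headroom_pct=20, hpa_target_pct=65):
    """Requests at p50, limits at peak p99 plus headroom, HPA target 60-70%."""
    if not 15 <= headroom_pct <= 20:
        raise ValueError("headroom_pct is expected to be 15-20")
    if not 60 <= hpa_target_pct <= 70:
        raise ValueError("hpa_target_pct is expected to be 60-70")

    def with_headroom(value):
        return math.ceil(value * (100 + headroom_pct) / 100)

    return {
        "resources": {
            "requests": {"cpu": f"{p50_cpu_m}m", "memory": f"{p50_mem_mib}Mi"},
            "limits": {
                "cpu": f"{with_headroom(p99_peak_cpu_m)}m",
                "memory": f"{with_headroom(p99_peak_mem_mib)}Mi",
            },
        },
        "hpa": {"cpu_target_utilization_pct": hpa_target_pct},
    }


# Hypothetical checkout percentiles: CPU in millicores, memory in MiB.
sizing = size_container(p50_cpu_m=250, p99_peak_cpu_m=600,
                        p50_mem_mib=600, p99_peak_mem_mib=900)
print(sizing)
```

Keep in mind that HPA CPU utilization is measured as a percentage of the request, so changing the request also moves the absolute point at which the autoscaler adds pods. That coupling is one reason to recompute all three values together.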
The Kubernetes Vertical Pod Autoscaler (VPA) recommends & applies request/limit changes automatically, but restarts pods to do so. For stateful services or those without PodDisruptionBudgets, treat VPA output as a baseline, not an automated change. Kubernetes resource optimization at cluster scale requires treating node pool sizing, requests, limits, & HPA targets as a system, not independent settings. Serverless sizing uses a different control: memory.
Step 5: Tune Lambda Memory for the Cost-Latency Curve
AWS Lambda bills compute as duration × memory allocated, plus a per-request fee. Increasing memory raises the price per millisecond but usually shortens execution, because Lambda allocates CPU in proportion to memory. The optimal setting holds p99 latency within the SLO at the lowest cost per invocation.
The AWS Lambda Power Tuning tool runs a function at multiple memory sizes and plots cost & duration. For the image-resize Lambda, increasing from 512 MB to 1024 MB can lower total cost per invocation, provided the work finishes in less than half the time; at exactly half the time, the compute cost is unchanged and only latency improves.
Run Lambda Power Tuning across the 256 MB–3008 MB range with a representative payload; the cost curve is typically U-shaped, so its minimum is the initial sizing target. Validate p99 latency at the chosen setting under peak concurrency. Latency is the gate, and cost is the objective. Re-run Lambda Power Tuning whenever the Lambda package changes substantially: library upgrades & runtime version changes shift the curve.
Lambda cost optimization tools & patterns cover the interplay between memory, cold starts, & provisioned concurrency. Provisioned concurrency is the next tuning variable once memory is calibrated. With sizing decided, rollout safety becomes the next risk.
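The "latency is the gate, cost is the objective" rule can be sketched as a selection over measured results. Everything numeric here is an assumption for illustration: the per-GB-second and per-request prices are placeholders to replace with the current rates for your region and architecture, and the duration measurements are invented rather than taken from a real Power Tuning run.

```python
# Assumed prices, for illustration only; check the current Lambda pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def cost_per_invocation(memory_mb, avg_duration_ms):
    """Compute charge (GB-seconds x price) plus the flat request fee."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


def cheapest_within_slo(measurements, slo_p99_ms):
    """measurements maps memory_mb -> (avg_duration_ms, p99_duration_ms)."""
    eligible = {mb: avg for mb, (avg, p99) in measurements.items()
                if p99 <= slo_p99_ms}
    if not eligible:
        raise ValueError("no memory setting meets the latency SLO")
    return min(eligible, key=lambda mb: cost_per_invocation(mb, eligible[mb]))


# Hypothetical results for the image-resize function.
measured = {
    256: (2000, 2400),
    512: (1000, 1200),
    1024: (450, 520),
    2048: (260, 300),
    3008: (230, 260),
}

best = cheapest_within_slo(measured, slo_p99_ms=800)
print(f"initial sizing target: {best} MB")
```

With these invented numbers the cheapest setting that meets an 800 ms SLO is 1024 MB, and tightening the SLO to 400 ms moves the choice to 2048 MB even though it costs more per invocation.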
Step 6: Validate Every Change with Canaries & Rollback Gates
A rightsizing change that breaks latency is worse than no change. For every resize, apply a canary rollout with defined rollback gates before promoting to full production traffic.
The Google SRE Workbook's guidance on implementing SLOs establishes the framework: define error budget thresholds before the change, observe latency & error rates over a defined window, & roll back if any SLO gate is breached.
For the checkout service on EKS: deploy via a PodDisruptionBudget-aware rolling update; monitor p99 latency & checkout error rate for 30 minutes; roll back immediately if p99 latency rises more than 10 ms above baseline or the error rate rises by more than 0.1%; promote to full rollout only if both metrics stay within those gates.
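A rollback gate of this kind is a comparison between a baseline snapshot and a canary snapshot taken over the observation window. The sketch below is one possible shape, not a specific deployment tool's API: `Snapshot`, `Gates`, and `evaluate_canary` are hypothetical names, and the 0.1% error-rate gate is read here as 0.1 percentage points, which is an interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    p99_latency_ms: float
    error_rate_pct: float


@dataclass(frozen=True)
class Gates:
    max_p99_increase_ms: float = 10.0
    # Interpreted as percentage points above baseline (an assumption).
    max_error_rate_increase_pct: float = 0.1


def evaluate_canary(baseline, canary, gates=Gates()):
    """Return ("promote" | "rollback", list of breached gates)."""
    breaches = []
    if canary.p99_latency_ms - baseline.p99_latency_ms > gates.max_p99_increase_ms:
        breaches.append("p99_latency")
    if canary.error_rate_pct - baseline.error_rate_pct > gates.max_error_rate_increase_pct:
        breaches.append("error_rate")
    return ("rollback" if breaches else "promote", breaches)


# Hypothetical 30-minute observation windows.
baseline = Snapshot(p99_latency_ms=120.0, error_rate_pct=0.20)
healthy = Snapshot(p99_latency_ms=125.0, error_rate_pct=0.25)
degraded = Snapshot(p99_latency_ms=135.0, error_rate_pct=0.20)

print(evaluate_canary(baseline, healthy))
print(evaluate_canary(baseline, degraded))
```

Because the thresholds live in a frozen `Gates` object created before the rollout, the decision cannot be renegotiated after the metrics come in.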
The SLO-based rollback patterns that make this work at scale require thresholds defined before the change. If a rightsizing decision is reviewed only after a p99 spike, the spike is a customer incident, not a rollback.
Step 7: How Do You Make Rightsizing Continuous?
Steps 1–6 describe one rightsizing cycle, enough to optimize cloud costs in the short term. But every deployment, traffic shift, & infrastructure change can invalidate a prior sizing decision, and those changes arrive faster than quarterly reviews can track.
Three practices make rightsizing continuous rather than periodic.
- Signal monitoring: Alert when any workload's observed utilization diverges from its sizing target by more than a defined threshold. A 20% delta over 72 hours is a signal to re-evaluate before waste compounds.
- Deployment-triggered re-evaluation: Attach a post-deploy check to every CI/CD pipeline that compares the deployed service's utilization against its two-week baseline. If the delta exceeds the threshold, a rightsizing ticket opens automatically.
- Cross-workload consistency: Apply the same method across EC2, Kubernetes, & Lambda on a common cadence. When each team runs rightsizing independently on different cycles, the cloud environment with the least engineering attention drifts toward waste.
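The signal-monitoring and post-deploy checks above share one core test: has observed utilization moved away from the sizing target by more than the threshold over the window? A minimal sketch follows, assuming hourly samples and interpreting "a 20% delta over 72 hours" as the 72-hour mean deviating from the target by more than 20% in relative terms. Both the interpretation and the `needs_reevaluation` name are assumptions.

```python
def needs_reevaluation(hourly_utilization, sizing_target,
                       threshold=0.20, window_hours=72):
    """True when the mean of the last window_hours samples deviates from
    the sizing target by more than threshold (relative to the target)."""
    if sizing_target <= 0:
        raise ValueError("sizing_target must be positive")
    if len(hourly_utilization) < window_hours:
        return False  # not enough history to judge yet
    recent = hourly_utilization[-window_hours:]
    window_mean = sum(recent) / len(recent)
    return abs(window_mean - sizing_target) / sizing_target > threshold


# Hypothetical service sized for 60% utilization.
after_deploy = [40.0] * 72   # a release cut CPU use by a third
steady_state = [55.0] * 72   # normal variation around the target

print(needs_reevaluation(after_deploy, sizing_target=60.0))
print(needs_reevaluation(steady_state, sizing_target=60.0))
```

The check is symmetric on purpose: utilization far above the target signals undersizing just as utilization far below it signals waste, and either one should open a re-evaluation.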
Recommendation-only tooling does not fix the execution problem: the bottleneck is execution speed, not insight quality.
How Does Sedai Rightsize Continuously?
The Challenge: Static Rightsizing Audits Decay Faster Than Teams Can Review Them
Quarterly audits go stale within weeks. Teams face a choice between under-provisioning (latency risk) & over-provisioning (waste). Native recommender tools identify possible changes but require a human review step before any action. The result is a backlog of stale recommendations, savings that decay, & a cycle that restarts next quarter.
Sedai’s Approach: Continuous, Application-Aware Rightsizing Without Hand-Coded Rules
Sedai's autonomous, application-aware optimization observes production metrics across your workloads: latency, throughput, queue depth, CPU steady-state, & memory usage. For each workload, Sedai selects the right size based on observed behavior, not static thresholds or CPU averages. It applies changes conservatively, verifying each change against live SLO bounds, & rolls back immediately if any metric degrades.
Sedai differs from rule-based automation in how it decides when to act. Automation fires a rule when a threshold is crossed. Sedai uses each workload's observed behavior to decide whether current latency, throughput, & error-rate signals allow a resize.
A batch worker that looks idle at 2 PM but saturates at 2 AM gets treated differently from a checkout service that holds steady CPU through the day. Sedai operates across EC2, EKS/AKS/GKE, Lambda, & ECS, re-evaluating after every deployment so sizing reflects current workload behavior, not the state from six weeks ago.
The Outcome: 27% AWS Cost Reduction and $1.2M Saved at KnowBe4
KnowBe4 used Sedai to cut AWS costs by 27% & save $1.2M through continuous rightsizing across their ECS & Lambda fleet.
Customer Rightsizing Outcomes
KnowBe4
KnowBe4 needed to rightsize thousands of ECS & Lambda services without adding manual review overhead per change. With Sedai's continuous rightsizing, they cut AWS costs by 27% & saved $1.2M without a single production incident.
"By having Sedai in place, we're not just saving money. We're preventing would-be customer problems before they become an issue."
— Matt Duren, Vice President of Engineering, KnowBe4
Palo Alto Networks
Palo Alto Networks needed to reduce wasted cloud spend across a back-end environment while protecting availability. Sedai's autonomous optimization saved $3.5M in cloud costs.
"Sedai has helped us save millions of dollars by optimizing & managing our own back-end services. But most importantly, what Sedai has done very well is allow us to respond in real time when anomalies are detected."
— Suresh Sangiah, Senior Vice President of Engineering, Palo Alto Networks
Where Do Rightsizing Teams Go from Here?
The six execution steps cover the full rightsizing cycle:
- Baseline from production metrics
- Pick the right utilization metric per workload
- Match instance families to demand
- Rightsize Kubernetes requests, limits & HPA targets
- Tune Lambda to the cost-latency curve
- Validate every change with SLO-gated canaries
The step engineers skip most often is the seventh: making it continuous. Workloads change at deployment cadence, & a rightsizing practice that runs quarterly will lag behind spend.
For teams ready to put Step 7 into production, Autonomous FinOps: The Future of Cloud Cost Management explains how to move from periodic rightsizing events to a continuous, application-aware practice.
FAQs About Cloud Rightsizing
How Often Should Engineers Rightsize Cloud Resources?
Rightsizing should be continuous, not quarterly. Workloads change with deployments, traffic spikes, and infrastructure updates. At a minimum, review resources after major deployments or traffic events. For large environments, automated, deployment-triggered reviews are the most effective approach.
What's the Difference Between Cloud Rightsizing and Autoscaling?
Rightsizing adjusts the size of each resource (instances, containers, Lambda memory), while autoscaling adjusts the number of resources running. Rightsize first to eliminate waste, then use autoscaling to reduce costs further during off-peak periods.
Can Rightsizing Hurt Application Performance?
Yes, if done incorrectly. Common mistakes include sizing for average usage instead of peak demand and removing too much headroom. Use p95/p99 metrics, maintain 15–20% buffer capacity, and validate changes with canary deployments before full rollout.
Which Workloads Benefit Most from Continuous Rightsizing?
Variable-demand workloads benefit the most, including e-commerce services, Lambda functions, and seasonal batch jobs. Since their resource needs change frequently, continuous rightsizing prevents ongoing over-provisioning and unnecessary costs.
How Do You Rightsize Kubernetes Without Breaking Pods?
Set requests near p50 utilization for efficient scheduling and limits at p99 utilization plus 15–20% headroom. Keep HPA targets around 60–70%, roll out changes gradually, and monitor latency and errors with clear rollback thresholds.
What Metrics Indicate a Workload Is Over-Provisioned?
For EC2 and Kubernetes, look for CPU below 20%, memory usage below 40% of limits, and no throttling or OOMKills over several weeks. For Lambda, memory usage below 50% of allocation and execution times well below configured timeouts often signal over-provisioning.
Why Don't AWS Compute Optimizer Recommendations Stay Relevant?
Compute Optimizer relies on historical usage data, typically from the previous 14 days. New deployments or workload changes can quickly make recommendations outdated. Because recommendations also require manual review, they often become stale before implementation. Continuous automated re-evaluation helps keep sizing aligned with current demand.
Sources
- FinOps Foundation, Key Priorities Shift in 2024 (2024): https://www.finops.org/insights/key-priorities-shift-in-2024/
- AWS, What Is AWS Compute Optimizer? (2025): https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html
- Kubernetes, Resource Management for Pods & Containers (2025): https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Kubernetes, Vertical Pod Autoscaler (VPA) (2025): https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
- Google SRE Workbook, Implementing SLOs (2020): https://sre.google/workbook/implementing-slos/
- Casalboni, A., AWS Lambda Power Tuning (2025): https://github.com/alexcasalboni/aws-lambda-power-tuning
- Microsoft, Azure Advisor Cost Recommendations (2025): https://learn.microsoft.com/en-us/azure/advisor/advisor-cost-recommendations
- Google Cloud, Cloud Recommender Overview (2025): https://cloud.google.com/recommender/docs/overview
- Sedai, KnowBe4 Customer Story: 27% AWS Cost Savings, $1.2M Saved: https://sedai.io/blog/knowbe4
- Sedai, Palo Alto Networks Customer Story: $3.5M Saved: https://sedai.io/blog/palo-alto-networks
