Kubernetes Autoscalers: Fundamentals & Best Practices
What is the role of autoscaling in Kubernetes for cost and performance optimization?
Autoscaling in Kubernetes dynamically adjusts resources to match application demand, ensuring steady performance while minimizing costs. By automatically scaling pods and nodes, organizations can avoid overprovisioning and underutilization, which are common sources of cloud waste. This approach helps maintain high availability and reliability without excessive spending during low-traffic periods.
What are the main types of autoscalers available in Kubernetes?
The main autoscalers in Kubernetes are the Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler. HPA adjusts the number of pods based on resource usage, VPA modifies resource requests and limits for individual pods, and Cluster Autoscaler adds or removes nodes to ensure sufficient capacity for pod scheduling.
How does the Horizontal Pod Autoscaler (HPA) work in Kubernetes?
The HPA polls metrics such as CPU and memory usage every 30 seconds. When usage crosses a predefined threshold, it increases or decreases the number of pod replicas in the target Deployment or ReplicationController. Each scaling operation is followed by a cooldown period of roughly 3-5 minutes. HPA can also use custom metrics and is ideal for stateless applications with fluctuating demand.
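For reference, a minimal HPA manifest using the `autoscaling/v2` API might look like the sketch below; the Deployment name `web-app`, the replica bounds, and the 70% CPU target are illustrative values, not defaults.

```yaml
# Minimal HPA sketch (autoscaling/v2). The target Deployment "web-app"
# and the 70% CPU threshold are illustrative, not defaults.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```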
What is the Vertical Pod Autoscaler (VPA) and when should it be used?
The VPA adjusts the CPU and memory requests and limits for individual pods, scaling them up or down based on utilization. It is best suited for stateful applications, such as databases, that cannot easily scale horizontally. VPA typically operates in recommendation mode but can also auto-update resources, which may require pod restarts due to Kubernetes limitations.
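A minimal VPA sketch is shown below; it assumes the VPA components are installed in the cluster, and the target Deployment name `postgres` is illustrative. Starting in recommendation mode (`updateMode: "Off"`) avoids unexpected pod restarts.

```yaml
# VPA sketch; requires the VPA admission controller, recommender, and
# updater to be installed. "Off" produces recommendations only; "Auto"
# applies them but restarts pods. The Deployment name is illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: postgres
  updatePolicy:
    updateMode: "Off"   # switch to "Auto" once recommendations look sane
```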
How does the Cluster Autoscaler enhance resource availability in Kubernetes?
The Cluster Autoscaler monitors pending pods every 10 seconds. If a pod cannot be scheduled due to insufficient resources, it communicates with the cloud provider to add a new node. Once the node is available, the Kubernetes scheduler assigns pending pods to it, ensuring optimal resource utilization and application availability.
What are the main limitations of Kubernetes autoscalers like HPA and VPA?
Key limitations include significant configuration overhead, the need for benchmarking and profiling, and the inability to use HPA and VPA together due to conflicting scaling behaviors. HPA cannot scale down to zero, and VPA requires at least two healthy pod replicas and a minimum memory allocation of 250 MB. Both lack predictive scaling and reinforcement learning capabilities, and VPA may require pod restarts to apply resource changes.
When should you use HPA versus VPA in Kubernetes?
HPA is best for stateless applications that can scale horizontally and experience fluctuating demand, especially when combined with a cluster autoscaler for cost savings during off-peak hours. VPA is ideal for stateful applications like databases, where horizontal scaling is complex. VPA is often used in recommendation mode to analyze resource usage without making automatic changes.
What configuration challenges do Kubernetes autoscalers present?
Configuring autoscalers requires selecting appropriate metrics, benchmarking applications, tuning refresh intervals, and architecting cluster capacity. These steps must be repeated for each code iteration, creating significant overhead for administrators and increasing the risk of misconfiguration.
What is KEDA and how does it relate to Kubernetes autoscaling?
KEDA (Kubernetes Event-driven Autoscaling) is a tool that enables event-driven scaling in Kubernetes. It can trigger scaling actions and manage HPAs based on various event sources, such as SQS queues and Lambda events. KEDA supports scaling from zero and is useful for workloads that require dynamic, event-based scaling.
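As a sketch, a KEDA `ScaledObject` driven by SQS queue depth might look like the following; the Deployment name, queue URL, region, and target queue length are placeholders, and the AWS credentials setup (e.g., a TriggerAuthentication) is omitted.

```yaml
# KEDA ScaledObject sketch: scale the "worker" Deployment on SQS depth.
# Queue URL, region, and queueLength are placeholders; auth config omitted.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker        # Deployment to scale
  minReplicaCount: 0    # KEDA can scale to zero, unlike HPA alone
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "5"       # target messages per replica
        awsRegion: us-east-1
```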
What are the limitations of the Cluster Autoscaler in Kubernetes?
The open-source Cluster Autoscaler has limited support for regional instance groups and can handle up to 1,000 nodes, with each node supporting up to 30 pods. The cooldown period for scaling down is 10 minutes, and scaling down is limited to 10 nodes at a time. It will not scale down nodes running pods with pod disruption budgets or local storage, and scaling up is not performed when a pod has a node selector set.
How do you configure HPA and VPA for stateful workloads?
It is generally not recommended to use HPA for stateful workloads. For VPA, configuration involves benchmarking the application, determining optimal resource settings, tuning metric server refresh intervals, and optimizing node types and cluster capacity. Thorough baselining is essential for effective VPA deployment in stateful environments.
What other factors should operators consider for Kubernetes performance management?
Operators should carefully set CPU and memory limits for containers. Setting requests too low and limits too high can cause pods to be terminated if the node cannot provide the required resources. Proper metric selection and resource allocation are critical for stable and efficient Kubernetes deployments.
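For illustration, a container spec with a deliberate gap between requests and limits looks like the sketch below; the values are arbitrary. The scheduler places the pod based on requests alone, so a container that bursts toward its limits on a full node risks CPU throttling or, for memory, an OOM kill.

```yaml
# Requests vs. limits sketch (values are illustrative). The scheduler
# only reserves the requests; usage above the memory limit is OOM-killed.
containers:
  - name: app
    image: example/app:latest   # placeholder image
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
```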
Why do organizations tend to overprovision resources in Kubernetes?
Organizations often overprovision resources due to concerns about performance issues and the desire to avoid disappointing users. This leads to higher costs and resource wastage, as highlighted by industry reports showing that companies waste an average of 35% of their cloud expenditure.
What is the average utilization of CPU and memory in Kubernetes deployments?
According to a Datadog study, the median Kubernetes deployment utilizes only 20-30% of requested CPU and 30-40% of requested memory, indicating significant overprovisioning and inefficiency in resource allocation.
What is the cost impact of resource wastage in cloud environments?
Resource wastage in cloud environments can be significant, with companies wasting an average of 35% of their cloud expenditure. This inefficiency is often due to overprovisioning to avoid performance issues, leading to unnecessary financial losses.
Why is there a need for autonomous systems in Kubernetes scaling?
Due to the significant overhead and complexity involved in configuring and maintaining autoscalers, there is a need for autonomous systems that can intelligently manage scaling decisions, reduce manual intervention, and optimize resource usage without constant human oversight.
What are the main challenges with using HPA and VPA together?
HPA and VPA cannot be used together because their scaling behaviors may conflict. For example, HPA adjusts the number of pods, while VPA changes resource allocations for individual pods. Using both simultaneously can lead to unpredictable scaling outcomes.
What is the recommended approach for scaling stateful applications in Kubernetes?
For stateful applications, it is recommended to use VPA in recommendation mode to analyze resource usage and make informed adjustments. HPA is generally not suitable for stateful workloads due to the complexity of horizontal scaling for such applications.
How does Sedai's autonomous cloud management platform address Kubernetes autoscaler limitations?
Sedai's autonomous cloud management platform eliminates manual configuration overhead by using machine learning to optimize cloud resources for cost, performance, and availability. It proactively resolves issues, continuously learns from outcomes, and provides full-stack coverage across AWS, Azure, GCP, and Kubernetes, addressing the limitations of traditional autoscalers.
What are the key features of Sedai's autonomous cloud optimization platform?
Sedai offers autonomous optimization, proactive issue resolution, full-stack cloud coverage, release intelligence, enterprise-grade governance, and plug-and-play implementation. It supports Datapilot (observability), Copilot (one-click optimizations), and Autopilot (fully autonomous execution) modes, and integrates with major cloud providers and Kubernetes environments.
How does Sedai help reduce cloud costs and improve performance?
Sedai reduces cloud costs by up to 50% through autonomous optimization and rightsizing, and improves performance by reducing latency by up to 75%. It automates routine tasks, proactively resolves issues, and ensures optimal resource utilization, as demonstrated by customer success stories like Palo Alto Networks and Belcorp.
What integrations does Sedai support for Kubernetes and cloud environments?
Sedai integrates with monitoring and APM tools (CloudWatch, Prometheus, Datadog, Azure Monitor), Kubernetes autoscalers (HPA/VPA, Karpenter), Infrastructure as Code and CI/CD tools (GitLab, GitHub, Bitbucket, Terraform), ITSM tools (ServiceNow, Jira), notification tools (Slack, Microsoft Teams), and various runbook automation platforms.
What security and compliance certifications does Sedai have?
Sedai is SOC 2 certified, demonstrating adherence to stringent security requirements and industry standards for data protection and compliance. For more details, visit Sedai's Security page.
How easy is it to implement Sedai for Kubernetes optimization?
Sedai offers a plug-and-play implementation that takes just 5 minutes for general use cases and up to 15 minutes for specific scenarios like AWS Lambda. The platform connects securely using IAM, requires no agents, and provides comprehensive onboarding support, documentation, and a 30-day free trial.
Who can benefit from using Sedai for Kubernetes and cloud optimization?
Sedai is designed for platform engineers, IT/cloud operations teams, technology leaders (CTO, CIO, VP Engineering), site reliability engineers (SREs), and FinOps professionals in organizations with significant cloud operations across industries such as cybersecurity, IT, financial services, healthcare, travel, and e-commerce.
What business impact can customers expect from using Sedai?
Customers can expect up to 50% reduction in cloud costs, 75% latency reduction, 6X productivity gains, and up to 50% fewer failed customer interactions. Notable customers like Palo Alto Networks saved $3.5 million, and KnowBe4 achieved 50% cost savings in production. For more details, see Sedai's resources page.
How does Sedai compare to traditional Kubernetes autoscalers?
Unlike traditional autoscalers that require manual configuration and operate reactively, Sedai provides 100% autonomous optimization, proactive issue resolution, and application-aware intelligence. It continuously learns from outcomes, optimizes across the full stack, and delivers measurable cost and performance improvements without manual intervention.
What customer feedback has Sedai received regarding ease of use?
Customers highlight Sedai's quick setup (5–15 minutes), agentless integration, personalized onboarding, and extensive support resources. The 30-day free trial and dedicated Customer Success Manager for enterprise customers contribute to positive feedback on ease of use and adoption.
What are some real-world success stories of Sedai customers?
KnowBe4 achieved 50% cost savings and saved $1.2 million on AWS bills. Palo Alto Networks saved $3.5 million and reduced Kubernetes costs by 46%. Belcorp reduced AWS Lambda latency by 77%. For more, see Sedai's case studies.
What industries are represented in Sedai's case studies?
Sedai's case studies cover cybersecurity (Palo Alto Networks), IT (HP), financial services (Experian, CapitalOne Bank), security awareness training (KnowBe4), travel (Expedia), healthcare (GSK), car rental (Avis), retail/e-commerce (Belcorp), SaaS (Freshworks), and digital commerce (Campspot).
Where can I find technical documentation for Sedai?
Sedai provides detailed technical documentation on its documentation page, covering features, setup, and usage. Additional resources, including case studies and datasheets, are available on the resources page.
What pain points does Sedai solve for Kubernetes and cloud users?
Sedai addresses cost inefficiencies, operational toil, performance and latency issues, lack of proactive issue resolution, complexity in multi-cloud environments, and misaligned priorities between engineering and FinOps teams. It automates optimization, reduces manual tasks, and aligns business and technical goals.
What makes Sedai different from other cloud optimization solutions?
Sedai differentiates itself with 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack coverage, release intelligence, and rapid plug-and-play implementation. It delivers measurable ROI and productivity gains, and is trusted by leading enterprises across multiple industries.
How does Sedai ensure safe and compliant cloud optimization?
Sedai uses safety-by-design principles, including constrained, validated, and reversible optimizations. It integrates with Infrastructure as Code (IaC), ITSM, and compliance workflows, and is SOC 2 certified for data protection and compliance assurance.
What modes of operation does Sedai offer for cloud optimization?
Sedai offers three modes: Datapilot (observability), Copilot (one-click optimizations), and Autopilot (fully autonomous execution). This flexibility allows organizations to choose the level of automation that fits their operational needs.
How does Sedai support release quality and risk management?
Sedai's Release Intelligence feature tracks changes in cost, latency, and errors for each deployment, ensuring smoother releases and minimizing risks. This helps teams maintain high release quality and reduce the likelihood of errors impacting production.
What support resources are available for Sedai users?
Sedai provides detailed documentation, a community Slack channel, email and phone support, and personalized onboarding sessions. Enterprise customers receive a dedicated Customer Success Manager for tailored assistance throughout the adoption process.
Using Kubernetes Autoscalers to Optimize for Cost and Performance
Pooja Malik
Content Writer
August 22, 2022
Introduction
We will explore the key role of autoscaling in optimizing performance and cost within Kubernetes, a popular container orchestration platform. Specifically, we'll delve into two critical autoscalers—Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA)—and shed light on their functionalities, benefits, and limitations. This is based on a talk I gave at our annual conference, autocon/22. You can view the video here.
The Problem
In today's digital landscape, ensuring application scalability, reliability, and high availability is crucial for companies aiming to meet the demands of ever-increasing traffic. However, many organizations find themselves facing uncertainties about whether their current infrastructure can handle exponential growth. Fluctuating app traffic patterns throughout the day, coupled with the occasional need for substantial batch processing, present challenges that require careful resource management. To effectively meet these dynamic traffic demands, companies often consider allocating additional resources. However, this approach can lead to excessive spending during non-busy hours, resulting in resource wastage. Nevertheless, this wastage is a preferable alternative to experiencing detrimental outages that can tarnish a company's reputation.
Overprovisioning and Underutilization Challenges
According to a report by Gartner, global cloud spending is projected to surpass a staggering $482 billion in 2022. As cloud adoption continues to rise, so does the issue of wastage. Flexera's 2021 State of the Cloud Report reveals that, on average, companies waste approximately 35% of their cloud expenditure. Additionally, a study conducted by Datadog highlights that the median Kubernetes deployment only utilizes 20 to 30% of requested CPU and 30 to 40% of requested memory. This not only results in financial losses but also poses potential security risks. So, why do organizations tend to overprovision? Most commonly, they overprovision due to concerns about performance issues and a desire to avoid disappointing their users.
The Cost of Wastage in Cloud Environments
As cloud adoption continues to rise, the issue of wastage becomes more prevalent. Reports indicate that companies waste an average of 35% of their cloud expenditure, and Kubernetes deployments often utilize only a fraction of requested CPU and memory resources. Overprovisioning is commonly driven by concerns about performance issues and a desire to avoid disappointing users.
Auto Scaling as a Solution for Performance and Cost Optimization
As many of you may have already guessed, the solution for optimizing cost and performance lies in auto scaling. According to AWS, auto scaling is the primary pillar for achieving this goal: it monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
In Kubernetes, there are two approaches to scaling. First, you can scale at the node level, which involves adding more nodes for horizontal scaling or changing the instance types for vertical scaling. Second, there is pod-level scaling, which can be achieved using the Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). With HPA, you can increase the number of pods, while VPA allows you to add more resources to the pods.
Horizontal Pod Autoscaler (HPA) for Dynamic Pod Adjustment
The Horizontal Pod Autoscaler (HPA) adjusts the number of pods dynamically based on computational need. By monitoring metrics and updating pod replicas within the Deployment or ReplicationController, the HPA adds or removes pods according to traffic demand. It primarily focuses on CPU and memory metrics and can also accommodate custom metrics.
A pod can be visualized as an application instance, consisting of one or more containers that function as a cohesive unit. The Horizontal Pod Autoscaler (HPA) plays a crucial role in managing these pods. When computational needs increase due to traffic demand, the HPA adds pods dynamically. Conversely, it removes pods when resource usage decreases. Although the HPA primarily relies on CPU and memory metrics, it can also accommodate certain custom metrics.
Let's dive deeper into how the Horizontal Pod Autoscaler operates. The HPA polls metrics every 30 seconds. Once a predefined threshold is reached, it updates the number of replicas within the Deployment or ReplicationController, resulting in the addition or removal of pods. A cooldown period of approximately three to five minutes follows each scale-up or scale-down operation. However, it's important to note that the HPA's effectiveness depends on the availability of resources and space within the cluster. If there is insufficient capacity, additional pods cannot be scheduled.
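The cooldown described here can be tuned via the optional `behavior` field that slots under an `autoscaling/v2` HPA's spec, sketched below; 300 seconds is the default scale-down stabilization window, and the pod-removal policy values are illustrative.

```yaml
# Optional behavior block on an autoscaling/v2 HPA. 300s is the default
# scale-down stabilization window; the pod-removal policy is illustrative.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # wait ~5 min before scaling down
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60             # remove at most one pod per minute
  scaleUp:
    stabilizationWindowSeconds: 0     # react to traffic spikes immediately
```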
To address limitations related to resource availability within the cluster, the HPA can be complemented with a cluster autoscaler. The cluster autoscaler adds more nodes and resources to the cluster, ensuring effective pod scheduling and optimal resource utilization.
Let's delve into how the cluster autoscaler operates. It examines the status of pending pods every 10 seconds. When it detects a pod in a pending state, it initiates communication with the cloud provider, which attempts to allocate a node to accommodate the pending pod. Once the node is successfully allocated, it joins the cluster and can host pods. The Kubernetes scheduler then assigns the pending pods to the newly added node.
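The intervals and limits described in this article map onto cluster-autoscaler startup flags. The fragment below is a sketch of the relevant container args from the autoscaler's own Deployment; the image tag is an example, and the flag values shown are the commonly documented defaults, so verify them against your autoscaler version and cloud provider.

```yaml
# Sketch of cluster-autoscaler args (verify flags/defaults for your version).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2  # example tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scan-interval=10s              # how often pending pods are checked
      - --scale-down-unneeded-time=10m   # cooldown before removing a node
      - --max-empty-bulk-delete=10       # max empty nodes removed at once
```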
Vertical Pod Autoscaler (VPA): Scaling Resource Requests and Limits
While HPA and cluster autoscaler are effective for horizontal scaling, not all applications can easily scale in that manner. Stateful applications like databases often face challenges when it comes to horizontal scaling, as adding new pods can be complex. These applications often require techniques like sharding or read-only replicas. However, it is possible to improve their performance by adjusting CPU and memory resources.
This is where the Vertical Pod Autoscaler (VPA) comes into play. VPA focuses on scaling the resource requests and limits of individual pods. If an application is consistently underutilizing its allocated resources, VPA will scale down the requests. Conversely, if an application is pushing against or exceeding its allocation, VPA will scale the requests up.
Let's explore how the Vertical Pod Autoscaler functions. VPA monitors metrics at regular intervals, typically every 10 seconds. When a predefined threshold is reached, VPA updates the resource specifications in the Deployment or ReplicationController. However, it's important to note that Kubernetes does not currently support in-place replacement of pod resources, which means pods may need to be restarted when adjustments are made. Like HPA, VPA incorporates a cooldown period of three to five minutes for scale-up and scale-down operations.
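When moving beyond recommendation mode, per-container bounds in the VPA's `resourcePolicy` (a fragment that slots under the VPA spec) keep its adjustments inside limits you choose. The sketch below uses illustrative values; note that VPA's own default minimum memory recommendation is 250 MB.

```yaml
# Sketch of VPA resourcePolicy bounds (values are illustrative).
resourcePolicy:
  containerPolicies:
    - containerName: "*"       # apply to all containers in the target pods
      minAllowed:
        cpu: "100m"
        memory: "256Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"
```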
When to use HPA vs VPA
Determining the appropriate use cases for HPA and VPA is a key consideration. HPA proves most effective for stateless applications that can scale horizontally with ease. It is also well-suited for applications that experience regular fluctuations in demand, such as seasonal variations. However, to maximize cost savings during off-peak hours, it is advisable to utilize HPA alongside a cluster autoscaler. This combination allows for efficient resource allocation and monetary optimization.
On the other hand, VPA serves as the ideal solution for stateful applications, particularly those involving databases. VPA's functionality is relatively new in the Kubernetes ecosystem, and many companies opt to employ it in recommendation mode rather than auto-update mode. This approach enables organizations to gain insights into an application's resource utilization profile without making automatic adjustments.
By discerning the specific characteristics and requirements of your applications, you can determine whether HPA or VPA is the more suitable choice, ensuring optimal scalability and resource management.
Kubernetes Autoscaler Limitations
While HPA and VPA may appear to be promising solutions, the reality is that only a limited number of companies are utilizing them in their production environments. This is primarily due to various limitations associated with these Kubernetes autoscalers.
One significant limitation is the considerable overhead involved in configuring HPA and VPA correctly. This process entails identifying the appropriate auto scaling metrics from a plethora of options. Additionally, benchmarking and profiling are necessary to determine the optimal configuration values for your code. Fine-tuning the metric server refresh intervals, HPA refresh intervals, and VPA refresh intervals is also essential. Moreover, careful consideration must be given to cluster capacity, node architecture, and size, all of which need to be repeated for each iteration. This overhead poses a significant burden for administrators and users.
Another issue arises if something goes wrong during scaling, such as an increase in error rates. Unfortunately, HPA and VPA lack reinforcement learning capabilities, meaning they will continue to scale up or down even in the presence of increased error rates.
Furthermore, most companies are aware of their application's seasonality and desire proactive scaling to optimize efficiency. However, Kubernetes does not currently offer predictive scaling capabilities, leaving companies without a means to anticipate workload fluctuations.
It is worth noting that HPA and VPA cannot be used together, as they may conflict with each other's scaling behavior. Additionally, there are specific limitations for each autoscaler. VPA necessitates the presence of at least two healthy pod replicas, while HPA cannot scale down to zero. Moreover, VPA requires a minimum memory allocation of 250 MB by default, making it unsuitable for small applications. Furthermore, VPA cannot be used for pods that lack an owner, i.e., bare pods deployed outside of a Deployment or StatefulSet.
Cluster Autoscaler Limitation
Like HPA and VPA, the cluster autoscaler has its share of limitations. Firstly, the open-source version offers only limited support for regional instance groups. There are also restrictions on the number of nodes it can handle concurrently: currently up to a thousand nodes, with each node accommodating a maximum of 30 pods. Moreover, the cooldown period for scale-down is relatively long, set at 10 minutes. Scaling up is not performed when a pod has a node selector set. Scaling down is restricted to a maximum of 10 nodes at a time, and a node's CPU utilization must fall below roughly 20% before it is considered for scale-down.
Furthermore, there are challenges related to pods. If a pod disruption budget or local storage is associated with a pod, the cluster autoscaler will not scale down the node hosting it.
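Two of the blockers just mentioned can be seen directly in manifests. The sketch below shows a PodDisruptionBudget that leaves no eviction headroom (with two replicas and `minAvailable: 2`, no pod may ever be evicted) and, in a comment, the annotation that explicitly forbids eviction; the names and labels are illustrative.

```yaml
# PDB sketch: with 2 replicas and minAvailable: 2, scale-down is blocked.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: db
# A pod can also opt out of scale-down explicitly via an annotation:
#   metadata:
#     annotations:
#       cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```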
Conclusion
In summary, scaling is crucial for optimizing performance and cost, and Kubernetes offers a solid foundation for it. However, there is significant overhead for administrators in terms of choosing auto scaling metrics, architecting cluster capacity, and benchmarking applications for configuration decisions. This overhead is repeated for each iteration of the code, posing a considerable burden. Therefore, there is a need for a true autonomous system that can address these challenges.
In the upcoming articles, we will be talking more about autoscaling Kubernetes, cluster autoscaling, scaling tools, and event-driven autoscaling. To get a preview, check out the autocon/22 video here.
Q&A
Q: How do you configure HPA and VPA for stateful workloads?
A: It is recommended not to use HPA for stateful workloads. As for VPA, the configuration process is similar to that for other workloads. However, it requires thorough baselining, including benchmarking the application, determining the right configuration, fine-tuning metric server refresh intervals, and optimizing node instance types and cluster capacity.
Q: What is Keda and how does it relate to Kubernetes auto-scaling?
A: KEDA (Kubernetes Event-driven Autoscaling) is the prominent tool for this. With KEDA, you can trigger scaling and manage HPAs. It supports multiple event sources, such as SQS queues and Lambda events. Based on these events, you can trigger an HPA or create other scaling objects, even scaling from zero.
Q: Apart from scaling, what other factors should operators consider to manage performance for Kubernetes deployments?
A: Setting appropriate CPU and memory requests and limits is crucial. For example, if you set a low value in the request and a much higher value in the limit, the node may not be able to provide the required resources when the workload demands more, and in such cases the pod may be terminated. To avoid this, ensure you set the right metrics, requests, and limits for your containers.