Optimize compute, storage and data
Choose copilot or autopilot execution
Continuously improve with reinforcement learning
In the past few years, we have seen a huge increase in Kubernetes adoption. Kubernetes is an excellent platform that provides all the facilities to fine-tune your application and run it the way you need. At Sedai, we use Kubernetes to deploy our applications, along with serverless and some other managed services from AWS. One question we get a lot is “Is there a need for another system, an autonomous system, to manage Kubernetes?” Kubernetes is indeed a declarative system, and an excellent one: you specify what your application needs to run, and Kubernetes takes care of the rest. But it does so by giving its users all the bells and whistles required to customize and fine-tune how it runs, which means Kubernetes effectively hands the responsibility for using it well back to the users themselves. You can watch the original video here.
In general, for a cluster with around a hundred microservices and a release cycle of approximately two weeks, you are looking at around 18,000 different variable combinations to make sure you are running the applications in the best possible way, and at the best cost-benefit you need. That is almost impossible to handle manually. If you have listened to the previous presentations about the autoscalers, you know how many combinations of settings are available there, and if you get any of them wrong, you are going to have a big negative impact on availability or performance.
When you are trying to run Kubernetes in the best possible way, your high-level goals are to keep your applications highly available, to run them with the performance you need, and to minimize the cost of your whole cluster. The natural solution is to use autoscalers. We already saw how you can use HPA/VPA to autoscale your application; to do the same at the cluster level, you need some kind of cluster-level autoscaler. The way you go about it is by looking at resource usage from the application, observing the application's behavior, coming up with thresholds, and autoscaling at the pod level. There are different ways of handling it, but most cluster autoscalers look at unschedulable pods: you look at what the application requires and supplement it with infrastructure.
What are the steps needed to automate Kubernetes and make sure it is running in the best possible way?
As a first step, when we pick something for automation, there are a few attributes we look at when choosing candidates. Good candidates are repetitive tasks that are tedious, time consuming, and naturally error prone. Most importantly, they are tasks that do not change: once you set them up, you don't need to worry about them at all; they just keep running like clockwork. Some common examples we have seen are build automation, CI/CD, and some levels of testing.
To successfully automate Kubernetes and ensure optimal performance and availability, several crucial aspects need to be considered. Expanding upon the attributes we previously discussed, the initial step involves examining application behavior. Gaining insights into how an application functions is achieved by analyzing relevant metrics.
The next important factor is determining the appropriate thresholds for these metrics. By establishing suitable thresholds, adjustments can be made to the number of pods allocated to an application. If additional resources are required, pod numbers can be increased, while in cases where they are not needed, pod numbers can be reduced. At the cluster level, it becomes necessary to modify the underlying infrastructure that supports application requests. Sometimes, the default functionalities provided by Kubernetes may not be sufficient, necessitating the implementation of custom scripted remediations. These remediations are tailored to specific attributes and aid in resolving issues.
Furthermore, it is advisable to implement a notification system to ensure that relevant stakeholders remain informed about the system's status. Effective notifications enable timely awareness of any ongoing developments within the system.
To effectively manage availability and optimize performance, several key steps need to be taken. Let's walk through each of them:
Choosing the Right Metrics: When monitoring application performance, it's important to select metrics that are relevant to each specific application. Different applications have unique behaviors, so there are no universal metrics that apply to all cases. For example, some applications may benefit from monitoring saturation metrics, CPU usage, or memory utilization, while others prioritize performance over resource usage. Additionally, factors like network usage and IOPS can also impact application behavior and should be considered.
Setting Suitable Thresholds: Even if two applications use the same metrics, their behaviors can differ based on different performance thresholds. Therefore, it is crucial to establish appropriate thresholds for each application. As Kubernetes clusters often consist of diverse workloads, it's essential to allocate the right types of nodes to optimize performance and minimize costs.
Tailored Remediation Approaches: There is no one-size-fits-all solution for fixing issues. Different situations and applications may require different approaches. The objective is to ensure smooth operation rather than immediate bug fixing. By addressing underlying issues and implementing appropriate remediation strategies, you can maintain the desired performance levels.
Effective Alerts & Notifications: Timing is critical when it comes to sending notifications. Sending an excessive number of notifications can lead to them being ignored or dismissed, while insufficient notifications may result in availability problems going unnoticed. It's important to strike the right balance and send timely notifications that provide actionable information.
In the context of optimizing performance, we need to consider the constraints and adjustable variables involved to balance cost containment and performance levels. Two potential solutions are Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA), which require configuring the appropriate metrics and thresholds to achieve the desired outcomes.
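As a concrete illustration, a minimal Vertical Pod Autoscaler manifest might look like the sketch below. This is not from the talk: the workload name, resource bounds, and update mode are assumptions, and it presumes the VPA components are installed in the cluster.

```yaml
# Hypothetical VPA sketch: lets the autoscaler adjust CPU/memory requests
# for a Deployment assumed to be named "checkout", within the given bounds.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Auto"        # VPA may evict pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```

Even in this small example, the bounds and the update mode are choices that depend on how the application behaves, which is exactly the tuning burden being described here.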
Another significant aspect to consider is traffic. As you may have already learned from previous sessions, traffic can have a profound impact on application performance. The same application may exhibit different behavior under varying traffic loads. Seasonality is a crucial factor to consider when optimizing for both performance and cost reduction. Furthermore, the rapid pace of application releases, facilitated by technologies like Kubernetes and various cloud provider services, adds complexity to the optimization process. Even for non-microservices applications, it is common to have an average release cycle of approximately two weeks. The behavior of an application can undergo significant changes from one version to another, making it essential to adapt optimization strategies accordingly.
When it comes to the cluster environment, if you are utilizing cloud products such as Amazon Elastic Kubernetes Service (EKS), you are presented with a multitude of infrastructure options. This abundance of choices is advantageous, but it also presents the challenge of selecting the most appropriate one. In the case of AWS, there are additional choices available to help reduce costs, such as utilizing spot instances instead of on-demand instances. Spot instances are suitable for applications that can tolerate restarts and are less impacted by infrastructure failures. Thus, careful consideration is required to determine the optimal utilization of these options.
In the image below, you will see two configurations. They are both Horizontal Pod Autoscaler configurations. Look at the highlighted attributes: those are the ones that need to be configured on a per-application and per-revision basis.
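For readers who cannot see the screenshots, a minimal HPA manifest of the kind being compared might look like the following sketch. The workload name and the numbers are illustrative assumptions, not values from the talk.

```yaml
# Hypothetical HPA sketch: the per-application, per-revision attributes are
# minReplicas, maxReplicas, and the CPU averageUtilization target.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # one release may run well at 60%; the next may need a different target
```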
The visual representation illustrates the HPA configuration for the same application, highlighting the disparity between two different versions. Despite utilizing the same metric, namely CPU, for HPA, the average utilization varies due to changes in application behavior. These complexities necessitate careful consideration to attain the optimal combination and ensure the smooth operation of the application. The chart depicted in the bottom right corner represents the problem we are endeavoring to solve.
This chart demonstrates the application's latency under a consistent traffic workload. The X-axis represents the number of pods recorded during the measurements, while the blue line depicts the latency decreasing as the number of pods increases, indicating a responsive scaling process. However, beyond a certain point, the latency improvement becomes marginal, plateauing at around 3.5 milliseconds. On the other hand, the cost associated with additional resources exhibits a nearly linear increase as more pods are added. Taking into account the cost budget and the performance Service Level Agreement (SLA), the shaded area represents the critical considerations for achieving the desired outcome.
Determining the optimal configuration that strikes a balance between performance and cost is paramount. The black marker in this illustration represents the ideal position for this specific application. The multitude of combinations and intricacies involved in achieving this alignment underscore the complexity of the task. Failing to achieve the correct balance can have a cumulative impact on both application performance and cost.
In managing HPA and VPA for 100 services, the challenge lies in effectively addressing the unique operational characteristics of each application. Some applications may benefit from increased resource allocation, while others may respond better to scaling the number of pods instead of augmenting CPU capacity. Thus, it's crucial to assess and address these attributes accordingly.
Regarding metrics, a conservative approach will consider three potential combinations: CPU, memory, and a performance metric. Many applications have varying traffic patterns, so let's assume three average traffic patterns. Additionally, there are five infrastructure options (although the actual number may be larger depending on the cloud provider, such as AWS). This results in approximately 18,000 unique combinations to navigate through each month, emphasizing the need to ensure optimal application performance and cost-effectiveness.
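To make the arithmetic concrete, one plausible way these figures combine (an interpretation, since the talk does not spell it out): 100 services × 2 autoscaler configurations (HPA and VPA) × 3 metric choices × 3 traffic patterns × 5 infrastructure options comes to 9,000 combinations per release, and with roughly two releases a month that lands on the order of 18,000 combinations.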
The inherent complexity of managing these intricate combinations renders manual handling virtually impractical. Even with the implementation of automated processes, the dynamic nature of the variables involved makes achieving optimal outcomes exceedingly challenging. This is precisely where the integration of an autonomous system becomes invaluable.
Let's revisit the crucial attributes we discussed earlier, which are essential for fine-tuning applications and optimizing costs. The first attribute pertains to metrics observation, and the challenge lies in determining the most relevant metrics. An autonomous system should seamlessly integrate with monitoring providers, comprehensively analyze application behavior, and intelligently select the appropriate metrics that accurately represent the application's performance.
Furthermore, by continuously monitoring and analyzing the application's behavior, the system can dynamically establish thresholds. These thresholds may either be perceived as threshold-free, as they continually adapt, or as thresholds tailored to each application release. The autonomous system's understanding of application behavior becomes instrumental when selecting the most suitable infrastructure for the cluster. This intelligent decision-making process ensures that the chosen infrastructure aligns with the specific requirements of each individual application.
Moreover, not all applications warrant the same remediation measures. Customized remediation solutions may be necessary, and their applicability can vary across different contexts. An autonomous system should possess the capability to identify the appropriate remediation approach for a given situation. Even when functioning autonomously, the system should recognize instances where human intervention is necessary and promptly notify users accordingly.
In summary, while an autonomous system is expected to operate independently, it should also possess the capability to involve users when human intervention is essential.
In addition to the points we have previously discussed, there are a few more aspects to consider regarding an autonomous system. Once an autonomous system comprehends the behavior of an application, it gains the ability to predict how the application will perform under different traffic loads or patterns. By identifying early indicators, the system can proactively anticipate and resolve potential issues before they even occur.
While Horizontal Pod Autoscaler (HPA) configurations and similar mechanisms can be adjusted to be more proactive and preempt issues, they primarily function in a reactive manner, responding after an event has taken place. In contrast, an autonomous system, leveraging machine learning or similar techniques, has the potential to anticipate and predict events, making it a proactive solution.
Implementing an autonomous system that operates on behalf of users necessitates the implementation of robust safety checks to ensure system stability and prevent any negative impact or degradation. Emphasizing rigorous safety measures is crucial to maintain a reliable and secure environment. While autonomous systems offer valuable capabilities, it is important to recognize their limitations. They cannot replace every human function, and certain tasks and scenarios surpass their capabilities. Therefore, inherent safety should be a priority for autonomous systems, focusing only on actions they can reliably execute autonomously.
One of the primary responsibilities of an autonomous system is to recognize and acknowledge its own limitations. Situations may arise that exceed the system's capabilities and require human intervention. In such cases, the autonomous system must promptly notify the user and defer to human expertise. This collaborative approach ensures that the system operates within its known boundaries while complex and unique challenges are still addressed effectively.
Furthermore, when entrusting an autonomous system to manage a cluster or other systems, having comprehensive configuration capabilities is essential. For example, in the context of Kubernetes, the system should provide a robust configuration interface that allows users to define high-level goals and objectives. This empowers users to align the actions of the autonomous system with their specific requirements and preferences. As the autonomous system continuously manages the Kubernetes cluster, it should incorporate feedback and analyze the effects of its actions to enhance its learning process. By assessing the outcomes and consequences of its decisions, the system can continually improve its performance, adapt to changing conditions, and refine its decision-making algorithms.
Understanding the limitations of an autonomous system is crucial. It should operate within its capabilities and involve human intervention when necessary. The system should offer configuration capabilities to align its actions with user-defined goals, and it should leverage learning mechanisms to enhance its performance and effectiveness over time. By embracing these principles, an autonomous system can operate safely, efficiently, and in harmony with human expertise.
In managing availability in Kubernetes or any other cloud environment, an autonomous system follows a high-level process. It connects to the infrastructure provider or the Kubernetes control plane API, understands the topology, identifies applications, discovers metrics, correlates them with applications, and infers application behavior. This comprehensive understanding allows the system to manage availability effectively.
Once application behavior is understood, the system should be able to predict what is going to happen under different situations, recognize early indicators in the metrics, and proactively fix issues for you. At the same time, not every issue can be predicted in advance, so the system should also be able to detect issues as they happen in real time and make adjustments accordingly.
When considering Kubernetes, there are often questions regarding availability and the role of an autonomous system in addressing these concerns. An illustrative example pertains to resource allocation within Kubernetes, where the specified limits and requests can significantly impact application performance. For instance, if a user allocates insufficient memory limits, the application may encounter out-of-memory kills. Similarly, incorrect CPU thresholds can lead to throttling issues. In such scenarios, an autonomous system should possess the capability to autonomously identify and rectify these problems by reconfiguring the resource allocations, thereby mitigating the occurrence of these issues.
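As a sketch of where such adjustments land, the relevant fragment of a Deployment's pod spec looks roughly like the excerpt below. The container name, image, and numbers are assumptions for illustration, not recommendations.

```yaml
# Hypothetical excerpt from a Deployment pod spec: requests and limits are
# the knobs whose misconfiguration shows up as OOM kills or CPU throttling.
containers:
  - name: checkout
    image: example/checkout:1.4.2
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"            # too tight a CPU limit surfaces as throttling
        memory: 512Mi       # too tight a memory limit surfaces as OOMKilled restarts
```

An autonomous system that has learned the workload's real usage can raise or lower these values with each release instead of leaving them fixed.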
One of the key elements of an autonomous system is its ongoing monitoring and evaluation process. It is not enough for the system to simply implement changes and disengage. Instead, it should continuously observe and assess the outcomes of its actions. When a user performs an action or makes a change in their cluster, they expect the autonomous system to retrospectively analyze the effectiveness of that action. Did it achieve the intended results? Did it have any negative impacts on the system? While it is ideal to prevent issues from occurring, if a problem arises, the system should promptly notify the user for assistance. Additionally, if the observed outcomes do not align with the expected results, the system should be capable of adapting its actions through the use of reinforcement learning techniques.
In the realm of autonomous optimization, the process shares similar principles and steps with availability management. The initial stages involve discovering the topology and gaining a deep understanding of the application behavior. It is also crucial for the system to regularly evaluate the effectiveness of its actions and adjust accordingly: if it determines that a particular action is unlikely to yield the desired results, it should reset and incorporate the learnings from that execution into its future actions. These activities must be performed on an application-by-application basis and for each release, and in the case of availability management they are carried out continuously, 24/7, to ensure reliable and uninterrupted operation.
That is the basis of how an autonomous system should work. Sedai is an autonomous application management system built on the foundational principles we just went over. It talks to your infrastructure provider and any other available APIs, understands your application behavior, and works to keep your availability at the highest possible level. The image above shows some recommendations that Sedai is putting out: in this case, it is seeing high container CPU usage and recommending that the CPU limit be adjusted accordingly.
Sedai now also has cluster management, which takes into account all the inferences drawn from the applications and brings that intelligence to managing your cluster, making sure the infrastructure in use is the best possible fit.
Refer to the image below.
These are some of the application-specific recommendations and autonomous actions the system has taken. In this case, you can see how much performance improvement the system is able to bring just by reconfiguring, and the cumulative cost improvement is shown at the top. At the bottom, you see similar recommendations and actions taken at the cluster level, bringing that application awareness into cluster-level optimization.
One of our principles at Sedai is to make life easier for our customers, so we want using Sedai to be effortless. It's super simple.
One thing we always strive for is that onboarding Sedai should be seamless, and using Sedai should be seamless. We work in every possible way to make it easy to configure Sedai to monitor your cluster. The image above shows the screen where you add an AWS account to Sedai: three or four fields, which can be obtained from your EKS setup, and Sedai is able to monitor your Kubernetes cluster.
We know that customers like flexibility, and our approach is to be as non-intrusive as possible. The approach we take is to connect to your cloud provider and understand your topology entirely through APIs. We have multiple options. We recommend the agentless option, where we connect to the Kubernetes control plane API over the network, using a private link, the public network, or VPC peering. At the same time, we understand that some companies have security requirements that prevent exposing APIs over the network, and for them we have an agent-based solution as well.
As mentioned earlier, we want our system to be highly configurable. We want users to be able to set high-level goals, or to fine-tune and specifically configure applications on an individual workload basis. Configuration options are available at the account level and at the individual application or workload level, or you can create custom groups and set goals there. Similar to how Kubernetes works, Sedai will make sure your goal is met on a continuous basis.
This is a high-level architecture of Sedai, which you might have already seen in previous blogs. It shows the different systems we connect to: cloud providers, monitoring providers, your ticketing system, and notification systems. If you want to perform custom actions, we can also connect to StackStorm or Rundeck, where you can customize how the system acts.
Q: How do you ensure safety while performing any actions with Sedai?
A: When we built Sedai, we wanted to make sure the system is inherently safe, so we only act on things that can be safely executed. For example, if we identify an availability problem on an application that is not stateful, say a pod that is misbehaving and is safe to restart, we will go ahead and restart it. So inherently, we only pick safe actions. On top of these, we have multiple safety checks at different levels: when we are executing an action, we verify that what we expect is what we see on the system, and we make sure every step we take is safe to execute.
Q: How do you identify and populate the right metrics for an application?
A: For Kubernetes, there are multiple ways. We know the standard metrics that come out of different monitoring providers, and we have custom integrations with Datadog, New Relic, and Prometheus. On top of that, we connect to the monitoring provider, fetch all the metrics available for all the systems, and try to identify each metric automatically. At the same time, if a user has a very custom metric they want Sedai to use, they can always configure it.