This article covers KnowBe4’s experience applying autonomous optimization (both cost and performance optimization). It is based on part of the presentation “Mastering Autonomous Optimization for Amazon ECS” at autocon; the KnowBe4 portion was presented by Nate Singletary, Senior SRE at KnowBe4. You can see the full video here, and read the blog covering more detail on the optimization strategies for Amazon ECS behind KnowBe4’s case here.
KnowBe4 provides the world's largest security-awareness training and simulated phishing platform. KnowBe4 is used by over 34,000 organizations globally, and has the world's largest library of security awareness training content. KnowBe4 ranks at the top of many great places to work lists, and was ranked number one in Energage's top workplaces in the USA.
KnowBe4’s Amazon ECS workloads support a diverse suite of products ranging from security awareness tools to human detection response and more. KnowBe4 also integrates with their customers’ security stacks to provide real-time coaching to users in response to risky behavior. These products are spread across thousands of microservices, functions, and data stores, all deployed in AWS across several compute and data storage services, including Amazon ECS, as shown below:
KnowBe4’s platform architecture is straightforward. They commit all their code in GitLab, and their CI/CD runs in GitLab on GitLab runners. Their production workloads are deployed on AWS. Their monitoring, metrics, and alerting are in Datadog (which includes logs and metrics exported from Amazon CloudWatch). They don't use a wide range of vendors, and onboarding a new vendor is a rare event for the company.
KnowBe4 saw that they had an optimization “void” after their commit, deploy, and monitor workflow:
KnowBe4 had ECS services running in AWS and wanted to ensure they were running efficiently. The challenge was knowing whether they were in fact running efficiently and, if they were not, how to react and fix the issue.
KnowBe4 was using engineers to fill that void. The engineers had to respond to the feedback from the monitoring system including:
These issues could have been impacting customer experience, or could have meant that KnowBe4 was missing out on hundreds of thousands of dollars of cost savings across several services.
So the question was how KnowBe4 should fix it. The engineers had to commit code to update the ECS config, deploy it, and then wait for feedback from monitoring to see whether they had rightsized correctly. This was a continuous process across thousands of microservices and functions, and the manual feedback cycle was not ideal.
KnowBe4 decided to use Sedai’s autonomous optimization to fill this void. Sedai allowed them to reduce the toil on their engineers and automate the feedback loop of checking what impact a production change had on critical metrics.
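To make the manual rightsizing step concrete, here is a minimal sketch of the kind of calculation an engineer might do by hand before committing a new ECS CPU reservation. It is not Sedai's algorithm or KnowBe4's actual process: the `ServiceUsage` type, the 30% headroom, and the 128-unit rounding step are all illustrative assumptions. The only fixed fact used is that ECS expresses CPU in units where 1024 units equal one vCPU.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceUsage:
    """Hypothetical summary of one service's observed usage."""
    name: str
    reserved_cpu_units: int      # current ECS reservation (1024 units = 1 vCPU)
    p95_cpu_utilization: float   # p95 usage as a fraction of the reservation


def recommend_cpu_units(usage: ServiceUsage, headroom: float = 0.3, step: int = 128) -> int:
    """Return a reservation covering p95 usage plus headroom, rounded up to `step`.

    `headroom` and `step` are assumed values chosen for illustration.
    """
    needed = usage.reserved_cpu_units * usage.p95_cpu_utilization * (1 + headroom)
    return max(step, math.ceil(needed / step) * step)


# An over-provisioned service: 2 vCPU reserved, only 25% used at p95.
over = ServiceUsage("example-worker", reserved_cpu_units=2048, p95_cpu_utilization=0.25)
# An under-provisioned service: 1 vCPU reserved, 90% used at p95.
under = ServiceUsage("example-api", reserved_cpu_units=1024, p95_cpu_utilization=0.90)

print(recommend_cpu_units(over))   # smaller than the current 2048
print(recommend_cpu_units(under))  # larger than the current 1024
```

Repeating this by hand for every service, then committing, deploying, and re-checking the metrics, is the loop that becomes impractical at the scale of thousands of services.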
The key drivers for KnowBe4 to move to an autonomous platform architecture were three-fold:
Reducing toil would also allow KnowBe4’s engineers to focus on the things they like to do: releasing new products and features.
KnowBe4 also wanted to make sure their workloads were running efficiently. That meant keeping up release velocity and keeping cost front of mind, while ensuring their services remained performant.
To achieve this, KnowBe4:
To adopt autonomous optimization with Sedai, KnowBe4 took a three-part Crawl, Walk, Run approach, as shown below:
In the first stage, Crawl, KnowBe4 set up the Sedai integration and set an initial goal of around 10% cost reduction.
At this stage, Sedai could analyze KnowBe4’s workloads, show where they were overprovisioned, and identify the opportunities for cost reduction or performance gains.
KnowBe4 then enabled autonomous optimization on a set of services. They were not “diving off the deep end” at that stage, as these were low-risk services. The goal was to see how these services reacted to the autonomous optimizations.
By the Walk stage, KnowBe4 had seen some evaluations showing opportunities for significant cost reduction and performance gains. They had also seen realized cost reductions and performance gains in the set of low-risk services they had enabled.
Impressed by these results, KnowBe4 decided to roll out Sedai more aggressively and to turn on autonomous optimization for their flagship products.
Before turning on autonomous optimization, KnowBe4 created groups. They divided these groups by product and region and set cost and performance goals tailored to each product. Some products have services that are more latency tolerant, and KnowBe4 set more aggressive cost reduction goals for those services. Other services need to maintain the highest levels of availability and performance, and KnowBe4 is less aggressive with those. Once the groups were set up and the goals defined, KnowBe4 turned on autonomous optimization for them.
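The grouping described above can be sketched as a small data model. This is an illustration only, not Sedai's configuration format or API: the `Service` type, the three goal labels, and the rule that a group takes the goal of its most conservative member are all assumptions made for the example, as are the product and service names.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical goal labels, ordered from most to least aggressive on cost.
GOALS = ["reduce_cost", "balanced", "protect_performance"]


@dataclass(frozen=True)
class Service:
    name: str
    product: str
    region: str
    latency_tolerant: bool
    high_availability: bool


def goal_for(service: Service) -> str:
    """Pick a goal for one service from its tolerance for latency and downtime."""
    if service.high_availability:
        return "protect_performance"
    if service.latency_tolerant:
        return "reduce_cost"
    return "balanced"


def build_groups(services):
    """Group services by (product, region); each group takes the goal of its
    most conservative member (an assumed policy for this sketch)."""
    members = defaultdict(list)
    for service in services:
        members[(service.product, service.region)].append(service)
    return {
        key: max((goal_for(s) for s in group), key=GOALS.index)
        for key, group in members.items()
    }


services = [
    Service("report-builder", "reporting", "us-east-1", latency_tolerant=True, high_availability=False),
    Service("report-export", "reporting", "us-east-1", latency_tolerant=False, high_availability=False),
    Service("login-api", "auth", "us-east-1", latency_tolerant=False, high_availability=True),
]
for key, goal in build_groups(services).items():
    print(key, goal)
```

Defining the goal at the group level means a newly released service inherits sensible defaults from its product and region rather than needing its own tuning.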
In the Run stage, KnowBe4 allowed Sedai to “take the wheel”. Services are autonomously optimized by default: if an engineer releases a service in ECS, it is automatically managed by Sedai.
Sedai is integrated across all of KnowBe4’s AWS accounts and manages services across all regions.
KnowBe4 is now working towards integrating Sedai into the CI/CD flow so that they will have a fully automated feedback loop.
With this, Sedai has filled KnowBe4’s optimization void.
Below is an example of the opportunities KnowBe4 has seen in Sedai across a group of accounts. Sedai is projecting cost savings of 27%, or over $400,000 in cloud spend, which KnowBe4 considers a significant reduction in cloud cost.
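As a back-of-envelope check on the scale involved, the two figures above imply a baseline spend for this group of accounts. The calculation below assumes the 27% and the $400,000 refer to the same spend over the same period, which the source does not state explicitly; the function name is made up for the example.

```python
def implied_baseline(savings_usd: float, savings_rate: float) -> float:
    """Baseline spend implied by a savings amount and the fraction it represents."""
    return savings_usd / savings_rate


# "Over $400,000" at 27% implies a baseline of at least roughly this much.
baseline = implied_baseline(400_000, 0.27)
print(f"${baseline:,.0f}")  # $1,481,481
```

In other words, the group of accounts in this example would account for roughly $1.5 million or more in cloud spend before optimization.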
Below is an example of an individual cluster with a 36% potential saving.
Below is another example showing some of the realized savings at KnowBe4. This shows some of the Lambda services where KnowBe4 has not only reduced cost by 30%, but also reduced duration and increased performance by 86%.
KnowBe4’s highlights now include:
KnowBe4 is now in the process of integrating these optimizations back into their CI/CD processes. Once that is complete, they will have a fully autonomous workflow in which the IaC remains the source of truth for KnowBe4’s configs.
April 14, 2024
November 20, 2024