AI-driven optimization delivers substantial cost savings and performance improvements while reducing engineer toil for AWS ECS and Lambda workloads
Cloud Cost Savings: 27%
Services Optimized Autonomously: 98%
Autonomous Actions in first 90 days: 1,100+
Payback Period: 5 months
Cloud Cost Optimization
Performance Improvement
Ops Productivity
Release Quality
Autonomous Optimization
Autonomous Remediation
Release Intelligence
Amazon ECS
AWS Lambda
AWS CloudWatch
Datadog
GitLab
Cybersecurity
North America
KnowBe4, a leader in security awareness training, experienced rapid growth that led to significant challenges in managing and optimizing their cloud infrastructure. As their customer base expanded to over 70,000 organizations worldwide, the company faced unprecedented scaling issues across their AWS environment.
Nate Singletary, Staff Site Reliability Engineer at KnowBe4, succinctly described the challenge they faced: "We have ECS services running in AWS, and we want to ensure they're running efficiently. How do we know if they are? And if they're not, how do we react to that? How do we fix it?"
Matthew Duren, Sr. Director of Software Engineering at KnowBe4, outlined the scale of their operations: "We have something like 70,000 customers across the world. So we're seeing tons of growth right now, especially internationally. The US environment is definitely our largest. Just from our day-to-day, we peak out at thousands and thousands of requests per second."
The company's infrastructure was growing fast. ECS usage rose 58% year over year, spanning over 3,000 services and 2,000 to 4,000+ peak tasks daily. Lambda usage grew even faster, up 422% year over year, across more than 2,500 functions processing over 250 million invocations daily.
This rapid expansion created a complex web of thousands of microservices across ECS and Lambda, with code deployments averaging one every 20 minutes. The frequent releases and the need for high performance in real-time cybersecurity delivery put immense pressure on the engineering team to optimize resources continually. At KnowBe4's scale, manual optimization was too time-consuming and inefficient to keep up.
Adding to the complexity, KnowBe4 is a heavy user of AI and ML in their workloads. Their AI-driven services include AIDA (Artificial Intelligence Driven Agent), which runs in the background to allow customers to use AI in their day-to-day platform usage. They also employ a Virtual Risk Officer that uses AI to assign risk scores to users based on their positions and behaviors. Other AI-powered features include PhishML for automated dispositioning of potential phishing emails, and various content selection tools. The sophisticated nature of these AI workloads added another layer of complexity to their infrastructure management needs.
Matt Duren emphasized the challenge of scaling their operations: "We sit and we try to scale our infrastructure, to scale our software, to build software that will scale to any number of users. But until now, until we started using Sedai, we really had no solution for actually scaling the people side of things, the team."
It became increasingly clear that manual optimization was not scalable. It prevented KnowBe4 from fully optimizing their cloud resources and from maintaining peak performance for their customers while managing costs effectively. These challenges led KnowBe4 to explore autonomous optimization solutions.
To address these multifaceted challenges, KnowBe4 decided to implement Sedai's autonomous optimization platform. They followed a carefully planned approach, which Matt Duren described as: "We took some steps to mitigate what his fears were, what our fears were, and decided we would start small, we would choose a specific set of services to apply this autonomous optimization to."
KnowBe4 adopted a phased implementation strategy, which they referred to as Crawl, Walk, Run:
1. Crawl:
- Set up the Sedai integration with their AWS environment
- Established initial cost-saving goals
- Enabled autonomous optimization on a set of low-risk services to evaluate the impact
2. Walk:
- Analyzed results from the initial optimization efforts
- Expanded Sedai's implementation to include flagship products
- Created groups divided by products and regions
- Set tailored goals for cost and performance based on service requirements
3. Run:
- Fully embraced Sedai's autonomous optimization across their infrastructure
- Configured services to be autonomously optimized by default upon deployment
- Integrated Sedai across all AWS accounts and regions
KnowBe4's optimization strategy covered three areas. For service optimization, they configured horizontal and vertical scaling for the best balance of cost and performance, fine-tuning memory, CPU, and task counts. For container instance optimization, they selected instance types on an application-aware basis, factoring in app-level latency. For purchasing options, they identified the most cost-effective combination of On-Demand capacity and Savings Plans based on predicted traffic patterns.
Risk mitigation was a crucial aspect of the implementation. KnowBe4 took a cautious approach, starting with a small subset of services in production, where optimizations could be validated against real traffic, and gradually expanding to dev/test environments. Matt Duren commented on this strategy: "We wanted to start with something not dev, not test, not a lower environment where there's not real traffic coming to it, even if it's heavily, heavily, heavily tested by a software engineering test team or QA team. We still knew we wanted to serve production traffic through a service that was being optimized by Sedai."
To ensure smooth adoption, KnowBe4 integrated Sedai into their CI/CD process, creating a fully autonomous workflow. They also implemented release evaluation to automatically assess the impact of new deployments on cost, performance, and availability. This approach helped align the development teams with the new autonomous optimization paradigm.
Sedai was deployed to optimize KnowBe4's ECS Fargate workloads and Lambda functions, autonomously rightsizing services, adjusting auto-scaling configurations, and managing resource allocation. The platform utilized AI and machine learning techniques to analyze service behavior, predict resource needs, and make real-time adjustments to optimize both cost and performance.
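To make the idea of rightsizing concrete, here is a minimal sketch in Python of how a task size could be chosen from observed usage. It is an illustration only and not Sedai's algorithm: the function name `recommend_task_size`, the 20% headroom default, and the input figures are all assumptions. The CPU and memory pairs reflect the Fargate task sizes AWS documents for up to 4 vCPU and should be checked against the current AWS documentation.

```python
# Hypothetical rightsizing sketch; not Sedai's actual method.
# Fargate task sizes up to 4 vCPU: CPU units -> allowed memory values (MiB).
# Verify against the current AWS documentation before relying on this table.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def recommend_task_size(peak_cpu_units, peak_memory_mib, headroom=0.2):
    """Return the smallest (cpu, memory) pair covering peak usage plus headroom.

    Returns None when no size in the table is large enough.
    """
    need_cpu = peak_cpu_units * (1 + headroom)
    need_mem = peak_memory_mib * (1 + headroom)
    for cpu in sorted(FARGATE_SIZES):
        if cpu < need_cpu:
            continue  # not enough CPU at this tier
        for mem in FARGATE_SIZES[cpu]:
            if mem >= need_mem:
                return cpu, mem
        # No memory option fits at this CPU tier; try the next tier up.
    return None


# A task peaking at 300 CPU units and 700 MiB fits a 512 / 1024 MiB size.
size = recommend_task_size(300, 700)
```

A production system would weigh latency, traffic forecasts, and price rather than peak usage alone; the sketch only shows the basic shape of mapping observed demand onto a valid task size.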
KnowBe4's implementation of Sedai's autonomous optimization platform delivered results across cloud operations, cost management, and overall efficiency. Matt Duren shared the outcome: "We're happy to report that we achieved ROI in just five months. And [Finance] is very pleased about that. We actually had to borrow from our AWS budget in order to get our finance teams to approve taking on Sedai as a vendor. So it was a big risk that we took. And we're really happy that we did." This rapid return on investment validated KnowBe4's decision to adopt autonomous optimization and exceeded their initial expectations.
KnowBe4 successfully managed their rapid growth, handling a 58% year-over-year increase in ECS usage and a 422% growth in Lambda usage. The continuous optimization ensured that resources were always aligned with current needs, even as traffic patterns and application behaviors changed. This adaptability was crucial in supporting KnowBe4's expanding customer base and evolving service offerings.
The company achieved an overall 27% cost savings across their cloud infrastructure, demonstrating the impact of autonomous optimization at scale. Savings for individual services and environments were even higher: some ECS services in development environments saw up to 87% cost reduction, while production savings reached up to 50%. These savings allowed KnowBe4 to reallocate resources to other strategic initiatives and support their continued growth.
Nate Singletary provided a comprehensive overview of the results: "Of the just under 9,500 services in Sedai that we have, we're at 98% autonomous optimization. We've had over 1,100 autonomous actions in three months. We're projecting, again, 27% cost reduction." This high level of autonomous operation significantly reduced the manual workload on the engineering team.
Lambda functions saw particularly impressive improvements, with some cases achieving up to 99.3% cost savings. Matt provided a specific example that highlighted both cost and performance benefits: "This is a particular Lambda function that serves a really specific purpose in our production environment. We saw a 31% cost decrease for that function. But again, we saw a 54% decrease in the latency." This dual improvement in cost and performance was a key factor in the success of the implementation.
Performance enhancements were equally impressive, with significant latency reductions across services leading to improved customer experience. Matt shared a striking example that demonstrated the magnitude of these improvements: "We took it from an average response time of 18.5 seconds to 80 milliseconds or so. A 99.5% duration reduction." Such dramatic performance improvements not only enhanced the user experience but also allowed KnowBe4 to serve their customers more efficiently, supporting their mission of providing real-time cybersecurity training and awareness.
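The reduction Matt quotes can be checked with simple arithmetic. The short sketch below uses a hypothetical helper, `percent_reduction`, and takes the "80 milliseconds or so" figure as exactly 80 ms, which is an assumption. The result lands between 99.5% and 99.6%, in line with the quoted figure.

```python
def percent_reduction(before, after):
    """Percentage decrease from `before` to `after` (same units)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# 18.5 seconds down to roughly 80 milliseconds (0.080 seconds).
latency_reduction = percent_reduction(18.5, 0.080)
print(f"{latency_reduction:.2f}% reduction in average response time")
```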
Operational efficiency improved sharply: 98% of KnowBe4's 9,491 services now run autonomously, with over 1,100 autonomous actions performed in just three months. Beyond freeing up engineering time, the change made optimization proactive, as Matt commented: "By having Sedai in place, we're not just saving money, we're preventing would-be customer problems before they become an issue." This proactive approach also enhanced the reliability and stability of KnowBe4's services.
Release management saw significant improvements with the implementation of automatic evaluation for every new release. This system flagged releases with major deviations to application developers, increasing release confidence and enabling faster innovation. The ability to quickly identify and address potential issues in new releases helped KnowBe4 maintain their rapid development pace while ensuring the quality and performance of their services.
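As an illustration of what flagging a release with major deviations could look like, here is a minimal Python sketch. It is not Sedai's implementation: the `ReleaseMetrics` fields, the `flag_deviations` function, and the 20% threshold are assumptions chosen for the example. It compares a new release against the previous baseline and reports any metric that got worse by more than the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    """Hypothetical per-release measurements; lower is better for each."""
    cost_per_hour: float
    p95_latency_ms: float
    error_rate: float


def flag_deviations(baseline, candidate, threshold=0.20):
    """Return the names of metrics that regressed by more than `threshold`.

    `threshold` is a relative fraction, so 0.20 means 20% worse than baseline.
    """
    flagged = []
    for name in ("cost_per_hour", "p95_latency_ms", "error_rate"):
        before = getattr(baseline, name)
        after = getattr(candidate, name)
        if before == 0:
            # Any increase from a zero baseline counts as a regression.
            if after > 0:
                flagged.append(name)
            continue
        if (after - before) / before > threshold:
            flagged.append(name)
    return flagged


previous = ReleaseMetrics(cost_per_hour=1.0, p95_latency_ms=100.0, error_rate=0.01)
latest = ReleaseMetrics(cost_per_hour=1.1, p95_latency_ms=150.0, error_rate=0.01)
regressions = flag_deviations(previous, latest)  # latency is 50% worse
```

In this example only latency crosses the threshold, so only that metric would be surfaced to the application developers; the 10% cost increase stays below the assumed 20% cutoff.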
Reflecting on the impact of autonomous optimization, Nate Singletary highlighted their key objectives: "We want to reduce toil on our engineers. Allow them to focus on the things they like to do, releasing new products and features. And we also want to make sure our workloads are running efficiently. That means keep the velocity, keep releases coming, but keep costs at the front of our minds while also ensuring our services are performant." The implementation of Sedai's platform has allowed KnowBe4 to successfully meet these objectives, balancing cost efficiency with performance and enabling their engineering team to focus on high-value tasks.
Matt Duren summarized the impact on their SRE team, highlighting the shift from routine tasks to more valuable work: "I think the biggest change that I've seen is my teams are able to work on a lot more valuable projects. So there's always a ton of toil. And especially Google calls our approach kind of the kitchen sink SRE approach where no matter what kind of dirty dish you have, you just put it in the kitchen sink. And we found that to be super effective. But it does result in a lot of toil." The reduction in toil and the ability to focus on more strategic projects not only improved team productivity but also likely contributed to improved job satisfaction and reduced burnout risk among the engineering team.
The success of this implementation has positioned KnowBe4 to efficiently scale their cloud infrastructure while maintaining high performance and availability for their customers. By embracing autonomous cloud management, KnowBe4 has not only optimized their current operations but also laid the groundwork for future growth and innovation in their security awareness training platform. The ability to automatically optimize their AI and ML workloads alongside their more traditional services has given them a competitive edge in delivering cutting-edge cybersecurity solutions.
When asked for his recommendation to teams considering autonomous optimization, Matt Duren enthusiastically stated: "Just Do It! ... we've had an incredible journey with Sedai so far. It's been short. It's been very, very easy for us to implement." This endorsement underscores the positive impact and ease of implementation that KnowBe4 experienced with Sedai's autonomous optimization platform.