This article is based on an edited transcript of a talk given at Sedai's autocon/23 conference by Ramesh Nampelly.
Palo Alto Networks has become a leading force in cybersecurity by consistently innovating and optimizing its infrastructure to meet modern challenges. As the company’s product portfolio grows and the demand for reliability increases, it has become imperative to streamline operations, reduce costs, and enhance performance. This has led to the development of an Autonomous Platform designed to optimize Site Reliability Engineering (SRE) operations and manage cloud infrastructure efficiently.
The development of the Autonomous Platform stemmed from several challenges that arose as we rapidly grew over the past few years. This platform has been instrumental in addressing these challenges, particularly in the areas of SRE and operational excellence. While it benefits the entire engineering team, today I’ll focus on its impact on SREs and production operations.
Let’s dive into the challenges that led to the creation of this platform.
During the pandemic, Palo Alto Networks experienced 5x growth, which introduced significant operational and infrastructural challenges. This rapid expansion highlighted gaps in our systems and processes, which we needed to address swiftly to maintain the reliability and efficiency of our services.
One of the immediate effects of our rapid growth was the dramatic increase in our cloud spending. As we moved more workloads from data centers to the cloud, costs grew sharply, requiring us to balance financial optimization with maintaining the reliability of our services. Ensuring this balance became a shared responsibility between our SRE and FinOps teams, putting additional pressure on engineering.
With this increased workload, engineers began to experience fatigue. The demands of 24/7 operations, coupled with the sheer scale of our services, led to burnout. It was clear that without a solution, the growing responsibilities would continue to overwhelm the teams responsible for maintaining operational stability.
Palo Alto Networks operates in a unique environment compared to other companies. We don't just offer a single SaaS solution but manage a wide array of products and services:
This complexity meant that our SRE and operations teams faced an enormous challenge in keeping services running smoothly and efficiently, while also managing costs.
As we scaled, maintaining our service-level agreements (SLAs) with customers became more demanding. Our teams had to ensure high availability for critical services, all while balancing the costs associated with our cloud usage. This was particularly challenging as we handled increasing traffic volumes and workloads across a diverse product suite.
In addition to this, the teams faced growing pressure to collaborate with FinOps to find ways to reduce cloud expenditure without compromising on service reliability. Managing this balance added a new layer of responsibility to the teams already tasked with maintaining operational excellence.
This heavy workload and constant pressure led to burnout among engineers. Working around the clock to support services, many team members struggled to maintain the necessary pace, which further underscored the need for a more efficient, automated approach to managing our infrastructure.
As we scaled Palo Alto Networks, one of the core areas we focused on optimizing was our Site Reliability Engineering (SRE) function. The complexity of our environment, combined with rapid growth, exposed several key challenges that our SREs were facing. Addressing these challenges became a priority as they impacted both productivity and operational efficiency.
Let’s walk through the most pressing SRE challenges that the Autonomous Platform is designed to solve.
The first and perhaps most fundamental challenge is toil. In the context of SRE, toil refers to the repetitive tasks that engineers must perform manually, over and over. These tasks, often operational in nature, add no long-term value and can lead to significant stress and burnout among the team. Work that could be automated ends up being done by hand, which not only wastes valuable time but also frustrates engineers who feel unable to contribute to higher-value work.
Toil is a major source of inefficiency, and reducing it is essential for improving the well-being of our SRE teams as well as overall system reliability.
Another significant issue we’ve encountered is the use of isolated, disconnected tools across teams. Engineers often develop tools on an ad-hoc basis to meet immediate needs, but without the typical software development processes—like versioning, CI/CD pipelines, or guardrails. This has led to a "kitchen sink" of tools, many of which aren’t properly maintained or integrated into a cohesive system.
The result is an environment where new engineers find it difficult to navigate and understand the tooling landscape. Furthermore, these fragmented tools can sometimes introduce errors in production, adding another layer of operational risk.
Over time, managing this growing collection of isolated tools has added considerable overhead. As new tools are added without careful management, their complexity accumulates. This introduces technical debt, where maintaining these tools requires additional effort, draining time and resources from the SRE teams. Without proper governance, what starts as a helpful tool for solving a specific problem can become a liability over time.
At Palo Alto Networks, we manage over 30 different products, each with its own tech stack, architecture, and unique customer problems to solve. This diversity creates a significant challenge for our SRE teams, as supporting one product often does not translate into expertise in another. An engineer who is proficient in maintaining one product may find themselves starting from scratch when working with a different one, leading to inefficiencies and gaps in operational coverage.
Finally, as our customer base and workloads continue to expand, scaling the SRE team linearly is simply not feasible. The rate of growth in our operations far outpaces the ability to hire and onboard new engineers. This means that without a robust platform to help manage the increasing complexity, we risk overloading our existing SRE teams, exacerbating the problems of toil and burnout.
The Autonomous Platform is rooted in a clear vision and mission, designed to revolutionize the way Site Reliability Engineers (SREs) and production-supporting engineers work by leveraging production data in an autonomous manner. This allows organizations to scale their operations without a linear increase in resources, effectively supporting 10x customer growth.
The Autonomous Platform envisions a future where production data is fully and autonomously utilized to provide "best-in-class SRE support." The platform aims to enable sub-linear growth in resource consumption while supporting 10x customer scale. By automating many routine processes, the platform eliminates manual interventions, allowing engineers to focus on more strategic tasks.
The platform’s mission is to develop tools that empower SREs and production engineers by providing autonomous capabilities. These capabilities are designed to boost productivity, efficiency, and overall operational quality. By eliminating the repetitive toil often associated with daily operations, the platform helps engineers maintain higher service reliability and quality.
To ensure the successful implementation of the platform, four core operational excellence goals were established:
The Autonomous Platform's core purpose is to help organizations maintain service reliability and performance as they scale, all while managing operational costs. The goals outlined above ensure that organizations are prepared to detect and address issues faster, resolve them efficiently, and provide a seamless experience for end users—all while maintaining tight control over operational costs.
By integrating these capabilities into the platform, SREs, developers, and engineers alike can better understand the impact of their work on infrastructure and costs, ensuring that resources are used optimally.
When designing and building a platform intended to support modern enterprise needs, a set of clear architectural principles and foundational goals is essential. At Palo Alto Networks, the Autonomous Platform has been built with a focus on providing a resilient, scalable, and modular architecture. Here, we’ll explore the platform’s key goals, approaches, and technology stack, along with the capabilities developed to streamline production and operations.
The first step in developing an enterprise-grade platform is establishing clear architectural goals. At Palo Alto Networks, the following goals were prioritized:
The architecture of the Autonomous Platform adheres to a modular and loosely coupled design, ensuring flexibility and adaptability across various products. Below are some of the core approaches that guide the platform’s structure:
The choice of core technologies underpins the platform's architecture, providing essential capabilities for observability, automation, and policy management. Here’s a breakdown of the technologies in use:
The Autonomous Platform brings together resource management, infrastructure management, and production management under a unified framework. This integration is achieved via the Developer Portal, which offers:
Other key capabilities of the platform include:
The Autonomous Platform built by Palo Alto Networks is not only a technical achievement but also a forward-thinking solution that combines scalability, extensibility, and ease of use for engineers. With a solid foundation in modular design and a carefully chosen tech stack, it empowers Site Reliability Engineers and developers alike to enhance system performance, manage costs, and automate repetitive tasks. The continued evolution of this platform ensures that as enterprise demands grow, the tools to support them will scale efficiently and effectively.
At Palo Alto Networks, cost management has become a critical part of our Autonomous Platform. While much of the platform is developed in-house, we’ve adopted an extensible framework that allows integration with external vendors when it makes sense. This helps us focus engineering efforts where they are most needed while leveraging third-party solutions for specific needs.
One key area of integration is cost management for serverless and Kubernetes workloads. While open-source tools like OpenCost handle basic cost tracking, optimizing serverless costs presented challenges. After evaluating various solutions, we integrated Sedai to optimize both cost and performance for our serverless operations.
We are now extending Sedai to manage Kubernetes workloads, ensuring cost efficiency as we scale.
By integrating Sedai, we've streamlined our cost management approach, allowing us to focus on innovation while keeping operations efficient and cost-effective.
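For the basic cost-tracking side, an OpenCost query can be as simple as a call to its allocation API. The sketch below is a minimal illustration, assuming the OpenCost service has been port-forwarded locally; the exact endpoint path, port, and response shape may vary by OpenCost version, so treat it as a starting point rather than a picture of our production tooling.

```python
# Minimal sketch: pull per-namespace cost allocation from OpenCost.
# Assumes the OpenCost service is port-forwarded locally, e.g.:
#   kubectl port-forward -n opencost service/opencost 9003
# Endpoint path and parameters may differ by version; check the OpenCost docs.
import requests

OPENCOST_URL = "http://localhost:9003/allocation"  # assumed local port-forward

resp = requests.get(
    OPENCOST_URL,
    params={"window": "7d", "aggregate": "namespace", "accumulate": "true"},
    timeout=30,
)
resp.raise_for_status()

# Each entry maps an aggregate (here, a namespace) to its cost breakdown.
for allocations in resp.json().get("data", []):
    for namespace, alloc in allocations.items():
        print(f"{namespace}: ${alloc.get('totalCost', 0):.2f}")
```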
When managing serverless environments, we faced several key challenges that demanded constant attention. The dynamic nature of serverless functions, such as AWS Lambda and Google Cloud Functions, made optimizing performance and controlling costs particularly tricky. Here's a quick summary of the main issues:
These challenges repeated with each new release, stretching our SRE team’s bandwidth. We needed a solution that could optimize performance while managing costs efficiently.
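To make one of these trade-offs concrete: Lambda pricing couples memory to both cost and speed, since raising memory raises the per-millisecond price but also allocates more CPU, often shortening duration. The sketch below uses example us-east-1 rates and hypothetical per-setting durations (both are assumptions, not our measured data) to show why the cheapest configuration is rarely the smallest one.

```python
# Illustrative sketch of the serverless memory/duration trade-off.
# Rates are example x86 Lambda prices (assumption -- check current AWS pricing);
# durations are hypothetical profiling results for one function.

PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.0000002

# memory_mb -> observed average duration_ms (hypothetical data)
observed = {128: 2400, 256: 1150, 512: 560, 1024: 300, 2048: 290}

def cost_per_million(memory_mb: int, duration_ms: float) -> float:
    """Cost of one million invocations at a given memory setting."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return 1_000_000 * (gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST)

for mem, dur in observed.items():
    print(f"{mem:>5} MB: {dur:>5} ms -> ${cost_per_million(mem, dur):,.2f} per 1M calls")
# With these numbers, 512 MB minimizes cost: more memory shortens duration
# enough to pay for itself up to a point, then stops helping.
```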
After reviewing various application performance management (APM) and cost management tools, we concluded that traditional solutions were either too limited or too reactive for our platform’s needs. Sedai stood out as the best vendor to integrate with our core platform for cost optimization, thanks to its autonomous approach. For our serverless workloads, its platform provided:
- Optimization of compute, storage, and data
- A choice of copilot or autopilot execution
- Continuous improvement with reinforcement learning
We’ve seen positive results from integrating Sedai and plan to expand its use across our platform to further streamline serverless operations while keeping costs under control.
Cloud cost optimization is crucial, particularly in Kubernetes and serverless environments, where complexity can stack up quickly. This section explores a practical approach to optimizing costs and improving performance, starting with Kubernetes and moving into serverless functions, and outlines key strategies, challenges, and optimization techniques.
When managing Kubernetes clusters, one major task is optimizing resource allocation without impacting performance. Here’s a breakdown of the optimization approach:
This approach combines both proactive and reactive strategies, ensuring that both initial and ongoing optimizations are addressed, adapting dynamically to system demands.
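As a simplified illustration of the proactive side of this approach, the sketch below derives a container's CPU request from observed usage percentiles plus headroom. The sample data, percentile, and headroom factor are assumptions chosen for illustration; in practice the inputs would come from a metrics store such as Prometheus, and the reactive path would adjust requests again as live demand shifts.

```python
# Right-sizing sketch: recommend a CPU request from observed usage.
# Sample data, percentile, and headroom are illustrative assumptions.

def recommend_request(samples: list[float], percentile: float = 0.95,
                      headroom: float = 1.15) -> float:
    """Recommend a resource request: p95 of observed usage plus 15% headroom."""
    ordered = sorted(samples)
    idx = min(int(percentile * len(ordered)), len(ordered) - 1)
    return ordered[idx] * headroom

# Hypothetical per-minute CPU usage (in cores) for one container.
cpu_usage = [0.12, 0.18, 0.25, 0.22, 0.40, 0.31, 0.27, 0.19, 0.45, 0.30]

current_request = 1.0  # cores currently requested in the pod spec
recommended = recommend_request(cpu_usage)
print(f"current: {current_request} cores, recommended: {recommended:.2f} cores")
# Reactive path: if live usage later exceeds the recommendation, a controller
# (or the VPA) would raise requests again rather than let the workload starve.
```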
While Kubernetes offers a scalable and flexible infrastructure, there are specific challenges associated with managing its costs:
Serverless architectures also benefit from similar optimization strategies. The following results have been observed:
Serverless cost optimization follows a structured, autonomous approach:
This approach helps achieve continuous, automated optimization without requiring constant manual intervention. By addressing the core areas of memory, CPU, and concurrency, businesses can see noticeable improvements in both performance and cost-efficiency.
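As an illustration of what the final "act" step of such a loop might look like on AWS, the sketch below applies a memory recommendation to a Lambda function via boto3's update_function_configuration call. The function name and value are hypothetical, AWS credentials are assumed to be configured, and a real autonomous system would verify post-change metrics and roll back on any regression.

```python
# Sketch of the "act" step in an autonomous optimization loop: apply a
# memory recommendation (e.g. from a cost/latency sweep like the earlier
# sketch) to a Lambda function. Names and values here are hypothetical.
import boto3

lambda_client = boto3.client("lambda")

def apply_memory_recommendation(function_name: str, memory_mb: int) -> None:
    """Update a Lambda function's memory; allocated CPU scales with it."""
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        MemorySize=memory_mb,
    )

# Example: apply_memory_recommendation("orders-api-handler", 512)
# A production loop would then watch latency/error metrics for the new
# setting and revert automatically if performance regresses.
```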
As we embark on our Kubernetes cost optimization journey, early results provide both insight and encouragement. Currently, with a limited number of Kubernetes environments onboarded, we have realized approximately 2% in cost savings. While this may seem modest, it is important to note that we are only in the early stages of this process, and we anticipate significant improvements as we continue.
These early results show promise, and by scaling these strategies across all clusters, we aim to unlock even more substantial savings.
In conclusion, Palo Alto Networks has successfully addressed the challenges brought by rapid growth and cloud infrastructure expansion through the development of its Autonomous Platform. This platform has streamlined SRE operations, reduced costs, and improved performance by automating repetitive tasks and optimizing resource management. By integrating tools like Sedai for serverless and Kubernetes optimization, the company has further enhanced cost efficiency while maintaining high service reliability. As Palo Alto Networks continues to evolve, the Autonomous Platform plays a crucial role in ensuring scalable, resilient operations that meet the demands of a growing customer base.