
Autonomous Optimization at Palo Alto Networks

Last updated: November 20, 2024


This article is based on an edited transcript of a talk at Sedai's autocon/23 conference by Ramesh Nampelly.

Palo Alto Networks has become a leading force in cybersecurity by consistently innovating and optimizing its infrastructure to meet modern challenges. As the company’s product portfolio grows and the demand for reliability increases, it has become imperative to streamline operations, reduce costs, maintain operational excellence, and enhance performance. This has led to the development of an Autonomous Platform designed to optimize Site Reliability Engineering (SRE) operations and manage cloud infrastructure efficiently.

Recognizing the Challenges of Rapid Cloud Growth

The development of the Autonomous Platform stemmed from several challenges that arose as we rapidly grew over the past few years. This platform has been instrumental in addressing these challenges, particularly in the areas of SRE and operational excellence. While it benefits the entire engineering team, today I’ll focus on its impact on SREs and production operations.

Let’s dive into the challenges that led to the creation of this platform.

Challenges Brought by Rapid Growth

During the pandemic, Palo Alto Networks experienced 5x growth, which introduced significant operational and infrastructural challenges. This rapid expansion highlighted gaps in our systems and processes, which we needed to address swiftly to maintain the reliability and efficiency of our services.

One of the immediate effects of our rapid growth was a dramatic increase in cloud spending. As we moved more workloads from data centers to the cloud, costs grew sharply, requiring us to balance financial optimization with service reliability. Maintaining this balance became a shared responsibility between our SRE and FinOps teams, placing additional pressure on engineering.

With this increased workload, engineers began to experience fatigue. The demands of 24/7 operations, coupled with the sheer scale of our services, led to burnout. It was clear that without a solution, the growing responsibilities would continue to overwhelm the teams responsible for maintaining operational stability.

The Scale of Our Infrastructure

Palo Alto Networks operates in a unique environment compared to other companies. We don't just offer a single SaaS solution but manage a wide array of products and services:

  • We support 34 different products, each with its own technical requirements and customer needs.
  • These products are backed by over 50,000 microservices, which need to be maintained, optimized, and scaled continuously.
  • Our infrastructure spans 3 public cloud platforms, adding complexity to the management of our services.
  • We also manage 6 colocation data centers, which handle various production workloads outside of the public cloud.

This complexity meant that our SRE and operations teams faced an enormous challenge in keeping services running smoothly and efficiently, while also managing costs.

The Impact on SRE and Production Operations

As we scaled, maintaining our service-level agreements (SLAs) with customers became more demanding. Our teams had to ensure high availability for critical services, all while balancing the costs associated with our cloud usage. This was particularly challenging as we handled increasing traffic volumes and workloads across a diverse product suite.

In addition to this, the teams faced growing pressure to collaborate with FinOps to find ways to reduce cloud expenditure without compromising on service reliability. Managing this balance added a new layer of responsibility to the teams already tasked with maintaining operational excellence.

This heavy workload and constant pressure led to burnout among engineers. Working around the clock to support services, many team members struggled to maintain the necessary pace, which further underscored the need for a more efficient, automated approach to managing our infrastructure.

Key Challenges and Goals for Autonomous Platform

As we scaled Palo Alto Networks, one of the core areas we focused on optimizing was our Site Reliability Engineering (SRE) function. The complexity of our environment, combined with rapid growth, exposed several key challenges that our SREs were facing. Addressing these challenges became a priority as they impacted both productivity and operational efficiency.

Let’s walk through the most pressing SRE challenges that the Autonomous Platform is designed to solve.

Addressing the Problem of Toil

The first and perhaps most fundamental challenge is toil. In the context of SRE, toil refers to repetitive tasks that engineers must perform manually again and again. These tasks, often operational in nature, do not add long-term value and can lead to significant stress and burnout among the team. Tasks that could potentially be automated end up being performed manually, which not only wastes valuable time but also causes frustration among engineers who feel like they are unable to contribute to higher-value work.

Toil is a major source of inefficiency, and reducing it is essential for improving the well-being of our SRE teams as well as overall system reliability.

Isolated and Fragmented Tooling

Another significant issue we’ve encountered is the use of isolated, disconnected tools across teams. Engineers often develop tools on an ad-hoc basis to meet immediate needs, but without the typical software development processes—like versioning, CI/CD pipelines, or guardrails. This has led to a "kitchen sink" of tools, many of which aren’t properly maintained or integrated into a cohesive system.

The result is an environment where new engineers find it difficult to navigate and understand the tooling landscape. Furthermore, these fragmented tools can sometimes introduce errors in production, adding another layer of operational risk.

Tool Management Overhead

Over time, managing this growing collection of isolated tools has added considerable overhead. As new tools are added without careful management, their complexity accumulates. This introduces technical debt, where maintaining these tools requires additional effort, draining time and resources from the SRE teams. Without proper governance, what starts as a helpful tool for solving a specific problem can become a liability over time.

Complexity Across Multiple Products

At Palo Alto Networks, we manage over 30 different products, each with its own tech stack, architecture, and unique customer problems to solve. This diversity creates a significant challenge for our SRE teams, as supporting one product often does not translate into expertise in another. An engineer who is proficient in maintaining one product may find themselves starting from scratch when working with a different one, leading to inefficiencies and gaps in operational coverage.

Scaling the SRE Team

Finally, as our customer base and workloads continue to expand, scaling the SRE team linearly is simply not feasible. The rate of growth in our operations far outpaces the ability to hire and onboard new engineers. This means that without a robust platform to help manage the increasing complexity, we risk overloading our existing SRE teams, exacerbating the problems of toil and burnout.

Autonomous Platform

The Autonomous Platform is rooted in a clear vision and mission, designed to revolutionize the way Site Reliability Engineers (SREs) and production-supporting engineers work by leveraging production data in an autonomous manner. This allows organizations to scale their operations without a linear increase in resources, effectively supporting 10x customer growth.

Vision

The Autonomous Platform envisions a future where production data is fully and autonomously utilized to provide "best-in-class SRE support." The platform aims to enable sub-linear growth in resource consumption while supporting 10x customer scale. By automating many routine processes, the platform eliminates manual interventions, allowing engineers to focus on more strategic tasks.

Mission

The platform’s mission is to develop tools that empower SREs and production engineers by providing autonomous capabilities. These capabilities are designed to boost productivity, efficiency, and overall operational quality. By eliminating the repetitive toil often associated with daily operations, the platform helps engineers maintain higher service reliability and quality.

Operational Excellence Goals

To ensure the successful implementation of the platform, four core operational excellence goals were established:

  1. Reduce Mean Time to Detect (MTTD)
    • Golden Signals: Implement real-time monitoring of system health.
    • Anomaly to Incident: Quickly detect anomalies and automatically convert them into incidents before the customer even notices an issue (see the sketch after this list).
    • Proactive Issue Identification: Shift towards proactive problem identification, reducing the potential for customer-facing disruptions.
  2. Reduce Mean Time to Repair (MTTR)
    • Multi-Region Resiliency: Ensure services are resilient across multiple regions, reducing downtime.
    • Automatic Rollbacks: Develop and deploy automatic rollback systems to mitigate risks when issues arise.
    • Auto Remediations: Automate the remediation process to minimize manual interventions, improving resolution times.
  3. Improve Performance
    • Auto Scaling: Introduce dynamic scaling capabilities to handle fluctuating demands without delays.
    • Metrics and Insights: Gather comprehensive metrics and insights from production environments, enabling smarter and autonomous decision-making to improve overall system performance.
    • Sub-Millisecond Experience: Continuously strive to provide a near-zero latency experience for end users.
  4. Manage Costs
    • Cost Attribution: Accurately attribute costs to specific services or operations to gain better visibility into where resources are being spent.
    • Cost Management: Implement robust cost management strategies, ensuring that as services scale, costs remain under control.
    • Cost Analytics: Make cost data accessible at every level of the engineering team, including developers, so that all stakeholders are aware of how their work impacts operational costs.
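
To make the "Anomaly to Incident" goal above concrete, here is a minimal sketch of how a golden-signal reading might be flagged and converted into an incident automatically. The metric values, the z-score threshold, and the incident endpoint are illustrative assumptions, not a description of the platform's actual implementation.

```python
# Minimal, illustrative sketch: turn a golden-signal anomaly into an incident.
# The metric history, thresholds, and incident API below are hypothetical,
# not Palo Alto Networks' actual implementation.
import statistics
import requests

INCIDENT_API = "https://incidents.example.internal/v1/incidents"  # hypothetical endpoint

def is_anomalous(samples: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest data point if it sits more than z_threshold
    standard deviations away from the recent baseline."""
    if len(samples) < 10:
        return False  # not enough history to judge
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples) or 1e-9  # guard against zero variance
    return abs(latest - mean) / stdev > z_threshold

def raise_incident(service: str, signal: str, value: float) -> None:
    """Create an incident before a customer notices the degradation."""
    requests.post(INCIDENT_API, json={
        "service": service,
        "signal": signal,        # e.g. latency, errors, traffic, saturation
        "observed_value": value,
        "severity": "high",
        "source": "golden-signal-monitor",
    }, timeout=5)

# Example: p99 latency samples (ms) for a service over the last few minutes.
history = [118, 121, 119, 122, 120, 117, 123, 119, 121, 120]
latest_p99 = 410.0
if is_anomalous(history, latest_p99):
    raise_incident("checkout-api", "latency_p99_ms", latest_p99)
```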

The Autonomous Platform's core purpose is to help organizations maintain service reliability and performance as they scale, all while managing operational costs. The goals outlined above ensure that organizations are prepared to detect and address issues faster, resolve them efficiently, and provide a seamless experience for end users—all while maintaining tight control over operational costs.

By integrating these capabilities into the platform, SREs, developers, and engineers alike can better understand the impact of their work on infrastructure and costs, ensuring that resources are used optimally.

Architectural Principles and Framework Goals

When designing and building a platform intended to support modern enterprise needs, a set of clear architectural principles and foundational goals is necessary. At Palo Alto Networks, the Autonomous Platform has been built with a focus on providing a resilient, scalable, and modular architecture. Here, we’ll explore the platform’s key goals, approaches, and technology stack, along with the key capabilities of the platform that have been developed to streamline production and operations.

The first step in developing an enterprise-grade platform is establishing clear architectural goals. At Palo Alto Networks, the following goals were prioritized:

  • Framework and Foundation: We needed a robust foundation that allows adding new functionalities with clarity and safety. This framework ensures that engineers can contribute seamlessly.
  • Empower SREs: The platform empowers Site Reliability Engineers (SREs) by enabling them to add or modify functionalities even without in-depth software engineering knowledge. This lowers the barrier to automation, allowing SREs to create workflows without significant coding expertise.
  • Extensibility: The platform must be extensible across products while remaining adaptable to the specific needs of each one. With over 30 products at Palo Alto Networks, it must support multi-team operations.
  • Multi-Tenancy: Multi-team tenancy is essential. Each product team should have access to their specific platform capabilities while sharing the broader platform’s resources.
  • Resiliency & High Availability: Ensuring the platform supports high availability, disaster recovery (DR), and break-glass modes is crucial to maintaining operational continuity.

Key Architectural Approaches

The architecture of the Autonomous Platform adheres to a modular and loosely coupled design, ensuring flexibility and adaptability across various products. Below are some of the core approaches that guide the platform’s structure:

  • Modular, Loosely Coupled: The platform is built to be modular, making it easier to add, modify, or replace individual components without affecting the entire system.
  • Product-Agnostic Layers: While certain product-specific requirements exist, the core of the platform remains product-agnostic, meaning it can be leveraged by various teams across different products.
  • Buy vs Build: After evaluating existing solutions, the team found that there were no commercial offerings that fully addressed their needs. As a result, the decision was made to build the core platform leveraging open-source frameworks where applicable.
  • Kubernetes Runtime: The platform uses Kubernetes as its runtime environment, allowing scalability and containerized management across multiple cloud platforms. Provisioning is handled using GitOps, ensuring infrastructure as code.
  • Contribution Model: Instead of a centralized team managing all platform development, an inner sourcing or contribution model is adopted. This enables engineers across different teams to contribute and enhance platform functionalities.

Core Technology Stack

The choice of core technologies underpins the platform's architecture, providing essential capabilities for observability, automation, and policy management. Here’s a breakdown of the technologies in use:

  • Grafana Stack: For log, metric, and trace collection, Grafana was selected as the best fit for observability. It supports visualization, querying, and alert generation.
  • StackStorm: For ad-hoc actions and event-driven automation, StackStorm was chosen. This allows SREs to run scripts and workflows in production environments while maintaining auditability.
  • Open Policy Agent (OPA): OPA provides policy evaluation capabilities, ensuring security and compliance across the platform (a minimal query sketch follows this list).
  • Backstage: An internal developer portal framework built on micro front-ends, Backstage offers a simplified interface for engineers to interact with platform capabilities. Customized to fit Palo Alto Networks' specific needs, Backstage plays a crucial role in enabling developers to access documentation, templates, and tools.
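
To illustrate how a service can consult OPA for a policy decision at run time, the sketch below posts an input document to OPA's standard Data API (`POST /v1/data/<path>`) and reads back an allow/deny result. The policy package `platform.deploy` and its input fields are hypothetical; only the endpoint shape comes from OPA itself.

```python
# Illustrative only: query a policy decision from a locally running OPA server.
# The policy package "platform.deploy" and its input fields are hypothetical;
# the /v1/data/<path> endpoint is OPA's standard Data API.
import requests

OPA_URL = "http://localhost:8181/v1/data/platform/deploy/allow"

def deployment_allowed(image: str, replicas: int) -> bool:
    """Ask OPA whether this deployment request complies with policy.
    The assumed policy might, for example, require an approved registry
    and cap replica counts."""
    payload = {"input": {"image": image, "replicas": replicas}}
    resp = requests.post(OPA_URL, json=payload, timeout=3)
    resp.raise_for_status()
    # OPA returns {"result": true/false}; a missing "result" means the rule is undefined.
    return bool(resp.json().get("result", False))

if __name__ == "__main__":
    print(deployment_allowed("registry.example.internal/app:1.4.2", replicas=6))
```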

Platform Capabilities Overview

The Autonomous Platform brings together resource management, infrastructure management, and production management under a unified framework. This integration is achieved via the Developer Portal, which offers:

  • Service Catalogue: Centralized repository for services, simplifying access and deployment.
  • Service Templates & Golden Paths: Predefined workflows and templates to streamline processes and ensure best practices.
  • Documentation as Code: Seamless access to code-based documentation for consistency and efficiency.

Other key capabilities of the platform include:

  • Infrastructure Management: Handling policy management, environment management, and multi-cloud orchestration through cost-efficient means.
  • Resource Management: Managing cloud infrastructure (e.g., AWS, GCP) and network resources.
  • Production Management: Ensuring observability, incident management, and providing the right auto-remediation mechanisms to maintain service reliability.

The Autonomous Platform built by Palo Alto Networks is not only a technical achievement but a forward-thinking solution that combines scalability, extensibility, and ease of use for engineers. With a solid foundation in modular design and a carefully chosen tech stack, it empowers site reliability engineers and developers alike to enhance system performance, manage costs, and automate repetitive tasks. The continued evolution of this platform ensures that as enterprise demands grow, the tools to support them will scale efficiently and effectively.

Integration of Sedai for Cost Management

At Palo Alto Networks, cost management has become a critical part of our Autonomous Platform. While much of the platform is developed in-house, we’ve adopted an extensible framework that allows integration with external vendors when it makes sense. This helps us focus engineering efforts where they are most needed while leveraging third-party solutions for specific needs.

Sedai Integration for Cost Optimization

One key area of integration is cost management for serverless and Kubernetes workloads. While open-source tools like OpenCost handle basic cost tracking, optimizing serverless costs presented challenges. After evaluating various solutions, we integrated Sedai to optimize both cost and performance for our serverless operations.

  • Why Sedai? Sedai provides automation and insights that were hard to achieve with open-source tools, helping us reduce costs and improve performance.

We are now extending Sedai to manage Kubernetes workloads, ensuring cost efficiency as we scale.

By integrating Sedai, we've streamlined our cost management approach, allowing us to focus on innovation while keeping operations efficient and cost-effective.

Challenges and Results in Serverless Optimization

When managing serverless environments, we faced several key challenges that demanded constant attention. The dynamic nature of serverless functions, such as AWS Lambda and Google Cloud Functions, made optimizing performance and controlling costs particularly tricky. Here's a quick summary of the main issues:

Serverless Key Challenges

  • Right-sizing functions: Allocating just the right amount of resources for each function without over-provisioning (see the cost sketch after this list).
  • Managing concurrency: Handling multiple instances without bottlenecks.
  • Warming up functions: Deciding when and how to keep functions warm to avoid delays.
  • Tracing errors: Diagnosing issues in a distributed, stateless environment.
  • Cost control: Keeping costs low as functions scale.
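
To show why right-sizing is worth the effort, here is a back-of-the-envelope cost model for a single function. The per-GB-second and per-request prices are approximate on-demand Lambda rates (they vary by region and architecture), and the duration figures are invented to illustrate the memory-versus-runtime trade-off; this is not a description of how any particular tool calculates costs.

```python
# Back-of-the-envelope Lambda cost model (illustrative; prices are approximate
# x86 on-demand rates and vary by region). Duration numbers are made up to show
# how memory allocation and runtime trade off against each other.
PRICE_PER_GB_SECOND = 0.0000166667    # approximate compute price
PRICE_PER_REQUEST = 0.20 / 1_000_000  # approximate request price

def monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST

# Hypothetical measurements for one function at two memory settings.
invocations = 50_000_000
print(f"1024 MB, 220 ms: ${monthly_cost(invocations, 220, 1024):,.2f}")
print(f" 512 MB, 260 ms: ${monthly_cost(invocations, 260, 512):,.2f}")
```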

These challenges recurred with each new release, stretching our SRE team’s bandwidth. We needed a solution that could optimize performance while managing costs efficiently.

Why We Chose Sedai for Autonomous Optimization

After reviewing various application performance management (APM) and cost management tools, we concluded that traditional solutions were either too limited or reactive for our platform’s needs. Upon discovering Sedai, we identified it as the best vendor to integrate with our core platform for optimizing cost management, thanks to its autonomous approach. Here's what we found:

We compared five categories of tooling: APM tools, insight tools, recommendation tools, rule-based static optimization, and autonomous optimization (Sedai). Each was assessed on four criteria: comprehensive optimization (coverage varied among the conventional tools), whether it enhances production safety, whether it can act directly in production, and whether it learns continuously.
After evaluating several tools, we found that Sedai offered the best solution for optimizing serverless workloads. Traditional APM and cost management tools were either too limited or reactive, but Sedai's autonomous platform provided:

  • Comprehensive optimization: Continuous learning and real-time adjustments.
  • Production safety: Enhancing stability while making real-time optimizations.
  • Cost-effectiveness: Reducing costs without sacrificing performance.

We’ve seen positive results from integrating Sedai and plan to expand its use across our platform to further streamline serverless operations while keeping costs under control.

Challenges and Approach for Kubernetes Optimization

Cloud cost optimization is crucial, particularly in Kubernetes and serverless environments where complexity can stack up quickly. This section outlines a practical approach to optimizing costs and improving performance, starting with Kubernetes and moving on to serverless functions, and covers the key strategies, challenges, and optimization techniques involved.

Key Strategies for Kubernetes Optimization

When managing Kubernetes clusters, one major task is optimizing resource allocation without impacting performance. Here’s a breakdown of the optimization approach:

  1. Rightsize Workloads (see the sketch after this list):
    • Capabilities: Horizontal and vertical scaling.
    • Impact: Achieves around 20–30% savings by scaling workloads up or down based on actual usage.
  2. Rightsize Infrastructure:
    • Capabilities: Selecting the right types of infrastructure and node groups.
    • Impact: Results in 15–25% cost savings by configuring the infrastructure optimally.
  3. Purchase at Lowest Cost:
    • Capabilities: Leveraging on-demand, reserved instances (RIs), and spot optimization.
    • Impact: Potential savings of up to 72–90% through cost-effective purchasing strategies.
  4. Adapt to Traffic Changes:
    • Capabilities: Predictive auto-scaling that adjusts based on incoming traffic.
    • Impact: Ensures performance needs are met at peak times without over-provisioning resources.
  5. Adapt to New Releases:
    • Capabilities: Release intelligence to help adapt to newer Kubernetes versions and changes.
    • Impact: Continuous optimization by tracking the performance of new releases over time.
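
To ground the workload right-sizing step above, here is a minimal sketch that applies new CPU and memory requests to a Deployment using the official Kubernetes Python client. The deployment, namespace, container name, and resource values are placeholders; an autonomous system would derive them from observed utilization and roll the change out gradually.

```python
# Minimal workload-rightsizing sketch using the official Kubernetes Python client.
# Deployment, namespace, container name, and resource values are illustrative;
# real values would be derived from observed utilization.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

def rightsize_deployment(name: str, namespace: str, container: str,
                         cpu_request: str, memory_request: str) -> None:
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": container,  # strategic merge patch keys on container name
                        "resources": {
                            "requests": {"cpu": cpu_request, "memory": memory_request},
                        },
                    }]
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=name, namespace=namespace, body=patch)

rightsize_deployment("checkout-api", "payments", "app",
                     cpu_request="250m", memory_request="384Mi")
```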

This approach combines both proactive and reactive strategies, ensuring that both initial and ongoing optimizations are addressed, adapting dynamically to system demands.

Key Challenges in Kubernetes Cost Management

While Kubernetes offers a scalable and flexible infrastructure, there are specific challenges associated with managing its costs:

  • Allocation of Total Costs by Namespace and Tags: Proper tagging is necessary to accurately allocate costs within the clusters, especially when multiple namespaces are involved (a simple allocation sketch follows this list).
  • Levels of Abstraction: Kubernetes abstracts many aspects of resource allocation, reducing the transparency of cost tracking. This makes it difficult to pinpoint where expenses are accumulating without appropriate monitoring tools.
  • Multi-Cloud Compatibility: Managing Kubernetes across multiple cloud providers (AWS, GCP, Azure) presents additional complexity, as cost structures vary across platforms.
  • From Recommendations to Actions: Many tools offer cost-saving recommendations, but translating these into actionable steps can be a hurdle. Autonomous tools are needed to execute these recommendations.
  • High Volume of Low-Cost Opportunities: Open-source tools often highlight small, low-value cost savings opportunities. Individually, they may seem insignificant, but cumulatively, they can have a significant impact on the overall budget.
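
As a toy illustration of the first challenge above, the snippet below splits a cluster's monthly bill across namespaces in proportion to their CPU-hours. Production-grade allocation (for example, as done by OpenCost) also weighs memory, storage, and idle capacity; all figures here are invented.

```python
# Toy cost-allocation sketch: split a cluster bill across namespaces by CPU-hours.
# Real tools (e.g. OpenCost) also factor in memory, storage, GPUs, and idle
# capacity; the figures below are invented for illustration.
monthly_cluster_cost = 42_000.00  # USD, hypothetical

cpu_hours_by_namespace = {
    "product-a": 18_500,
    "product-b": 9_200,
    "shared-observability": 4_300,
}

total_cpu_hours = sum(cpu_hours_by_namespace.values())
for namespace, cpu_hours in cpu_hours_by_namespace.items():
    share = cpu_hours / total_cpu_hours
    print(f"{namespace:<22} {share:6.1%}  ${monthly_cluster_cost * share:,.2f}")
```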

Serverless Optimization: Results and Strategies

Serverless architectures also benefit from similar optimization strategies. The following results have been observed:

  • Latency Improvement: Achieved a 22% improvement in latency.
  • Cost Reduction: Observed an 11% overall reduction in cloud costs, thanks to focused optimization efforts.

Serverless cost optimization follows a structured, autonomous approach:

  1. Optimize Memory/CPU (see the sketch after this list):
    • Autonomous Optimization: Automated processes to fine-tune memory and CPU usage.
    • Cost Impact: Reduces costs by around 20%.
  2. Manage Concurrency:
    • Autonomous Concurrency Management: Leverages traffic prediction and concurrency management to ensure efficient scaling.
    • Cost Impact: Adds an additional 10% in cost reduction.
  3. Adapt to New Releases:
    • Release Intelligence: Ensures smooth transitions and optimizations when adopting new service releases or infrastructure upgrades.
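
As a deliberately simplified illustration of what applying such an optimization can look like on AWS Lambda, the sketch below updates a function's memory setting and pins provisioned concurrency on an alias using boto3. The function name, alias, and values are placeholders; in practice they would be derived from observed traffic and rolled back if performance regressed.

```python
# Simplified illustration of applying a serverless optimization on AWS Lambda
# with boto3. The function name, alias, and values are placeholders; in practice
# these would come from observed metrics and be reverted if performance degrades.
import boto3

lambda_client = boto3.client("lambda")

def apply_optimization(function_name: str, alias: str,
                       memory_mb: int, provisioned_concurrency: int) -> None:
    # Right-size memory (which also scales the CPU allocated to the function).
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        MemorySize=memory_mb,
    )
    # Keep enough instances warm for predicted traffic to avoid cold starts.
    lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,  # provisioned concurrency attaches to a version or alias
        ProvisionedConcurrentExecutions=provisioned_concurrency,
    )

apply_optimization("orders-processor", alias="live",
                   memory_mb=512, provisioned_concurrency=25)
```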

This approach helps achieve continuous, automated optimization without requiring constant manual intervention. By addressing the core areas of memory, CPU, and concurrency, businesses can see noticeable improvements in both performance and cost-efficiency.

Cost and Performance Impact

As we embark on our Kubernetes cost optimization journey, early results provide both insight and encouragement. Currently, with a limited number of Kubernetes environments onboarded, we have realized approximately 2% in cost savings. While this may seem modest, it is important to note that we are only in the early stages of this process, and we anticipate significant improvements as we continue.

Key Insights from Current Kubernetes Cost Optimization:

  • Current Savings: We have achieved approximately 2% savings in the Kubernetes environments that have been optimized so far.
  • Pending Opportunities: Based on the recommendations provided, once we fully implement the optimizations across our entire infrastructure, we are expecting potential cost savings of up to 61%.

These early results show promise, and by scaling these strategies across all clusters, we aim to unlock even more substantial savings.

Conclusion

In conclusion, Palo Alto Networks has successfully addressed the challenges brought by rapid growth and cloud infrastructure expansion through the development of its Autonomous Platform. This platform has streamlined SRE operations, reduced costs, and improved performance by automating repetitive tasks and optimizing resource management. By integrating tools like Sedai for serverless and Kubernetes optimization, the company has further enhanced cost efficiency while maintaining high service reliability. As Palo Alto Networks continues to evolve, the Autonomous Platform plays a crucial role in ensuring scalable, resilient operations that meet the demands of a growing customer base.
