Frequently Asked Questions

Autonomous Platform & SRE Operations at Palo Alto Networks

What challenges did Palo Alto Networks face during its rapid cloud growth?

Palo Alto Networks experienced a 5x growth during the pandemic, which led to significant operational and infrastructural challenges. These included a dramatic increase in cloud spending, the need to balance financial optimization with service reliability, and increased pressure on SRE and FinOps teams. The scale of operations, with 34 products and more than 50,000 microservices across three public clouds and six colocation data centers, added complexity and contributed to engineer fatigue and burnout.

How did the Autonomous Platform help address SRE challenges at Palo Alto Networks?

The Autonomous Platform was designed to reduce toil, eliminate fragmented tooling, and manage complexity across multiple products. It automates repetitive tasks, provides a modular and extensible architecture, and empowers SREs to focus on higher-value work. The platform supports multi-tenancy, high availability, and disaster recovery, enabling scalable and resilient operations.

What are the core operational excellence goals of the Autonomous Platform?

The platform's operational excellence goals include reducing Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR), improving performance through auto-scaling and metrics-driven insights, and managing costs with accurate attribution and analytics. These goals ensure proactive issue identification, multi-region resiliency, automatic rollbacks, and cost management at every engineering level.

How does the Autonomous Platform empower SREs and engineers?

The platform enables SREs to add or modify functionalities without deep software engineering expertise, thanks to its modular, extensible design. It provides a developer portal with a service catalogue, service templates, and documentation as code, streamlining workflows and lowering the barrier to automation for engineers across teams.

What technology stack supports the Autonomous Platform at Palo Alto Networks?

The platform leverages Grafana for observability, StackStorm for event-driven automation, Open Policy Agent (OPA) for policy evaluation, and Backstage as an internal developer portal. Kubernetes is used as the runtime environment, with GitOps for infrastructure provisioning, ensuring scalability and containerized management across multiple clouds.

How does the Autonomous Platform manage cost attribution and analytics?

The platform implements robust cost management strategies, including accurate cost attribution to services and operations, cost analytics accessible to all engineering levels, and integration with external tools like Sedai for advanced cost optimization. This ensures visibility and control over cloud spending as services scale.

What is the role of modular and loosely coupled architecture in the Autonomous Platform?

The modular, loosely coupled architecture allows for easy addition, modification, or replacement of platform components without disrupting the entire system. This design supports extensibility across over 30 products, multi-team operations, and product-agnostic core layers, enhancing flexibility and adaptability.

How does the Autonomous Platform support multi-tenancy and high availability?

The platform is designed for multi-tenancy, allowing each product team to access specific capabilities while sharing broader resources. It also ensures high availability, disaster recovery, and break-glass modes, maintaining operational continuity and resilience across the organization.

What are the main challenges in serverless optimization at Palo Alto Networks?

Key challenges include right-sizing functions, managing concurrency, warming up functions to avoid delays, tracing errors in distributed environments, and controlling costs as functions scale. These issues required a solution that could optimize performance and manage costs efficiently, leading to the integration of Sedai.

How did Palo Alto Networks optimize Kubernetes workloads for cost and performance?

The optimization approach included right-sizing workloads (horizontal and vertical scaling), optimizing infrastructure, leveraging cost-effective purchasing strategies (on-demand, reserved instances, spot optimization), predictive auto-scaling, and release intelligence. Early results showed 2% savings in initial environments, with potential for up to 61% savings as optimizations are scaled.
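As a rough illustration of the right-sizing arithmetic behind these savings, the sketch below compares a workload's CPU request against observed peak usage and prices the difference. The headroom factor, hourly price, and function names are hypothetical, not part of Palo Alto Networks' actual tooling:

```python
# Hypothetical sketch: estimate savings from right-sizing a Kubernetes
# workload by comparing the requested CPU against observed peak usage.
# The 20% headroom factor and the per-core-hour price are illustrative.

def rightsizing_recommendation(requested_cpu: float, usage_samples: list[float],
                               headroom: float = 1.2) -> float:
    """Recommend a CPU request: observed peak plus a safety headroom."""
    peak = max(usage_samples)
    return min(requested_cpu, peak * headroom)

def estimated_savings(requested_cpu: float, recommended_cpu: float,
                      price_per_cpu_hour: float = 0.03) -> float:
    """Approximate monthly savings (730 hours) from lowering the request."""
    return (requested_cpu - recommended_cpu) * price_per_cpu_hour * 730

if __name__ == "__main__":
    samples = [0.4, 0.6, 0.9, 0.7]   # observed CPU cores over time
    rec = rightsizing_recommendation(4.0, samples)
    print(f"recommend {rec:.2f} cores, "
          f"save ${estimated_savings(4.0, rec):.2f}/month")
```

Multiplied across tens of thousands of microservices, even small per-workload deltas like this compound into the large aggregate savings described above.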

What were the results of serverless optimization at Palo Alto Networks?

Serverless optimization led to a 22% improvement in latency and an 11% overall reduction in cloud costs. Autonomous optimization of memory, CPU, and concurrency, along with release intelligence, enabled continuous improvements without constant manual intervention.

Why did Palo Alto Networks choose Sedai for autonomous optimization?

Palo Alto Networks selected Sedai after evaluating several tools because Sedai provided comprehensive, autonomous optimization for serverless and Kubernetes workloads. Sedai's platform offered continuous learning, real-time adjustments, production safety, and cost-effectiveness, outperforming traditional APM and cost management tools that were limited or reactive.

How does Sedai compare to APM, insight, and recommendation tools?

According to the comparison table in the article, Sedai's autonomous optimization provides comprehensive optimization, enhances production safety, acts in production, and supports continuous learning. In contrast, APM and insight tools lack these capabilities, and recommendation or rule-based tools offer only partial or reactive solutions.

What are the key strategies for Kubernetes cost optimization at Palo Alto Networks?

Key strategies include right-sizing workloads and infrastructure, leveraging cost-effective purchasing (on-demand, reserved, spot), predictive auto-scaling, and using release intelligence to adapt to new Kubernetes versions. These strategies aim for both immediate and ongoing optimization, with potential savings up to 61% as implementation scales.

What challenges are unique to Kubernetes cost management?

Unique challenges include accurate cost allocation by namespace and tags, reduced transparency due to Kubernetes abstraction, multi-cloud compatibility, translating recommendations into actions, and managing a high volume of low-cost opportunities. Autonomous tools like Sedai help address these challenges by executing optimizations automatically.

How does the Autonomous Platform ensure production safety during optimization?

The platform incorporates automatic rollbacks, multi-region resiliency, and auditability through tools like StackStorm and OPA. These features ensure that optimizations can be reversed if needed and that production safety is maintained during real-time changes.
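A minimal sketch of the automatic-rollback pattern described here: apply a change, watch an error-rate signal, and revert if it degrades. In the platform itself this workflow runs through tools like StackStorm; the metric source, threshold, and function names below are assumptions for illustration only:

```python
# Illustrative automatic-rollback guard: apply an optimization, then
# revert it if the observed error rate exceeds a safety threshold.

def apply_with_rollback(apply, rollback, read_error_rate, threshold=0.05):
    """Apply a change, then roll back if the error rate exceeds threshold."""
    apply()
    if read_error_rate() > threshold:
        rollback()
        return "rolled back"
    return "kept"

if __name__ == "__main__":
    state = {"version": "v1"}
    result = apply_with_rollback(
        apply=lambda: state.update(version="v2"),
        rollback=lambda: state.update(version="v1"),
        read_error_rate=lambda: 0.12,   # simulated post-change error rate
    )
    print(result, state["version"])     # the change is reverted
```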

What is the impact of integrating Sedai on Palo Alto Networks' operations?

Integrating Sedai has streamlined cost management for serverless and Kubernetes workloads, reduced operational overhead, and enabled the SRE team to focus on innovation. The autonomous optimization provided by Sedai has resulted in measurable cost savings and performance improvements.

How does the Autonomous Platform support continuous improvement and learning?

The platform leverages continuous learning from production data, real-time monitoring, and autonomous decision-making to adapt and optimize operations. Sedai's integration further enhances this by providing continuous, autonomous optimization based on application behavior and traffic patterns.

What is Sedai and what does it offer?

Sedai is an autonomous cloud management platform that optimizes cloud resources for cost, performance, and availability using machine learning. It eliminates manual intervention, reduces cloud costs by up to 50%, improves performance by reducing latency by up to 75%, and enhances reliability by proactively resolving issues. Sedai supports AWS, Azure, GCP, and Kubernetes environments. Learn more.

What are the key features of Sedai's autonomous optimization platform?

Sedai's platform offers autonomous optimization, proactive issue resolution, full-stack cloud coverage, smart SLOs, release intelligence, plug-and-play implementation, multiple modes of operation (Datapilot, Copilot, Autopilot), enhanced productivity, and safety-by-design. These features help reduce costs, improve performance, and ensure reliability. See solution briefs.

How does Sedai help reduce cloud costs?

Sedai reduces cloud costs by up to 50% through autonomous optimization, rightsizing workloads, and eliminating waste. For example, Palo Alto Networks saved $3.5 million, and KnowBe4 achieved 50% cost savings in production by using Sedai. Read the case study.

What business impact can customers expect from using Sedai?

Customers can expect significant cost savings (up to 50%), performance improvements (up to 75% latency reduction), operational efficiency (up to 6X productivity gains), reduced failed customer interactions (up to 50%), and enhanced reliability. These outcomes are supported by real-world case studies from companies like Palo Alto Networks, KnowBe4, and Belcorp. See resources.

How does Sedai compare to other cloud optimization tools?

Sedai stands out with 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack cloud coverage, release intelligence, and quick plug-and-play implementation. Unlike traditional tools that rely on static rules or manual adjustments, Sedai continuously learns and optimizes based on real application behavior. Learn more.

What pain points does Sedai address for SRE and engineering teams?

Sedai addresses pain points such as operational toil, fragmented tooling, manual optimization, balancing cost and reliability, and managing complexity in multi-cloud environments. It automates routine tasks, aligns engineering and cost efficiency goals, and provides actionable insights for continuous improvement. See solution briefs.

Who can benefit from using Sedai?

Sedai is designed for platform engineering, IT/cloud operations, technology leadership, site reliability engineering (SRE), and FinOps professionals. It is ideal for organizations with significant cloud operations across industries such as cybersecurity, IT, financial services, healthcare, travel, and e-commerce. See case studies.

What are some real-world success stories with Sedai?

Notable success stories include Palo Alto Networks saving $3.5 million and reducing Kubernetes costs by 46%, KnowBe4 achieving 50% cost savings, Belcorp reducing AWS Lambda latency by 77%, and Freshworks improving release quality. Read more case studies.

How easy is it to implement Sedai?

Sedai offers a quick setup process: 5 minutes for general use cases and up to 15 minutes for scenarios like AWS Lambda. It features agentless integration via IAM, personalized onboarding, and extensive documentation. A 30-day free trial is available for risk-free evaluation. Get started.

What integrations does Sedai support?

Sedai integrates with monitoring and APM tools (Cloudwatch, Prometheus, Datadog, Azure Monitor), Kubernetes autoscalers (HPA/VPA, Karpenter), IaC and CI/CD tools (GitLab, GitHub, Bitbucket, Terraform), ITSM (ServiceNow, Jira), notification tools (Slack, Microsoft Teams), and runbook automation platforms. Learn more.

What security and compliance certifications does Sedai have?

Sedai is SOC 2 certified, demonstrating adherence to stringent security and compliance standards for data protection. See security details.

Where can I find technical documentation for Sedai?

Comprehensive technical documentation is available at docs.sedai.io/get-started. Additional resources, including case studies and datasheets, can be found on the resources page.

What industries use Sedai?

Sedai is used in industries such as cybersecurity (Palo Alto Networks), IT (HP), financial services (Experian, CapitalOne Bank), security awareness training (KnowBe4), travel (Expedia), healthcare (GSK), car rental (Avis), retail/e-commerce (Belcorp), SaaS (Freshworks), and digital commerce (Campspot). See all case studies.

Who are some of Sedai's notable customers?

Notable customers include Palo Alto Networks, HP, Experian, KnowBe4, Expedia, CapitalOne Bank, GSK, and Avis. These organizations trust Sedai to optimize their cloud environments and improve operational efficiency.

What feedback have customers given about Sedai's ease of use?

Customers appreciate Sedai's quick plug-and-play setup (5–15 minutes), agentless integration, personalized onboarding, detailed documentation, and risk-free 30-day trial. These features contribute to positive feedback regarding the platform's simplicity and efficiency. Learn more.


Autonomous Optimization at Palo Alto Networks


Sedai

Content Writer

September 24, 2024


This article is based on an edited transcript of a talk given at Sedai's autocon/23 conference by Ramesh Nampelly.

Palo Alto Networks has become a leading force in cybersecurity by consistently innovating and optimizing its infrastructure to meet modern challenges. As the company’s product portfolio grows and the demand for reliability increases, it has become imperative to streamline operations, reduce costs, and enhance performance. This has led to the development of an Autonomous Platform designed to optimize Site Reliability Engineering (SRE) operations and manage cloud infrastructure efficiently.

Recognizing the Challenges of Rapid Cloud Growth

The development of the Autonomous Platform stemmed from several challenges that arose as we rapidly grew over the past few years. This platform has been instrumental in addressing these challenges, particularly in the areas of SRE and operational excellence. While it benefits the entire engineering team, today I’ll focus on its impact on SREs and production operations.

Let’s dive into the challenges that led to the creation of this platform.

Challenges Brought by Rapid Growth


During the pandemic, Palo Alto Networks experienced a 5x growth, which introduced significant operational and infrastructural challenges. This rapid expansion highlighted gaps in our systems and processes, which we needed to address swiftly to maintain the reliability and efficiency of our services.

One of the immediate effects of our rapid growth was the dramatic increase in our cloud spending. As we moved more workloads from data centers to the cloud, costs grew sharply, requiring us to balance financial optimization with maintaining the reliability of our services. Ensuring this balance became a shared responsibility between our SRE and FinOps teams, adding additional pressure on engineering.

With this increased workload, engineers began to experience fatigue. The demands of 24/7 operations, coupled with the sheer scale of our services, led to burnout. It was clear that without a solution, the growing responsibilities would continue to overwhelm the teams responsible for maintaining operational stability.

The Scale of Our Infrastructure


Palo Alto Networks operates in a unique environment compared to other companies. We don't just offer a single SaaS solution but manage a wide array of products and services:

  • We support 34 different products, each with its own technical requirements and customer needs.
  • These products are backed by over 50,000 microservices, which need to be maintained, optimized, and scaled continuously.
  • Our infrastructure spans 3 public cloud platforms, adding complexity to the management of our services.
  • We also manage 6 colocation data centers, which handle various production workloads outside of the public cloud.

This complexity meant that our SRE and operations teams faced an enormous challenge in keeping services running smoothly and efficiently, while also managing costs.

The Impact on SRE and Production Operations

As we scaled, maintaining our service-level agreements (SLAs) with customers became more demanding. Our teams had to ensure high availability for critical services, all while balancing the costs associated with our cloud usage. This was particularly challenging as we handled increasing traffic volumes and workloads across a diverse product suite.

In addition to this, the teams faced growing pressure to collaborate with FinOps to find ways to reduce cloud expenditure without compromising on service reliability. Managing this balance added a new layer of responsibility to the teams already tasked with maintaining operational excellence.

This heavy workload and constant pressure led to burnout among engineers. Working around the clock to support services, many team members struggled to maintain the necessary pace, which further underscored the need for a more efficient, automated approach to managing our infrastructure.

Key Challenges and Goals for Autonomous Platform

As we scaled Palo Alto Networks, one of the core areas we focused on optimizing was our Site Reliability Engineering (SRE) function. The complexity of our environment, combined with rapid growth, exposed several key challenges that our SREs were facing. Addressing these challenges became a priority as they impacted both productivity and operational efficiency.

Let’s walk through the most pressing SRE challenges that the Autonomous Platform is designed to solve.

Addressing the Problem of Toil

The first and perhaps most fundamental challenge is toil. In the context of SRE, toil refers to repetitive tasks that engineers must perform manually again and again. These tasks, often operational in nature, do not add long-term value and can lead to significant stress and burnout among the team. Tasks that could potentially be automated end up being performed manually, which not only wastes valuable time but also causes frustration among engineers who feel like they are unable to contribute to higher-value work.

Toil is a major source of inefficiency, and reducing it is essential for improving the well-being of our SRE teams as well as overall system reliability.

Isolated and Fragmented Tooling

Another significant issue we’ve encountered is the use of isolated, disconnected tools across teams. Engineers often develop tools on an ad-hoc basis to meet immediate needs, but without the typical software development processes—like versioning, CI/CD pipelines, or guardrails. This has led to a "kitchen sink" of tools, many of which aren’t properly maintained or integrated into a cohesive system.

The result is an environment where new engineers find it difficult to navigate and understand the tooling landscape. Furthermore, these fragmented tools can sometimes introduce errors in production, adding another layer of operational risk.

Tool Management Overhead

Over time, managing this growing collection of isolated tools has added considerable overhead. As new tools are added without careful management, their complexity accumulates. This introduces technical debt, where maintaining these tools requires additional effort, draining time and resources from the SRE teams. Without proper governance, what starts as a helpful tool for solving a specific problem can become a liability over time.

Complexity Across Multiple Products

At Palo Alto Networks, we manage over 30 different products, each with its own tech stack, architecture, and unique customer problems to solve. This diversity creates a significant challenge for our SRE teams, as supporting one product often does not translate into expertise in another. An engineer who is proficient in maintaining one product may find themselves starting from scratch when working with a different one, leading to inefficiencies and gaps in operational coverage.

Scaling the SRE Team

Finally, as our customer base and workloads continue to expand, scaling the SRE team linearly is simply not feasible. The rate of growth in our operations far outpaces the ability to hire and onboard new engineers. This means that without a robust platform to help manage the increasing complexity, we risk overloading our existing SRE teams, exacerbating the problems of toil and burnout.

Autonomous Platform

The Autonomous Platform is rooted in a clear vision and mission, designed to revolutionize the way Site Reliability Engineers (SREs) and production-supporting engineers work by leveraging production data in an autonomous manner. This allows organizations to scale their operations without a linear increase in resources, effectively supporting 10x customer growth.

Vision

The Autonomous Platform envisions a future where production data is fully and autonomously utilized to provide "best-in-class SRE support." The platform aims to enable sub-linear growth in resource consumption while supporting 10x customer scale. By automating many routine processes, the platform eliminates manual interventions, allowing engineers to focus on more strategic tasks.

Mission

The platform’s mission is to develop tools that empower SREs and production engineers by providing autonomous capabilities. These capabilities are designed to boost productivity, efficiency, and overall operational quality. By eliminating the repetitive toil often associated with daily operations, the platform helps engineers maintain higher service reliability and quality.

Operational Excellence Goals

To ensure the successful implementation of the platform, four core operational excellence goals were established:

  1. Reduce Mean Time to Detect (MTTD)
     • Golden Signals: Implement real-time monitoring of system health.
     • Anomaly to Incident: Quickly detect anomalies and automatically convert them into incidents before the customer even notices an issue.
     • Proactive Issue Identification: Shift towards proactive problem identification, reducing the potential for customer-facing disruptions.
  2. Reduce Mean Time to Repair (MTTR)
     • Multi-Region Resiliency: Ensure services are resilient across multiple regions, reducing downtime.
     • Automatic Rollbacks: Develop and deploy automatic rollback systems to mitigate risks when issues arise.
     • Auto Remediations: Automate the remediation process to minimize manual interventions, improving resolution times.
  3. Improve Performance
     • Auto Scaling: Introduce dynamic scaling capabilities to handle fluctuating demands without delays.
     • Metrics and Insights: Gather comprehensive metrics and insights from production environments, enabling smarter and autonomous decision-making to improve overall system performance.
     • Sub-Millisecond Experience: Continuously strive to provide a near-zero latency experience for end users.
  4. Manage Costs
     • Cost Attribution: Accurately attribute costs to specific services or operations to gain better visibility into where resources are being spent.
     • Cost Management: Implement robust cost management strategies, ensuring that as services scale, costs remain under control.
     • Cost Analytics: Make cost data accessible at every level of the engineering team, including developers, so that all stakeholders are aware of how their work impacts operational costs.
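To make the MTTD goal more concrete, here is a minimal sketch of the "anomaly to incident" idea: flag a golden-signal sample that deviates sharply from its baseline and open an incident record. The sigma threshold, field names, and incident shape are illustrative assumptions, not the platform's actual implementation:

```python
# Minimal anomaly-to-incident sketch: a golden-signal sample is anomalous
# when it deviates from the historical baseline by more than `sigma`
# standard deviations; anomalies are converted into incident records.

from statistics import mean, stdev

def is_anomalous(history: list[float], value: float, sigma: float = 3.0) -> bool:
    """True if value is more than `sigma` standard deviations from the mean."""
    if len(history) < 2:
        return False
    return abs(value - mean(history)) > sigma * stdev(history)

def anomaly_to_incident(service: str, signal: str, value: float) -> dict:
    """Convert a detected anomaly into an incident record for paging."""
    return {"service": service, "signal": signal, "value": value, "status": "open"}

if __name__ == "__main__":
    latencies = [101, 99, 102, 100, 98]   # ms, steady baseline
    current = 250                         # ms, sudden spike
    if is_anomalous(latencies, current):
        print(anomaly_to_incident("checkout", "latency_ms", current))
```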

The Autonomous Platform's core purpose is to help organizations maintain service reliability and performance as they scale, all while managing operational costs. The goals outlined above ensure that organizations are prepared to detect and address issues faster, resolve them efficiently, and provide a seamless experience for end users—all while maintaining tight control over operational costs.

By integrating these capabilities into the platform, SREs, developers, and engineers alike can better understand the impact of their work on infrastructure and costs, ensuring that resources are used optimally.

Architectural Principles and Framework Goals

When designing and building a platform intended to support modern enterprise needs, a set of clear architectural principles and foundational goals is necessary. At Palo Alto Networks, the Autonomous Platform has been built with a focus on providing a resilient, scalable, and modular architecture. Here, we’ll explore the platform’s key goals, approaches, and technology stack, along with the key capabilities of the platform that have been developed to streamline production and operations.

The first step in developing an enterprise-grade platform is establishing clear architectural goals. At Palo Alto Networks, the following goals were prioritized:

  • Framework and Foundation: We needed a robust foundation that allows adding new functionalities with clarity and safety. This framework ensures that engineers can contribute seamlessly.
  • Empower SREs: The platform empowers Site Reliability Engineers (SREs) by enabling them to add or modify functionalities even without in-depth software engineering knowledge. This lowers the barrier to automation, allowing SREs to create workflows without significant coding expertise.
  • Extensibility: The platform must be extensible across products and specific to the needs of each product. With over 30 products at Palo Alto Networks, the platform must support multi-team operations.
  • Multi-Tenancy: Multi-team tenancy is essential. Each product team should have access to their specific platform capabilities while sharing the broader platform’s resources.
  • Resiliency & High Availability: Ensuring the platform supports high availability, disaster recovery (DR), and break-glass modes is crucial to maintaining operational continuity.

Key Architectural Approaches

The architecture of the Autonomous Platform adheres to a modular and loosely coupled design, ensuring flexibility and adaptability across various products. Below are some of the core approaches that guide the platform’s structure:

  • Modular, Loosely Coupled: The platform is built to be modular, making it easier to add, modify, or replace individual components without affecting the entire system.
  • Product-Agnostic Layers: While certain product-specific requirements exist, the core of the platform remains product-agnostic, meaning it can be leveraged by various teams across different products.
  • Buy vs Build: After evaluating existing solutions, the team found that there were no commercial offerings that fully addressed their needs. As a result, the decision was made to build the core platform leveraging open-source frameworks where applicable.
  • Kubernetes Runtime: The platform uses Kubernetes as its runtime environment, allowing scalability and containerized management across multiple cloud platforms. Provisioning is handled using GitOps, ensuring infrastructure as code.
  • Contribution Model: Instead of a centralized team managing all platform development, an inner sourcing or contribution model is adopted. This enables engineers across different teams to contribute and enhance platform functionalities.

Core Technology Stack

The choice of core technologies underpins the platform's architecture, providing essential capabilities for observability, automation, and policy management. Here’s a breakdown of the technologies in use:

  • Grafana Stack: For log, metric, and trace collection, Grafana was selected as the best fit for observability. It supports visualization, querying, and alert generation.
  • StackStorm: For ad-hoc actions and event-driven automation, StackStorm was chosen. This allows SREs to run scripts and workflows in production environments while maintaining auditability.
  • Open Policy Agent (OPA): OPA provides policy evaluation capabilities, ensuring security and compliance across the platform.
  • Backstage: An internal developer portal framework built on micro front-ends, Backstage offers a simplified interface for engineers to interact with platform capabilities. Customized to fit Palo Alto Networks' specific needs, Backstage plays a crucial role in enabling developers to access documentation, templates, and tools.
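As a rough illustration of the guardrails such a policy layer enforces before an ad-hoc action runs in production, the pure-Python stand-in below mimics an OPA-style allow/deny decision. In the real platform this logic would be expressed as a Rego policy evaluated by OPA; the action names, role, and field names here are hypothetical:

```python
# Illustrative stand-in for an OPA-style policy decision: allow an
# operational action only if it is on an approved list, the actor holds
# the required role, and no freeze is in effect. All fields are made up.

APPROVED_ACTIONS = {"restart_pod", "scale_deployment", "rotate_certificate"}

def allow(input_doc: dict) -> bool:
    """Policy decision: True if the request may proceed."""
    return (
        input_doc.get("action") in APPROVED_ACTIONS
        and "sre" in input_doc.get("roles", [])
        and not input_doc.get("change_freeze", False)
    )

if __name__ == "__main__":
    print(allow({"action": "scale_deployment", "roles": ["sre"]}))  # True
    print(allow({"action": "drop_database", "roles": ["sre"]}))     # False
```

Keeping the decision in a central policy engine, rather than scattered across scripts, is what makes every automated action auditable and consistently governed.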

Platform Capabilities Overview


The Autonomous Platform brings together resource management, infrastructure management, and production management under a unified framework. This integration is achieved via the Developer Portal, which offers:

  • Service Catalogue: Centralized repository for services, simplifying access and deployment.
  • Service Templates & Golden Paths: Predefined workflows and templates to streamline processes and ensure best practices.
  • Documentation as Code: Seamless access to code-based documentation for consistency and efficiency.

Other key capabilities of the platform include:

  • Infrastructure Management: Handling policy management, environment management, and multi-cloud orchestration through cost-efficient means.
  • Resource Management: Managing cloud infrastructure (e.g., AWS, GCP) and network resources.
  • Production Management: Ensuring observability, incident management, and providing the right auto-remediation mechanisms to maintain service reliability.

The Autonomous Platform built by Palo Alto Networks is not only a technical achievement but a forward-thinking solution that combines scalability, extensibility, and ease of use for engineers. With a solid foundation in modular design and a carefully chosen tech stack, it empowers Site Reliability Engineers and developers alike to enhance system performance, manage costs, and automate repetitive tasks. The continued evolution of this platform ensures that as enterprise demands grow, the tools to support them will scale efficiently and effectively.

Integration of Sedai for Cost Management

At Palo Alto Networks, cost management has become a critical part of our Autonomous Platform. While much of the platform is developed in-house, we’ve adopted an extensible framework that allows integration with external vendors when it makes sense. This helps us focus engineering efforts where they are most needed while leveraging third-party solutions for specific needs.

Sedai Integration for Cost Optimization

One key area of integration is cost management for serverless and Kubernetes workloads. While open-source tools like OpenCost handle basic cost tracking, optimizing serverless costs presented challenges. After evaluating various solutions, we integrated Sedai to optimize both cost and performance for our serverless operations.

  • Why Sedai? Sedai provides automation and insights that were hard to achieve with open-source tools, helping us reduce costs and improve performance.

We are now extending Sedai to manage Kubernetes workloads, ensuring cost efficiency as we scale.

By integrating Sedai, we've streamlined our cost management approach, allowing us to focus on innovation while keeping operations efficient and cost-effective.

Challenges and Results in Serverless Optimization

When managing serverless environments, we faced several key challenges that demanded constant attention. The dynamic nature of serverless functions, such as AWS Lambda and Google Cloud Functions, made optimizing performance and controlling costs particularly tricky. Here's a quick summary of the main issues:

Serverless Key Challenges

  • Right-sizing functions: Allocating just the right amount of resources for each function without over-provisioning.
  • Managing concurrency: Handling multiple instances without bottlenecks.
  • Warming up functions: Deciding when and how to keep functions warm to avoid delays.
  • Tracing errors: Diagnosing issues in a distributed, stateless environment.
  • Cost control: Keeping costs low as functions scale.

These challenges repeated with each new release, stretching our SRE team’s bandwidth. We needed a solution that could optimize performance while managing costs efficiently.
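To make the right-sizing challenge above concrete, here is a minimal sketch of how a function's memory setting could be chosen by sweeping configurations and picking the cheapest one that still meets a latency target. The prices and profiling numbers are illustrative, not actual AWS rates or Palo Alto Networks measurements.

```python
# Hypothetical sketch: pick the cheapest memory setting for a Lambda-style
# function that still meets a latency SLO. Cost is modeled on GB-second
# pricing; duration tends to shrink as memory grows because CPU scales
# with memory. All numbers below are illustrative.

PRICE_PER_GB_SECOND = 0.0000166667  # illustrative on-demand rate

# (memory_mb, observed_avg_duration_ms) from hypothetical profiling runs.
profile = [
    (128, 2400),
    (256, 1150),
    (512, 600),
    (1024, 320),
    (2048, 300),
]

def cost_per_million(memory_mb: int, duration_ms: float) -> float:
    """Cost per 1M invocations, from GB-seconds consumed per call."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND * 1_000_000

def rightsize(profile, latency_slo_ms: float):
    """Cheapest (memory, duration) configuration meeting the SLO."""
    eligible = [(m, d) for m, d in profile if d <= latency_slo_ms]
    return min(eligible, key=lambda md: cost_per_million(*md))

best = rightsize(profile, latency_slo_ms=700)
print(best)  # (512, 600): 512 MB is cheaper than 1024 MB despite running longer
```

Doing this sweep by hand for every function, on every release, is exactly the toil described above.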

Why We Chose Sedai for Autonomous Optimization

After reviewing various application performance management (APM) and cost management tools, we concluded that traditional solutions were either too limited or too reactive for our platform's needs. Sedai stood out as the best vendor to integrate with our core platform for cost optimization, thanks to its autonomous approach. Its platform provided:

  • Comprehensive optimization: Continuous learning and real-time adjustments.
  • Production safety: Enhancing stability while making real-time optimizations.
  • Cost-effectiveness: Reducing costs without sacrificing performance.

We’ve seen positive results from integrating Sedai and plan to expand its use across our platform to further streamline serverless operations while keeping costs under control.

Challenges and Approach for Kubernetes Optimization

Cloud cost optimization is crucial, particularly in Kubernetes and serverless environments, where complexities can stack up quickly. In this section, we outline a practical approach to optimizing costs and improving performance, covering key strategies, challenges, and optimization techniques, starting with Kubernetes and moving into serverless functions.

Key Strategies for Kubernetes Optimization

When managing Kubernetes clusters, one major task is optimizing resource allocation without impacting performance. Here’s a breakdown of the optimization approach:

  1. Rightsize Workloads
     • Capabilities: Horizontal and vertical scaling.
     • Impact: Achieves around 20–30% savings by scaling infrastructure up or down based on actual usage.
  2. Rightsize Infrastructure
     • Capabilities: Focus on the types of infrastructure and groups.
     • Impact: Results in 15–25% cost savings by configuring the infrastructure optimally.
  3. Purchase at Lowest Cost
     • Capabilities: Leveraging on-demand, reserved instances (RIs), and spot optimization.
     • Impact: Potential savings of up to 72–90% through cost-effective purchasing strategies.
  4. Adapt to Traffic Changes
     • Capabilities: Predictive auto-scaling that adjusts based on incoming traffic.
     • Impact: Ensures performance needs are met at peak times without over-provisioning resources.
  5. Adapt to New Releases
     • Capabilities: Release intelligence to help adapt to newer Kubernetes versions and changes.
     • Impact: Continuous optimization by tracking the performance of new releases over time.

This approach combines both proactive and reactive strategies, ensuring that both initial and ongoing optimizations are addressed, adapting dynamically to system demands.
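The workload-rightsizing step can be sketched as follows: derive a CPU request recommendation from observed usage (a high percentile plus headroom) and estimate the savings against the currently configured request. This is not Sedai's actual algorithm, just a minimal illustration of the idea; the sample data and headroom factor are hypothetical.

```python
# Hypothetical sketch of workload rightsizing: recommend a CPU request
# from observed usage (p95 plus headroom) and estimate fractional savings
# versus the currently configured request.
from statistics import quantiles

def p95(samples):
    """95th percentile of usage samples (millicores)."""
    return quantiles(samples, n=100)[94]

def recommend_request(samples, current_request_m, headroom=1.15):
    """Recommended CPU request (millicores) and estimated fractional savings."""
    recommended = p95(samples) * headroom
    savings = max(0.0, 1 - recommended / current_request_m)
    return recommended, savings

# Hypothetical usage samples for a pod configured with a 1000m request.
usage = [120, 150, 140, 160, 165, 175, 180, 190, 200, 210] * 10
rec, savings = recommend_request(usage, current_request_m=1000)
# The pod rarely uses more than ~210m, so most of the 1000m request is waste.
```

In practice a recommender would also consider memory, burst behavior, and limits, but the shape of the calculation is the same.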

Key Challenges in Kubernetes Cost Management

While Kubernetes offers a scalable and flexible infrastructure, there are specific challenges associated with managing its costs:

  • Allocation of Total Costs by Namespace and Tags: Proper tagging is necessary to accurately allocate costs within the clusters, especially when multiple namespaces are involved.
  • Levels of Abstraction: Kubernetes abstracts many aspects of resource allocation, reducing the transparency of cost tracking. This makes it difficult to pinpoint where expenses are accumulating without appropriate monitoring tools.
  • Multi-Cloud Compatibility: Managing Kubernetes across multiple cloud providers (AWS, GCP, Azure) presents additional complexity, as cost structures vary across platforms.
  • From Recommendations to Actions: Many tools offer cost-saving recommendations, but translating these into actionable steps can be a hurdle. Autonomous tools are needed to execute these recommendations.
  • High Volume of Low-Cost Opportunities: Open-source tools often highlight small, low-value cost savings opportunities. Individually, they may seem insignificant, but cumulatively, they can have a significant impact on the overall budget.
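The first challenge above, allocating costs by namespace and tags, can be illustrated with a small sketch: roll up per-pod cost records by their namespace label, and surface spend that cannot be attributed because labels are missing. Field names and figures are illustrative.

```python
# Hypothetical sketch of cost allocation by namespace: aggregate per-pod
# cost records by namespace label, routing unlabeled spend to a visible
# "unallocated" bucket so tagging gaps are easy to spot.
from collections import defaultdict

records = [
    {"pod": "api-7f9", "namespace": "payments", "cost_usd": 42.0},
    {"pod": "web-1c2", "namespace": "frontend", "cost_usd": 18.5},
    {"pod": "job-9aa", "namespace": None,       "cost_usd": 7.25},  # untagged
    {"pod": "api-8b1", "namespace": "payments", "cost_usd": 31.0},
]

def allocate_by_namespace(records):
    """Total cost per namespace; records without a label land in 'unallocated'."""
    totals = defaultdict(float)
    for r in records:
        totals[r["namespace"] or "unallocated"] += r["cost_usd"]
    return dict(totals)

totals = allocate_by_namespace(records)
print(totals)  # {'payments': 73.0, 'frontend': 18.5, 'unallocated': 7.25}
```

The size of the "unallocated" bucket is a useful health metric for tagging discipline across clusters.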

Serverless Optimization: Results and Strategies

Serverless architectures also benefit from similar optimization strategies. The following results have been observed:

  • Latency Improvement: Achieved a 22% improvement in latency.
  • Cost Reduction: Observed an 11% overall reduction in cloud costs, thanks to focused optimization efforts.

Serverless cost optimization follows a structured, autonomous approach:

  1. Optimize Memory/CPU
     • Autonomous Optimization: Automated processes to fine-tune memory and CPU usage.
     • Cost Impact: Reduces costs by around 20%.
  2. Manage Concurrency
     • Autonomous Concurrency Management: Leverages traffic prediction and concurrency management to ensure efficient scaling.
     • Cost Impact: Adds an additional 10% in cost reduction.
  3. Adapt to New Releases
     • Release Intelligence: Ensures smooth transitions and optimizations when adopting new service releases or infrastructure upgrades.

This approach helps achieve continuous, automated optimization without requiring constant manual intervention. By addressing the core areas of memory, CPU, and concurrency, businesses can see noticeable improvements in both performance and cost-efficiency.
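The concurrency-management step can be grounded in a standard result, Little's law: the concurrency a function needs is roughly its request rate multiplied by its average duration. A minimal sketch, with an illustrative traffic forecast and safety buffer (not production figures):

```python
# Hypothetical sketch of concurrency sizing via Little's law:
#   required concurrency ≈ requests_per_second × average_duration_seconds,
# plus a safety buffer for forecast error.
import math

def required_concurrency(rps: float, avg_duration_s: float, buffer: float = 1.2) -> int:
    """Concurrent executions needed to absorb the forecast load plus headroom."""
    return math.ceil(rps * avg_duration_s * buffer)

# Forecast peak: 500 requests/sec at 300 ms average duration.
peak = required_concurrency(rps=500, avg_duration_s=0.3)
print(peak)  # 180 concurrent executions (150 from Little's law, plus 20% buffer)
```

An autonomous system would recompute this continuously from predicted traffic rather than from a fixed forecast, scaling concurrency down again when the peak passes.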

Cost and Performance Impact

As we embark on our Kubernetes cost optimization journey, early results provide both insight and encouragement. Currently, with a limited number of Kubernetes environments onboarded, we have realized approximately 2% in cost savings. While this may seem modest, it is important to note that we are only in the early stages of this process, and we anticipate significant improvements as we continue.

Key Insights from Current Kubernetes Cost Optimization:

  • Current Savings: We have achieved approximately 2% savings in the Kubernetes environments that have been optimized so far.
  • Pending Opportunities: Based on the recommendations provided, once we fully implement the optimizations across our entire infrastructure, we are expecting potential cost savings of up to 61%.

These early results show promise, and by scaling these strategies across all clusters, we aim to unlock even more substantial savings.

Conclusion

Palo Alto Networks has successfully addressed the challenges brought by rapid growth and cloud infrastructure expansion through the development of its Autonomous Platform. This platform has streamlined SRE operations, reduced costs, and improved performance by automating repetitive tasks and optimizing resource management. By integrating tools like Sedai for serverless and Kubernetes optimization, the company has further enhanced cost efficiency while maintaining high service reliability. As Palo Alto Networks continues to evolve, the Autonomous Platform plays a crucial role in ensuring scalable, resilient operations that meet the demands of a growing customer base.

Feature Comparison of Optimization Tooling

The comparison below contrasts tool categories (APM tools, insight tools, recommendation tools, and rule-based static optimization) with autonomous optimization via Sedai:

  • Comprehensive optimization: varies across APM and insight tools; provided by Sedai's autonomous optimization.
  • Enhances production safety: provided by Sedai's autonomous optimization.
  • Acts in production: provided by Sedai's autonomous optimization.
  • Continuous learning: provided by Sedai's autonomous optimization.