The Autonomous Cloud Optimization Spectrum: 6 Levels of Autonomy

Last updated

November 20, 2024


Introduction

As organizations increasingly adopt cloud-native architectures and microservices, the complexity of managing these environments has grown exponentially. Traditional approaches to cloud management are struggling to keep pace with this evolution, leading to a host of challenges that threaten to undermine the very benefits that drew us to the cloud in the first place. It's time for a paradigm shift in how we approach cloud optimization and management – enter the era of autonomous cloud systems.

The Challenges of Modern Cloud Operations

Today's cloud-native applications, built on microservices architectures, offer unprecedented flexibility and scalability. However, they also introduce a level of complexity that is pushing traditional operations teams to their limits and creating three critical challenges:

  • Operational Toil: Engineers are drowning in low-value, repetitive tasks that consume precious time and resources. According to Google's Site Reliability Engineering (SRE) principles, teams should aim to spend no more than 50% of their time on toil. Yet many organizations far exceed this threshold, and in an age of AI and automation, any repetitive toil deserves scrutiny.
  • Rising Costs: As cloud adoption grows, so does cloud waste. Industry reports suggest that up to 27% of cloud spend is wasted, which amounts to roughly $95 billion in 2024 alone, based on an estimated combined IaaS and PaaS spend of $352 billion. This level of inefficiency is unsustainable and directly impacts our bottom lines. Gartner VP Tony Iams has noted that the cost benefits of modern container architectures can be lost if services like Kubernetes are not managed effectively.
  • Availability and Performance Issues: Despite significant investments in cloud infrastructure, many organizations continue to struggle with service interruptions and performance degradation, leading to failed customer interactions (FCIs) and lost revenue.  With reliance on manual optimizations to save costs, human error can also creep in and cause incidents.
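The waste estimate quoted above is simple arithmetic, and it can be handy to rerun it with your own numbers. A minimal sketch, using only the two figures from the text (the function and variable names are illustrative, not from any report):

```python
# Figures quoted in the text above.
TOTAL_SPEND_BILLION = 352  # estimated combined IaaS and PaaS spend for 2024, in $B
WASTE_RATE = 0.27          # "up to 27%" of cloud spend wasted


def estimated_waste(total_billion: float, rate: float) -> float:
    """Return the wasted spend, in billions, for a given total and waste rate."""
    return total_billion * rate


waste_billion = estimated_waste(TOTAL_SPEND_BILLION, WASTE_RATE)
print(f"Estimated waste: ${waste_billion:.0f}B")
```

Swapping in your own annual cloud bill and an assumed waste rate gives a rough upper bound on what optimization could recover.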

The Promise of Autonomous Cloud Systems

To address these challenges, we need to embrace a new approach made possible by the emergence of powerful AI systems: autonomous cloud systems. But what exactly does "autonomous" mean?

An autonomous cloud system is an agent or platform capable of performing complex cloud management tasks with substantially reduced human intervention for extended periods. These systems leverage artificial intelligence and machine learning to understand the environment, make decisions, and take actions independently. They may operate in either copilot mode (the AI makes recommendations and humans approve them) or autopilot mode (humans set the higher-level goals and the AI implements them). Autonomous systems differ from traditional automation (e.g., Terraform, autoscalers), which is based on a series of “if/then” rules. Instead, autonomous systems use AI that can learn and adapt to new information.
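To make the copilot and autopilot distinction concrete, here is a minimal sketch of how an execution mode might gate a recommendation. Everything in it (the `Mode`, `Recommendation`, and `handle` names, and the example service) is hypothetical and only illustrates the idea; it is not how any particular platform implements it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    COPILOT = "copilot"      # AI recommends, a human approves
    AUTOPILOT = "autopilot"  # humans set goals, AI implements


@dataclass(frozen=True)
class Recommendation:
    target: str  # hypothetical resource name
    action: str  # hypothetical change description


def handle(rec: Recommendation,
           mode: Mode,
           apply: Callable[[Recommendation], None],
           approve: Callable[[Recommendation], bool]) -> bool:
    """Apply rec according to the execution mode; return True if it was applied."""
    if mode is Mode.COPILOT and not approve(rec):
        return False  # the human declined, so nothing changes
    apply(rec)
    return True


applied = []
rec = Recommendation(target="checkout-service", action="reduce memory limit")

# Copilot: the human reviewer declines, so nothing is applied.
handle(rec, Mode.COPILOT, applied.append, approve=lambda r: False)
# Autopilot: the approval step is skipped entirely.
handle(rec, Mode.AUTOPILOT, applied.append, approve=lambda r: False)
print(len(applied))
```

The only difference between the two modes in this sketch is whether the approval callback is consulted, which mirrors the definitions above.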

The benefits of autonomous cloud operations are compelling:

  • Dramatic reduction in operational toil, freeing up engineering talent for high-value work
  • Optimized cloud spend through continuous, AI-driven resource allocation
  • Improved service availability and performance through proactive management and rapid issue resolution

The Spectrum of Cloud Autonomy

To understand the journey towards autonomous cloud management, it's helpful to borrow a framework from another industry that's rapidly advancing in autonomy: the automotive sector. The Society of Automotive Engineers (SAE) has defined six levels of driving automation.  The underlying philosophy can be adapted to cloud management, with an important distinction between automation (at levels 1-3) and autonomy (levels 4-6).

Here are the levels we propose:

  • Level 0: No Automation - All cloud management tasks are performed manually by human operators.
  • Level 1: Observability - Operators have access to metrics from an APM or observability tool to gain insights into the system and receive pre-programmed alerts, but that platform does not take actions.
  • Level 2: Operator Assistance - Basic monitoring tools are in place, but all decisions and actions are made by humans.
  • Level 3: Automation - Routine tasks are automated using predefined if/then rules. While this level reduces some manual work, it lacks the intelligence to adapt to complex or unforeseen situations.
  • Level 4: Partial Autonomy (Copilot) - AI systems can perform many tasks independently and make some decisions, but human oversight is still required. The AI acts as a copilot, providing intelligent recommendations that humans can choose to implement.
  • Level 5: High Autonomy (Autopilot) - The AI system handles most cloud operations autonomously, making intelligent decisions based on complex data analysis. Human intervention is only needed in exceptional circumstances.
  • Level 6: Full Autonomy (Advanced Autopilot) - The AI system manages all aspects of cloud operations without any human intervention, adapting to new situations and optimizing performance across all conditions.
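If you want to tag teams or services with their position on this spectrum, the levels map naturally onto an ordered enumeration. The sketch below is one possible encoding; the `AutonomyLevel` and `is_autonomous` names are our own, chosen for illustration.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    NO_AUTOMATION = 0
    OBSERVABILITY = 1
    OPERATOR_ASSISTANCE = 2
    AUTOMATION = 3
    PARTIAL_AUTONOMY = 4  # Copilot
    HIGH_AUTONOMY = 5     # Autopilot
    FULL_AUTONOMY = 6     # Advanced Autopilot


def is_autonomous(level: AutonomyLevel) -> bool:
    """Levels 4-6 count as autonomy; levels 1-3 are automation at most."""
    return level >= AutonomyLevel.PARTIAL_AUTONOMY


for level in AutonomyLevel:
    print(level.value, level.name, is_autonomous(level))
```

Using `IntEnum` keeps the levels comparable, so a check such as "is this service at Level 4 or above" stays a simple comparison.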


Here’s the spectrum in table format:


It's crucial to understand the difference between automation and autonomy in this context:

  • Automation, represented by Level 3, involves executing predefined sequences of actions based on set rules. While useful for routine tasks, it lacks the flexibility to handle complex, dynamic cloud environments effectively.
  • Autonomy, on the other hand, leverages artificial intelligence to make decisions and take actions based on a deep understanding of the environment. Autonomous systems can learn, adapt, and optimize in ways that go far beyond simple rule-based automation.
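For contrast, this is roughly what Level 3 rule-based automation looks like in code. It is a minimal sketch: the function name, the thresholds, and the replica bounds are all hypothetical values picked for illustration.

```python
def rule_based_replicas(current: int,
                        cpu_utilization: float,
                        scale_up_at: float = 0.80,
                        scale_down_at: float = 0.30,
                        min_replicas: int = 1,
                        max_replicas: int = 10) -> int:
    """Return the next replica count using fixed if/then thresholds."""
    if cpu_utilization > scale_up_at:
        return min(current + 1, max_replicas)  # too hot: add one replica
    if cpu_utilization < scale_down_at:
        return max(current - 1, min_replicas)  # too idle: remove one replica
    return current                             # inside the band: do nothing


print(rule_based_replicas(current=3, cpu_utilization=0.90))
print(rule_based_replicas(current=3, cpu_utilization=0.10))
```

The rule never changes its thresholds and knows nothing about traffic patterns, cost, or business impact. That fixed behavior is the limitation the autonomy levels are meant to address.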

To drill down further, let’s look at the differing roles that humans, observability, automation and AI play in managing cloud work across the autonomy spectrum. We’ll look at generating data, making a recommendation, and approving and executing that action to achieve a desired goal. What we see is that the burden on human operators is high at low levels of autonomy but can be progressively reduced as autonomy increases.

To further illustrate these levels of autonomy, let's look at how they apply to a specific cloud management task: Kubernetes rightsizing and scaling.

This table demonstrates how the levels of autonomy progressively reduce human involvement while increasing the intelligence and capability of the system, grounded in actual operations with Kubernetes examples. As we move up the levels, we see a shift from manual, reactive management to proactive, intelligent optimization that takes into account complex factors like business impact.
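As one illustration of the rightsizing task, the sketch below derives a CPU request from observed usage by taking a high percentile and adding headroom. This is a common heuristic that we are assuming for the example, not a description of any specific product's algorithm; the function name, the 95th percentile, and the 15% headroom are hypothetical defaults.

```python
from typing import Sequence


def recommend_cpu_request(samples_millicores: Sequence[int],
                          percentile: int = 95,
                          headroom_percent: int = 15) -> int:
    """Suggest a CPU request (millicores) from usage samples.

    Uses the nearest-rank percentile of the samples plus a fixed headroom.
    Integer math throughout avoids floating-point rounding surprises.
    """
    if not samples_millicores:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples_millicores)
    # Nearest-rank percentile: rank = ceil(percentile * n / 100), at least 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    usage = ordered[rank - 1]
    # Add headroom and round up to a whole millicore.
    return (usage * (100 + headroom_percent) + 99) // 100


usage_samples = [100, 120, 110, 300, 130, 125, 115, 105, 140, 135]
print(recommend_cpu_request(usage_samples))                 # 345
print(recommend_cpu_request(usage_samples, percentile=50))  # 138
```

At lower levels of autonomy a human would run a calculation like this and edit the manifest by hand. At higher levels the system would compute, validate, and apply the change itself.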

The Current State of Cloud Management

Most organizations today operate at Level 2 or 3 of the autonomy spectrum. They've implemented basic monitoring and alerting systems, and may have some degree of automated responses to common issues, and access to recommendations (e.g., from their cloud provider). However, these automated systems often struggle with the complexity of modern cloud environments, leading to suboptimal performance and requiring frequent human intervention.

The good news is that, with current technology, Level 5 autonomy (Autopilot) is achievable for many cloud-native applications, and Level 4 (Copilot) for legacy applications that involve ad hoc code. Advanced AI-driven platforms can now handle a wide range of cloud optimization and management tasks with minimal human oversight, adapting to changing conditions and making intelligent decisions to optimize performance, cost, and reliability.


Adoption of autonomous systems is growing as part of a wider shift triggered by AI: Gartner predicts that by 2027, the share of platform engineering teams using AI to augment every phase of the SDLC will have increased from 5% to 40%.

Benefits of Moving Up the Autonomy Spectrum

The advantages of advancing along the autonomy spectrum are significant and measurable:

  • Cloud Cost Savings: Autonomous systems help workloads run at their optimal cost.
  • Performance Improvements: Organizations implementing autonomous cloud management have seen substantial quarter-over-quarter improvements in latency reduction, with cumulative gains of hundreds of days of reduced latency over time.
  • Productivity: Autonomous systems can perform actions at a fraction of the cost and time compared to human operators. 
  • Safety: Autonomous systems reduce exposure to human error, which is reported to cause up to 85% of incidents.
  • Scalability: Autonomous systems can manage increasingly complex environments without a proportional increase in human resources, allowing organizations to scale their cloud operations more effectively.

Sedai is one of these autonomous systems; you can see customer results they have achieved here.

Implementing Autonomous Cloud Management

Moving towards autonomous cloud management is a journey that requires careful planning and execution. Here are some steps to get started:

1. Assess Your Current State: Evaluate where your organization sits on the autonomy spectrum. Are you still relying on manual operations, or have you implemented some level of automation & observability? Consider your capabilities and limitations.

2. Set Clear Goals: Determine what level of autonomy you're aiming for. This will often be a function of your scale: at very small scale, manual operations may be acceptable, while at large scale autonomous systems become the most cost-effective model. Is your goal to reach Level 4 (Copilot) in the near term, or are you ready to push towards Level 5 (Autopilot)? Define specific outcomes you want to achieve (e.g., cost reduction, performance improvement, FCI reduction).

3. Invest in the Right Tools: Look for or build platforms that offer advanced AI and machine learning capabilities specifically designed for cloud management. These should go beyond simple automation to provide true autonomous decision-making capabilities.

4. Upskill Your Team: As you move towards higher levels of autonomy, focus on developing your team's higher-level skills. They'll need to shift from executing routine tasks to overseeing and guiding autonomous systems, requiring skills in areas like strategic planning and complex problem-solving.

5. Start Small and Scale: Begin with a pilot project in a valuable, non-critical area (e.g., reducing cloud costs in dev/test environments), prove the concept, and then gradually expand the scope of autonomous management. This approach allows you to build confidence in the system and refine your processes as you go.

The Future of Cloud Management

As we look to the future, it's clear that autonomous systems will play an increasingly central role in cloud management. We can expect to see:

  • More sophisticated AI models that can handle even the most complex cloud environments
  • Greater integration between autonomous cloud management and other IT systems
  • A shift in the role of cloud engineers from hands-on operators to strategic overseers of autonomous systems with the bandwidth to pursue strategic initiatives

Conclusion

The move towards autonomous cloud management isn't just a technological shift – it's a strategic imperative. Organizations that embrace this approach will be better positioned to harness the full potential of the cloud, driving innovation, reducing costs, and delivering superior experiences to their customers.

As you consider your cloud strategy for the coming years, ask yourself: Where does your organization sit on the autonomy spectrum, and what steps can you take to move up? The future of cloud management is autonomous, and the time to start that journey is now.

Note: This post was created with help from Rachit Lohani, CTO of Paylocity.  Paylocity is one of the fastest-growing SaaS businesses in HCM.  Rachit was previously Head of Engineering at Atlassian and Director of Engineering at Intuit.  Rachit also serves as an advisor to Sedai, providing advice on product development since November 2020.  
