The Case Against LLMs in Your CI/CD Pipeline

Suresh Mathew

Founder & CEO

April 15, 2026

My content team is constantly sending me articles to read. When we sat down last week to talk about the “end of CI/CD pipelines,” and how AI is upending what we know as CI/CD, I initially balked.

In what world would an engineering team ever let AI touch CI/CD?

The article I read highlights how AI agents are now replacing traditional pipelines by debugging tests, deploying code, & triaging incidents without human intervention (GitHub is already embedding agent runners natively into CI/CD). 

But as someone who thinks about autonomy constantly, the idea of agentic CI/CD pipelines is a bit concerning to me. 

Most systems that claim to be autonomous still rely on human-defined rules & scripts that break the moment something unexpected happens. Agentic CI/CD is no different, and in a pipeline that touches production, that's a problem.

As engineering teams start shifting their CI/CD pipelines to "agentic judgment," debugging changes entirely. 

In traditional CI/CD pipelines, a failed health check tells you exactly what broke, and a misconfigured YAML file is at least findable. But introducing probabilistic AI, like LLMs, into the pipeline makes production decisions non-deterministic. You can log what the agent did, but you can't guarantee it would make the same call twice.

It’s understandable why the industry is responding to this real shift by reaching for the most familiar AI tool available: LLMs. But LLMs don't belong in production decision-making, not because agentic DevOps is a bad idea, but because we can't trust "close enough" when the action is a production deployment. 

Here's what getting it right actually requires.

What CI/CD Actually Got Right

While I was an engineer at eBay, we had train release cycles where we would release every two months, sometimes longer. When CI/CD was introduced, it fundamentally broke that release model. What used to take months collapsed into hours, then minutes. 

The feedback loop got so tight that releasing stopped being an event and started being a routine. The key to that was determinism: every step in the pipeline was explicit, traceable, & repeatable. If something failed, you knew exactly where & why it happened.

And for a long time, that was enough. But the judgment calls — what to do when a deployment behaved unexpectedly, how to interpret a failure that didn't fit a known pattern, or when to roll back versus push forward — still fell back to us engineers.

We trusted ourselves to make the right decisions, but the toil of making them slowed us down. So now, when engineering teams try to hand that judgment to AI agents, the instinct makes sense: you want to operationalize judgment.

However, agents can only approximate judgment, and in a production pipeline, that approximation introduces risk.

The Problem With LLMs in Your Pipeline

Most of what the industry is calling "agentic DevOps" right now is just LLMs layered on top of existing pipelines. 

But because LLMs are probabilistic reasoners, they're both incredibly smart & unbelievably dumb at the same time, which is what makes them so dangerous in production.

Where It Breaks In Practice

Here's what that actually looks like in practice: a rollout looks clean, SLOs hold, and the agent moves on. An hour later, latency spikes under real traffic. 

The agent never sees it. It checked the metrics at deploy time, saw green, & considered the job done. What happens under real traffic an hour later isn't its problem anymore because it's already on to the next task.

And if the agent does catch the problem and decides to roll back, that's when things get worse. If that deployment included a database schema change, for example, rolling back the code without handling the schema leaves you in a broken state. 

Or, maybe you made a customer commitment tied to that release, and rolling back means breaking an SLA. The agent doesn't know any of that. It only knows "metrics look bad, roll back."

I think about this in the same way I think about cars & aircraft. You can build a faster & faster car, but no matter how fast it goes, it will never fly. If you want to fly, you have to build an aircraft. It's a different vehicle entirely, designed for a different problem. 

Teams layering LLMs onto DevOps pipelines are just building a faster car. They're not building an autonomous engine, and that leaves them open to vulnerabilities.

What Actually Works

LLMs make decisions based on patterns they've learned without ever seeing the real outcomes in your actual system. Reinforcement learning-based autonomous systems are different: they make decisions and learn from what actually happens. 

That feedback loop, decisions grounded in real outcomes, is the foundation of safe agentic DevOps. The question is what you build on top of it.

How To Build A Safe Agentic CI/CD Pipeline

The teams that get agentic DevOps right will be the ones who treat it as an architecture problem. Teams often focus on what an agent can do and how fast it can do it rather than what constrains it. 

Safe agentic systems must rely on layers of guardrails, each one constraining the one above it. 

There are three layers to safely implement an agentic CI/CD pipeline:

  • Agent observability
  • Incremental execution & explicit rollback
  • Continuous validation against live signals

Agent Observability

Observability is nothing new; the market is flooded with observability tools & dashboards. But what is missing from current observability is the ability to see how agents are reasoning.

In a CI/CD pipeline, this is especially dangerous. An agent misclassifying a test failure or misreading an incident signal doesn't just produce one wrong output; it cascades into compounding errors.

"In production, only act on true confidence. There's a difference between 99.9% and 100%. At 99.9%, you wait. At 100%, you move."

Suresh Mathew

CEO, Sedai

This might mean a deployment proceeds when it shouldn't or a rollback gets skipped. By the time something visibly breaks in production, you're three or four decisions deep with no decision log to trace back through.

So how can engineers effectively implement observability? 

At Sedai, we log everything: the prior state, the new state, and the reasoning behind the decision. We ask questions like:

  • Why did we resize that deployment? Because a new release slowed the application down & created a cascade effect. 
  • Why did we adjust resources there? Because traffic shifted & dependencies changed. 

An engineer reviewing that log should see the complete picture: not just "action taken," but the signals that triggered it and what we expected to happen. 

For me, that's non-negotiable. If you can't explain a decision, you can't trust it.
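As a concrete illustration, a decision log entry can be sketched as a small structured record. This is a hypothetical shape, not Sedai's actual schema; the field names, release version, and metric values are illustrative:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class DecisionRecord:
    """One autonomous action, with enough context to reconstruct the reasoning."""
    action: str            # what the agent did, e.g. "resize_deployment"
    prior_state: dict      # system state before the action
    new_state: dict        # system state after the action
    signals: list          # the observations that triggered the decision
    reasoning: str         # why the agent chose this action
    expected_outcome: str  # what the agent predicted would happen
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = DecisionRecord(
    action="resize_deployment",
    prior_state={"replicas": 4, "p99_latency_ms": 180},
    new_state={"replicas": 6, "p99_latency_ms": 95},
    signals=["new release slowed the application",
             "cascade effect on downstream services"],
    reasoning="New release degraded latency and created a cascade effect.",
    expected_outcome="p99 latency returns below the 120 ms SLO within 10 minutes.",
)

# The reviewer sees the full picture, not just "action taken".
print(json.dumps(asdict(record), indent=2))
```

The point of the structure is that "why" travels with "what": a reviewer three or four decisions deep can walk the log backwards instead of reverse-engineering the agent's reasoning.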

Incremental Execution & Explicit Rollback

The most terrifying possibility I see in agentic CI/CD is the lack of a clearly defined rollback. Errors are inevitable, especially in systems that are still learning, but no action that affects production should be irreversible without explicit confirmation.  

Most of the time when rollback happens, it's not because a decision was wrong, but because the context around the decision shifted. This context is what we build into Sedai: we watch for novel signals that make a previously safe decision unsafe going forward.

For teams starting out, keep it simple. Build a rule: if you detect behavior or data you've never observed, err on the side of caution and roll back the last autonomous change. 

It’s better to revert & investigate than push forward into unknown territory. Honestly, with proper guardrails & continuous observation, you shouldn't hit rollback often.
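That starting rule fits in a few lines. A minimal sketch, assuming signals arrive as labeled events; the signal names here are invented for illustration:

```python
# Signals this pipeline has previously observed and classified as safe.
KNOWN_SIGNALS = {"deploy_started", "healthcheck_passed", "latency_within_slo"}

def next_action(observed: set[str], known: set[str] = KNOWN_SIGNALS) -> str:
    """If anything falls outside everything seen before, revert the last
    autonomous change and investigate rather than push into unknown territory."""
    novel = observed - known
    if novel:
        return "rollback_last_change"
    return "proceed"

# A familiar deploy proceeds; an unrecognized signal triggers rollback.
print(next_action({"deploy_started", "healthcheck_passed"}))        # proceed
print(next_action({"deploy_started", "schema_drift_detected"}))     # rollback_last_change
```

The rule is deliberately conservative: it doesn't try to judge whether the novel signal is harmful, only whether it has ever been seen before. Judging severity comes later, after the system is back in a known-good state.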

Continuous Validation Against Live Signals

Continuous validation is where most agentic DevOps implementations fall apart.

Everyone’s instinct is to validate at deployment time: run the tests, check the thresholds, and confirm the rollout looks clean. 

But in an agentic system, you must watch everything observable, from the obvious signals like SLOs & latency, to the non-obvious, like dependency shifts & resource allocation. Any deviation from baseline is a signal worth capturing.

As for when to act: never act on high confidence. Act only on true confidence. There's a meaningful difference between 99.9% and 100%. At 99.9%, you wait; at 100%, you move.
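A minimal sketch of that window-long validation, with illustrative metric names and a hypothetical 10% deviation tolerance:

```python
def deviates(metric: str, value: float, baseline: dict[str, float],
             tolerance: float = 0.10) -> bool:
    """Any deviation from baseline beyond tolerance is a signal worth capturing."""
    expected = baseline[metric]
    return abs(value - expected) / expected > tolerance

def validate_window(samples: list[dict[str, float]],
                    baseline: dict[str, float]) -> str:
    """Validate continuously, not just at deploy time: the agent acts only
    when every metric in every sample stayed within baseline (true
    confidence). A single deviation anywhere in the window means wait."""
    for sample in samples:
        for metric, value in sample.items():
            if deviates(metric, value, baseline):
                return "wait"  # 99.9% is not 100%
    return "act"

baseline = {"p99_latency_ms": 120.0, "error_rate": 0.0100}
clean_hour = [{"p99_latency_ms": 118.0, "error_rate": 0.0095}] * 60
# Deploy-time checks looked green, then latency spiked under real traffic.
spiky_hour = clean_hour[:30] + [{"p99_latency_ms": 400.0, "error_rate": 0.05}]

print(validate_window(clean_hour, baseline))  # act
print(validate_window(spiky_hour, baseline))  # wait
```

Note what this catches that deploy-time validation misses: the spike arrives thirty minutes into the window, long after the rollout "looked clean," and a single bad sample is enough to hold the agent back.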

Conclusion

Agentic DevOps is coming because the productivity gains are real: faster deployments, fewer manual interventions, and pipelines that can respond without waiting for a human. But with that comes risk, and engineering teams must be realistic about what LLMs can’t do, and build for failure before it happens.

At Sedai, this is the only model we've ever known. Every autonomous action we take is grounded in real system behavior, constrained by explicit policy, & reversible by design. Not because we're cautious, but because that's what safe autonomy requires.