Why AI Agents Still Can't Replace Human Work | Eng Reacts

Why do current AI agents struggle to match human performance on real-world tasks?

Recent benchmarks, such as ALE, show that frontier AI agents—including models like Fable 5 and GPT-5.5—achieved a 0% success rate on the hardest, most human-like tier. This is because real work requires localized domain knowledge, tribal context, and iterative feedback, which current AI models lack. AI agents excel at general knowledge but need specialized context and guidance to solve complex, bespoke problems. (Source: Sedai blog, June 24, 2026)

Can public benchmarks accurately predict AI agent performance in enterprise environments?

No public benchmark can fully predict how a model will perform in your environment. Benchmarks are often gameable, and their predictive value decays as providers optimize for leaderboard scores. The only benchmark that matters is the one run against your own task distribution and business outcomes. (Source: Sedai blog, June 24, 2026)

What does the 0% score on the hardest AI benchmark tier actually tell us?

The 0% score clarifies that there is a class of work requiring sustained reasoning and reliability where no current model succeeds. It highlights the need to deliberately separate tasks AI agents can handle from those that still require human oversight, designing workflows to capture value where agents excel and backstop where they fall short. (Source: Sedai blog, June 24, 2026)

Why is flexible routing between AI models important for enterprise tasks?

AI capability is uneven and highly task-dependent. The best model for a task today may not be the best tomorrow, as frontier models evolve rapidly. Flexible routing allows enterprises to select the optimal model for each problem, adapting to changing capabilities and ensuring reliable outcomes. Static routing configurations can quickly become obsolete. (Source: Sedai blog, June 24, 2026)

How does Sedai optimize AI agent routing for enterprise workflows?

Sedai's platform is designed to continuously learn from existing data and suggest optimal routing as task mixes and models change. It adapts to evolving model capabilities, ensuring that tasks are assigned to the most effective agent at any given time. Sedai's patented safety-by-design approach ensures gradual, validated optimizations, minimizing risk and preventing incidents or SLO breaches. Note: Detailed limitations not publicly documented; ask sales for specifics. (Source: Sedai blog, knowledge_base)

What are Sedai's core features for optimizing cloud operations and AI agents?

Sedai offers autonomous optimization, application-aware intelligence, proactive issue resolution, full-stack cloud coverage, safety-by-design, release intelligence, and plug-and-play implementation. It integrates with tools like Prometheus, Datadog, CloudWatch, Azure Monitor, ServiceNow, PagerDuty, Jira, GitHub, GitLab, Bitbucket, Terraform, AWS Lambda, and AWS Fargate. Note: Sedai is best fit for teams seeking autonomous optimization; teams needing manual control may want to consider alternatives. (Source: knowledge_base)

How does Sedai ensure safe, autonomous optimizations in production?

Sedai uses a patented safety-by-design approach to make autonomous optimizations in production environments. It performs gradual, incremental changes with continuous health verification and automatic rollbacks, preventing incidents and SLO breaches. Unlike risky optimizers that make all-at-once changes, Sedai validates every step to ensure reliability. Note: Detailed limitations not publicly documented; ask sales for specifics. (Source: knowledge_base)
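
The general pattern described here (small steps, a health check after each one, rollback on failure) can be sketched in a few lines of Python. This is an illustration of the pattern only, not Sedai's implementation: the function name `apply_gradually` and the `apply` and `is_healthy` hooks are hypothetical and supplied by the caller.

```python
def apply_gradually(current, target, step, apply, is_healthy):
    """Move a setting from current toward target in small steps.

    apply and is_healthy are caller-supplied callables (hypothetical hooks,
    not a real API). Returns the value in effect when the function finishes.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    value = current
    while value != target:
        previous = value
        # Take one bounded step toward the target, never overshooting it.
        if target > value:
            value = min(value + step, target)
        else:
            value = max(value - step, target)
        apply(value)
        if not is_healthy():
            # Automatic rollback to the last value that passed its check.
            apply(previous)
            return previous
    return value
```

The contrast with an all-at-once change is that a failed health check costs at most one step, and the last validated value is restored before anything else happens.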

What measurable business impact can Sedai deliver?

Sedai customers have achieved up to 50% reduction in cloud costs, 75% latency reduction, 50% reduction in engineering toil, and 75% fewer failed customer interactions. For example, KnowBe4 reduced response time from 18.5 seconds to 80 milliseconds (99.5% duration reduction), saving $1.2 million on AWS costs. Palo Alto Networks saved $3.5 million through Sedai's optimization. Typical ROI exceeds 400%, with payback in under six months. Note: Results may vary based on environment complexity. (Source: knowledge_base, https://sedai.io/blog/knowbe4, https://sedai.io/resources/palo-alto-networks-case-study)

Which industries have benefited from Sedai's autonomous optimization?

Sedai's case studies span cybersecurity (Palo Alto Networks, KnowBe4), financial services (Experian), healthcare, e-commerce (Wayfair, Campspot), IT and technology (HP, Freshworks), consumer goods (Belcorp), and digital commerce (Informed). These examples demonstrate Sedai's applicability across diverse sectors. Note: Industry-specific limitations not publicly documented; ask sales for details. (Source: knowledge_base)

What is Sedai's pricing model?

Sedai uses a volume-based pricing model, charging based on resources optimized (Kubernetes pods, ECS tasks, VMs, etc.). Pricing is transparent, with no hidden fees, and adapts to usage. Sedai offers a free tier and a 30-day free trial. For Kubernetes environments, a demo is recommended to determine the best pricing structure. Note: Pricing may vary for complex environments; contact Sedai for specifics. (Source: https://www.sedai.io/pricing)

How long does it take to implement Sedai, and how easy is it to start?

Initial onboarding takes approximately 15 minutes for agentless or agent-based deployment. Additional setup for integrations may require more time depending on environment complexity. Sedai offers plug-and-play implementation, integrates with existing tools, and operates autonomously, reducing manual oversight. Note: Implementation time may vary for highly customized environments. (Source: knowledge_base, https://sedai.io/demo/finops-optimization)

What security and compliance certifications does Sedai hold?

Sedai is SOC 2 certified, demonstrating adherence to stringent security requirements and industry standards for data protection and compliance. For more details, visit Sedai's Security page. Note: Additional certifications may be available; contact Sedai for specifics. (Source: knowledge_base)

Where can I find technical documentation for Sedai?

Sedai provides a Getting Started Guide, Kubernetes Optimization Guide, and a Platform Overview. These resources are available at docs.sedai.io/get-started and sedai.io/resources. Note: Documentation may be updated; check the official site for the latest versions. (Source: knowledge_base)

What integrations does Sedai support?

Sedai integrates with monitoring and APM tools (Prometheus, Datadog, CloudWatch, Azure Monitor), Kubernetes autoscalers (HPA/VPA, Karpenter), CI/CD pipelines (GitHub, GitLab, Bitbucket, Terraform), ITSM platforms (ServiceNow, PagerDuty, Jira), notification tools, runbook automation platforms, and serverless environments (AWS Lambda, AWS Fargate). Note: Integration availability may vary by environment; contact Sedai for specifics. (Source: knowledge_base)

Who can benefit from Sedai's platform?

Sedai is designed for IT/cloud operations managers, FinOps leads, technology leaders (CTO, CIO, VP Engineering), platform engineers, DevOps engineers, and SREs. It addresses challenges like reducing tickets, managing change risk, automating repetitive tasks, aligning engineering and financial goals, and preventing SLO breaches. Note: Teams requiring manual optimization may want to consider alternatives. (Source: knowledge_base)

A new AI benchmark that dropped last week evaluates agents based on models like Fable 5 and GPT-5.5, among other frontier systems. The benchmark seeks to answer the question: can AI agents perform useful and effective work across real-world teams and tasks?

The answer was “sobering,” according to lead researcher Dawn Song, as every frontier agent tested achieved a 0% success rate on the benchmark’s hardest and most human tier.

I asked my engineering leaders: Is AI further away from replacing human work than many of us thought, or are we asking the wrong questions?

Why Human Context Remains AI's Biggest Blind Spot

Hari Chandrasekhar (SVP of Engineering, Core)

Current models achieving 0% success on the hardest benchmarking tier matches what we see.

AI models are strong on general knowledge: an AI agent built on an LLM is basically a brilliantly educated day-one new hire. But dropping a smart person into a complex, bespoke problem doesn't automatically yield results. Real work needs localized domain knowledge, tribal context, and usually a few rounds of back and forth to steer toward the right solution.

If general intelligence alone solved everything, you wouldn't need specialized skills or effort to build anything. So the results line up. Agents are genuinely good, but good isn't going to be 100% all the time. They still need context and nudging down the right path for a specific scenario.

So yes, using different agents and models by their strengths is the organic path. We see the same pattern even within a single model where role assumption is used to keep it focused. The same model can design and review, but it takes on a different role for each task to stay on the task at hand. Given the capability gaps between agents, specializing them makes obvious sense.

The catch is that those capabilities are highly dynamic right now. Frontier models are evolving fast, and the gaps shift week to week. The top agent for a task today may not be the top agent in a month. So I'd push back on designing the system around any specific agent or model. It should be the other way around.

The platform must be flexible enough to pick the right model for the problem at the time. Build for the goal with the best available resources, not for whichever model leads today. Routing is good engineering. Hardwiring to a specific model could become a liability when capability is improving at this pace.

Why No Benchmark Can Capture AI Capability

Nikhil Gopinath Kurup (SVP of Engineering, ML)

Honestly, my first reaction is that "Are agents job-ready?" is the wrong question to ask. It dilutes what AI actually is and how it actually behaves.

In reality, there are three distinct parts that make “AI” what it is:

The model
The agent (model plus tools, memory, scaffolding)
The harness that frames and grades the task

While ALE is really testing the agent and harness, the headline says "Fable 5 got 0%." That's a misattribution; the same base model can go from failing to passing depending on the tooling and retry budget around it.

Even "the model" can’t be seen as a singular entity:

The quantized version deployed isn't necessarily the same as the model benchmarked
MoE models vary capability per token
Multimodal models can leverage visual context when text-only can’t

There's no apples-to-apples here, ever. You're always comparing a specific model-build × harness × task-mix, and that number only means something relative to your own stack.
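
A minimal Python sketch of what "relative to your own stack" can mean in practice: every score belongs to one (model build, harness) pair, and per-task pass rates are weighted by your own task mix. The labels and numbers below are illustrative assumptions, not ALE results, and `Config` and `weighted_pass_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    """One evaluated combination; a score only means something per config."""
    model_build: str  # hypothetical label, e.g. a specific quantized build
    harness: str      # hypothetical label for tools, retry budget, and grader


def weighted_pass_rate(pass_rates: Mapping[str, float],
                       task_mix: Mapping[str, float]) -> float:
    """Weight per-task pass rates by how often each task occurs in your workload."""
    total = sum(task_mix.values())
    return sum(pass_rates.get(task, 0.0) * share
               for task, share in task_mix.items()) / total


# Illustrative numbers only; not taken from ALE or any real benchmark.
results = {
    Config("model-a-int8", "single-shot"): {"triage": 0.5, "migration": 0.0},
    Config("model-a-int8", "tools+retries"): {"triage": 0.75, "migration": 0.25},
}
my_task_mix = {"triage": 3, "migration": 1}  # three triage tasks per migration

for config, pass_rates in results.items():
    print(config, weighted_pass_rate(pass_rates, my_task_mix))
```

With these made-up inputs the same model build scores 0.375 under one harness and 0.625 under the other, which is the misattribution problem in miniature: the number describes the whole cell, not the model alone.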

The bigger problem is that benchmarks are gameable by design, guaranteed by incentives. Providers optimize to look good on the leaderboards everyone's watching, scores leak into training data, and the moment a benchmark gets popular, its predictive value starts decaying. It's Goodhart's law: the measure becomes a target and stops being a good measure.

So any public benchmark has a half-life. Once people start optimizing for it, its ability to predict real-world performance degrades.

ALE's one genuinely strong property here isn't the "economic" framing at all; it's that realistic expert tasks are harder to game than traditional benchmarks, so the results remain useful for longer.

That advantage comes from the realism of the tasks, not from assigning a monetary value to them.

And even then, no public benchmark can tell you how a model will perform in your environment. The only benchmark that ultimately matters is the one you run against your own task distribution.

On economic value itself: Song flattens it to one axis (labor-hours saved), then in the same breath admits it breaks down in research, where wages have nothing to do with eventual impact.

The reality is that value is a vector, not a scalar. It's toil reduction, sure, but also compute and cost saved elsewhere, accuracy, actual business outcomes like SLOs or revenue, and cycle-time reduction that pays off even with zero headcount change.

A generic benchmark can't possibly know your error-severity weighting or your SLOs, so it can't measure your value for you. That measurement must be built into the deployment context and instrumented continuously against your own outcomes. ALE is a decent coarse prior to start from, not the answer.
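
To make "value is a vector" concrete, here is a small Python sketch under stated assumptions: the five dimensions mirror the ones named above, each is assumed to be normalized to 0..1, and the weights are whatever your organization decides. `ValueVector` and `business_value` are hypothetical names for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ValueVector:
    """Hypothetical value dimensions, each assumed normalized to 0..1."""
    toil_reduction: float
    cost_saved: float
    accuracy: float
    slo_attainment: float
    cycle_time_reduction: float


def business_value(observed: ValueVector, weights: ValueVector) -> float:
    """Collapse the vector to one number using weights only you can choose."""
    total_weight = sum(getattr(weights, f.name) for f in fields(weights))
    weighted = sum(getattr(observed, f.name) * getattr(weights, f.name)
                   for f in fields(observed))
    return weighted / total_weight


# Same observed outcomes, two different organizations' weightings.
observed = ValueVector(0.5, 0.25, 1.0, 1.0, 0.0)
ops_weights = ValueVector(1, 1, 1, 1, 0)    # ignores cycle time
speed_weights = ValueVector(0, 0, 0, 0, 1)  # cares only about cycle time

print(business_value(observed, ops_weights))    # 0.6875
print(business_value(observed, speed_weights))  # 0.0
```

Identical outcomes score 0.6875 for one team and 0.0 for another, which is why a generic benchmark cannot produce this number on your behalf.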

Why The 0% Score Is Actually Useful

Shankar Jothi (VP of Engineering, ML)

I do buy "economically valuable work" as a better yardstick than what we've been using. Abstract benchmarks were always a proxy for a proxy, since we measured whether a model could answer hard questions and hoped that predicted whether it could do a job.

Grounding evaluation in work that organizations already pay people for closes that gap and reframes the question that matters to a business: not "how smart is this model," but "can it reliably do the thing we'd otherwise hire for?"

The caveat is Song's own: labor hours and pay are a decent stand-in for value in operational work but break down at the edges, like research, where one result can outweigh years of effort. So it's the best general-purpose lens we have rather than a universal one.

The 0% on the hardest tier reads as discouraging but is actually clarifying. It says the frontier isn't soft, because there's a real class of work needing sustained reasoning and long-horizon reliability where no current model gets there, and no amount of scaffolding changes that today.

The implication is that "deploy agents" shouldn't be one decision. The useful and not-yet-useful work lives in the same job, often the same workflow, and the discipline right now is telling them apart by being deliberate about which slices you hand off, at what reliability bar, and with what verification around the parts that still need a human.

The maturity question has evolved, and that's the real takeaway.

"Are agents good enough yet?" is binary and mostly unanswerable. But "Which units of value can we capture now, and at what cost and risk?" is something engineering orgs are well equipped to answer, task by task. The age of useful agents is here, and the work is being honest about where "useful" ends so you can design for that boundary instead of pretending it isn't there.

Uneven Capability Demands Better Routing

Benjamin Thomas (Co-Founder & CTO)

I think the more interesting takeaway from this piece isn’t that models failed at the highest benchmarking tier, but that these benchmarking results show AI capability is unevenly distributed and still highly task-dependent.

While I agree this means we need to make routing non-negotiable, I’d push back on calling it “simply good engineering,” with the simplicity that implies. Routing on cost is simply good engineering. Routing on efficacy is the hard problem, and conflating the two gets you a cheap router that quietly makes worse decisions.

Cost is easy: it’s a published number (price, latency, context window). Lookup table, done.
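
As a rough Python illustration of why the cost side is a lookup: filter by context window and latency budget, then take the cheapest. The model names, prices, and the `cheapest_model` helper are hypothetical placeholders, not real price data.

```python
from typing import Optional

# Hypothetical price sheet: names and numbers are illustrative, not real prices.
PRICE_SHEET = {
    "model-small": {"usd_per_1k_tokens": 0.5, "p50_latency_ms": 300,
                    "context_window": 32_000},
    "model-large": {"usd_per_1k_tokens": 4.0, "p50_latency_ms": 1200,
                    "context_window": 200_000},
}


def cheapest_model(prompt_tokens: int, max_latency_ms: int) -> Optional[str]:
    """Return the cheapest model that fits the prompt and latency budget, or None."""
    eligible = [
        name for name, spec in PRICE_SHEET.items()
        if spec["context_window"] >= prompt_tokens
        and spec["p50_latency_ms"] <= max_latency_ms
    ]
    return min(eligible,
               key=lambda name: PRICE_SHEET[name]["usd_per_1k_tokens"],
               default=None)


print(cheapest_model(prompt_tokens=1_000, max_latency_ms=2_000))    # model-small
print(cheapest_model(prompt_tokens=100_000, max_latency_ms=2_000))  # model-large
```

Nothing in this function knows whether the chosen model is any good at the task, which is exactly the gap the efficacy layer has to fill.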

Efficacy is latent. You don’t know whether model A or B is better at this task until you’ve run it and verified the output, and that answer drifts every time a new version ships. A static “GPT for X, Claude for Y” table is stale the moment the next release lands.

So cost is the easy 20% and efficacy is the hard 80%, where you have to classify the task, learn which model actually wins on it, verify the output, and send the work models fail at to a human.

The plumbing (routing on cost and latency) you can buy off the shelf. The efficacy layer is where it gets interesting: you want a product that learns from your existing data and continuously suggests the optimal route as your task mix and models change.

A static routing config decays when the frontier evolves. A system that keeps optimizing against verified outcomes is the one that keeps pace with change.
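
One possible shape for that efficacy layer, sketched in Python under loud assumptions: outcomes are recorded only after verification, each (task class, model) pair keeps a running success estimate, and work falls back to a human when no model clears a reliability bar. `EfficacyRouter`, the 0.6 bar, and the model names are hypothetical; this is not a description of any vendor's router.

```python
from collections import defaultdict


class EfficacyRouter:
    """Route each task class to the model with the best verified success rate."""

    def __init__(self, models, reliability_bar=0.6):
        self.models = list(models)
        self.reliability_bar = reliability_bar
        # (task_class, model) -> [verified successes, verified trials]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, task_class, model, verified_ok):
        """Store one outcome after the output was checked by a verifier or a person."""
        entry = self.stats[(task_class, model)]
        entry[0] += int(verified_ok)
        entry[1] += 1

    def estimate(self, task_class, model):
        """Laplace-smoothed success rate; an untried model starts at 0.5."""
        successes, trials = self.stats[(task_class, model)]
        return (successes + 1) / (trials + 2)

    def route(self, task_class):
        """Return the best model for this task class, or "human" if none clears the bar."""
        best = max(self.models, key=lambda m: self.estimate(task_class, m))
        if self.estimate(task_class, best) < self.reliability_bar:
            return "human"
        return best


router = EfficacyRouter(["model-a", "model-b"])
print(router.route("rightsizing"))            # human: nothing verified yet
for _ in range(3):
    router.record("rightsizing", "model-b", verified_ok=True)
print(router.route("rightsizing"))            # model-b: estimate is 4/5
for _ in range(4):                            # e.g. a new release regresses
    router.record("rightsizing", "model-b", verified_ok=False)
print(router.route("rightsizing"))            # human: best estimate is 0.5
```

The sketch deliberately leaves out exploration. A real router would also need to keep sending a small share of traffic to non-leading models, and to age out old outcomes, or it would never notice when a new release changes the ranking.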

What Intelligence Benchmarking Is Missing

Suresh Mathew (Founder & CEO)

Yes, economic value is the right proxy and it’s overdue.

Benchmarks like MMLU tell you a model is smart, but ALE asks if it’s worth paying for. That’s a much harder bar, and the 0% on the hardest tier shows why the intelligence vs. economic discourse matters.

For enterprises, the ROI reframe is simple: stop asking “Can AI do X?” and start asking “At what tier of X, and with what human backstop?” This compresses the time before an expert needs to step in rather than fully replacing them.

We see this at Sedai. Agents handling routine cloud optimization like rightsizing, cost anomalies, and scaling decisions deliver real economic value daily because we scope them to where they actually excel and drive results, not because they’re job-ready across the board.

Economic value tied to hours, expertise, and output is exactly how enterprises should be evaluating AI ROI right now, not capability in the abstract.

Stop guessing where to route your models. See how Sedai does that for you.

Frequently Asked Questions

AI Agent Benchmarking & Human Context