A new AI benchmark that dropped last week evaluates agents based on models like Fable 5 and GPT-5.5, among other frontier systems. The benchmark seeks to answer the question: can AI agents perform useful and effective work across real-world teams and tasks?
The answer was “sobering,” according to lead researcher Dawn Song, as every frontier agent tested achieved a 0% success rate on the benchmark’s hardest and most human tier.
I asked my engineering leaders: Is AI further away from replacing human work than many of us thought, or are we asking the wrong questions?
Why Human Context Remains AI's Biggest Blind Spot
Hari Chandrasekhar (SVP of Engineering, Core)
Current models achieving 0% success on the hardest benchmarking tier matches what we see.
AI models are strong on general knowledge: an AI agent built on an LLM is basically a brilliantly educated day-one new hire. But dropping a smart person into a complex, bespoke problem doesn't automatically yield results. Real work needs localized domain knowledge, tribal context, and usually a few rounds of back and forth to steer toward the right solution.
If general intelligence alone solved everything, you wouldn't need specialized skills or effort to build anything. So the results line up. Agents are genuinely good, but good isn't going to be 100% all the time. They still need context and nudging down the right path for a specific scenario.
“If general intelligence alone solved everything, you wouldn't need specialized skills or effort to build anything.”
Hari Chandrasekhar
SVP of Engineering, Core
So yes, using different agents and models by their strengths is the organic path. We see the same pattern even within a single model where role assumption is used to keep it focused. The same model can design and review, but it takes on a different role for each task to stay on the task at hand. Given the capability gaps between agents, specializing them makes obvious sense.
The catch is that those capabilities are highly dynamic right now. Frontier models are evolving fast, and the gaps shift week to week. The top agent for a task today may not be the top agent in a month. So I'd push back on designing the system around any specific agent or model. It should be the other way around.
The platform must be flexible enough to pick the right model for the problem at the time. Build for the goal with the best available resources, not for whichever model leads today. Routing is good engineering. Hard wiring to a specific model could become a liability when capability is improving at this pace.
Why No Benchmark Can Capture AI Capability
Nikhil Gopinath Kurup (SVP of Engineering, ML)
Honestly, my first reaction is that "Are agents job-ready?" is the wrong question to ask. It dilutes what AI actually is and how it actually behaves.
In reality, there are three distinct parts that make “AI” what it is:
- The model
- The agent (model plus tools, memory, scaffolding)
- The harness that frames and grades the task
While ALE is really testing the agent and harness, the headline says "Fable 5 got 0%." That's a misattribution; the same base model can go from failing to passing depending on the tooling and retry budget around it.
Even "the model" can’t be seen as a singular entity:
- The quantized version deployed isn't necessarily the same as the model benchmarked
- MoE models route each token to different experts, so capability can vary per token
- Multimodal models can leverage visual context when text-only can’t
There's no apples-to-apples here, ever. You're always comparing a specific model-build × harness × task-mix, and that number only means something relative to your own stack.
The bigger problem is that benchmarks are gameable by design, guaranteed by incentives. Providers optimize to look good on the leaderboards everyone's watching, scores leak into training data, and the moment a benchmark gets popular, its predictive value starts decaying. It's Goodhart's law: the measure becomes a target and stops being a good measure.
So any public benchmark has a half-life. Once people start optimizing for it, its ability to predict real-world performance degrades.
ALE's one genuinely strong property here isn't the "economic" framing at all; it's that realistic expert tasks are harder to game than traditional benchmarks, so the results remain useful for longer.
That advantage comes from the realism of the tasks, not from assigning a monetary value to them.
"The only benchmark that ultimately matters is the one you run against your own task distribution."
Nikhil Gopinath Kurup
SVP of Engineering, ML
And even then, no public benchmark can tell you how a model will perform in your environment. The only benchmark that ultimately matters is the one you run against your own task distribution.
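One way to make "your own task distribution" concrete is to weight each task type's pass rate by its share of your real workload. The following is only a minimal sketch: the function name `weighted_pass_rate`, the `incident_triage` task type, and the workload shares are hypothetical and would come from your own eval runs and traffic data.

```python
from collections import defaultdict

def weighted_pass_rate(results, task_mix):
    """results: iterable of (task_type, passed) pairs from your own eval runs.
    task_mix: dict of task_type -> share of your real workload (shares sum to 1)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for task_type, ok in results:
        total[task_type] += 1
        passed[task_type] += int(ok)
    # Weight each task type's pass rate by how often it shows up in your workload.
    # A task type with no eval runs contributes 0 rather than being ignored.
    return sum(
        share * (passed[t] / total[t] if total[t] else 0.0)
        for t, share in task_mix.items()
    )

# Hypothetical data: four verified runs and an 80/20 workload split.
runs = [
    ("rightsizing", True),
    ("rightsizing", True),
    ("incident_triage", False),
    ("incident_triage", True),
]
mix = {"rightsizing": 0.8, "incident_triage": 0.2}
score = weighted_pass_rate(runs, mix)  # roughly 0.9 for this mix
```

The same runs scored against a different workload mix give a different number, which is the point: the score is only meaningful relative to your own stack.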
On economic value itself: Song flattens it to one axis (labor-hours saved), then in the same breath admits it breaks (research, where wages have nothing to do with eventual impact).
The reality is value is a vector, not a scalar. It's toil reduction, sure, but also compute and cost saved elsewhere, accuracy, actual business outcomes like SLOs or revenue, and cycle-time reduction that pays off even with zero headcount change.
A generic benchmark can't possibly know your error-severity weighting or your SLOs, so it can't measure your value for you. That measurement must be built into the deployment context and instrumented continuously against your own outcomes. ALE is a decent coarse prior to start from, not the answer.
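As an illustration of "value is a vector, not a scalar," the sketch below keeps each dimension separate and only collapses to a single number at the end, using weights the deploying team owns. The field names, units, and weights are assumptions for illustration, not anything ALE defines.

```python
from dataclasses import dataclass

@dataclass
class ValueVector:
    # Hypothetical dimensions, taken from the kinds of value listed above.
    toil_hours_saved: float
    compute_cost_saved: float       # in your own currency units
    accuracy_delta: float           # change in accuracy, e.g. +0.02
    slo_violations_avoided: int
    cycle_time_saved_hours: float

def weighted_value(v, weights):
    """Collapse the vector to a scalar using only the dimensions you chose to weight."""
    return sum(w * getattr(v, name) for name, w in weights.items())

# Hypothetical measurement for one deployment and one team's weighting.
observed = ValueVector(
    toil_hours_saved=10,
    compute_cost_saved=200,
    accuracy_delta=0.0,
    slo_violations_avoided=3,
    cycle_time_saved_hours=5,
)
team_weights = {"toil_hours_saved": 50, "slo_violations_avoided": 1000}
total = weighted_value(observed, team_weights)
```

Two teams with the same observed vector but different weights will reach different totals, which is why a generic benchmark cannot do this measurement for you.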
Why The 0% Score Is Actually Useful
Shankar Jothi (VP of Engineering, ML)
I do buy "economically valuable work" as a better yardstick than what we've been using. Abstract benchmarks were always a proxy for a proxy, since we measured whether a model could answer hard questions and hoped that predicted whether it could do a job.
Grounding evaluation in work organizations already pay people for closes that gap and reframes the question that matters to a business: not "how smart is this model," but "can it reliably do the thing we'd otherwise hire for?"
The caveat is Song's own, because labor hours and pay are a decent stand-in for value in operational work but break down at the edges like research, where one result can outweigh years of effort. So it's the best general-purpose lens we have rather than a universal one.
The 0% on the hardest tier reads as discouraging but is actually clarifying. It says the frontier isn't soft, because there's a real class of work needing sustained reasoning and long-horizon reliability where no current model gets there, and no amount of scaffolding changes that today.
"The age of useful agents is here, and the work is being honest about where 'useful' ends."
Shankar Jothi
VP of Engineering, ML
The implication is that "deploy agents" shouldn't be one decision. The useful and not-yet-useful work lives in the same job, often the same workflow, and the discipline right now is telling them apart by being deliberate about which slices you hand off, at what reliability bar, and with what verification around the parts that still need a human.
The maturity question has evolved, and that's the real takeaway.
"Are agents good enough yet?" is binary and mostly unanswerable. But "Which units of value can we capture now, and at what cost and risk?" is something engineering orgs are well equipped to answer, task by task. The age of useful agents is here, and the work is being honest about where "useful" ends so you can design for that boundary instead of pretending it isn't there.
Uneven Capability Demands Better Routing
Benjamin Thomas (Co-Founder & CTO)
I think the more interesting takeaway from this piece isn’t that models failed at the highest benchmarking tier, but that these results show AI capability is unevenly distributed and still highly task-dependent.
While I agree this means we need to make routing non-negotiable, I’d push back on calling it “simply good engineering,” and on the simplicity that implies. Routing on cost is simply good engineering. Routing on efficacy is the hard problem, and conflating the two gets you a cheap router that quietly makes worse decisions.
“Cost is the easy 20% and efficacy is the hard 80%.”
Benji Thomas
Co-Founder & CTO
Cost is easy: it’s a published number (price, latency, context window). Lookup table, done.
Efficacy is latent. You don’t know whether model A or B is better at this task until you’ve run it and verified the output, and that answer drifts every time a new version ships. A static “GPT for X, Claude for Y” table is stale the moment the next release lands.
So cost is the easy 20% and efficacy is the hard 80%, where you have to classify the task, learn which model actually wins on it, verify the output, and send the work models fail at to a human.
The plumbing (routing on cost and latency) you can buy off the shelf. The efficacy layer is where it gets interesting: you want a product that learns from your existing data and continuously suggests the optimal route as your task mix and models change.
A static routing config decays when the frontier evolves. A system that keeps optimizing against verified outcomes is the one that keeps pace with change.
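To make the cost-versus-efficacy split concrete, here is a minimal sketch of a router that keeps cost as a lookup table but learns efficacy from verified outcomes, and sends work to a human when no model clears a reliability bar. Everything here is an assumption for illustration: the class name `EfficacyRouter`, the model names, the prices, the 0.6 threshold, and the simple smoothed success rate are hypothetical, not a description of any particular product.

```python
from collections import defaultdict

class EfficacyRouter:
    def __init__(self, costs, min_success=0.6, prior_wins=1, prior_trials=2):
        self.costs = costs              # model -> published price (the easy, lookup-table part)
        self.min_success = min_success  # reliability bar below which work goes to a human
        self.prior_wins = prior_wins    # smoothing so untested models start at 0.5
        self.prior_trials = prior_trials
        self.wins = defaultdict(int)
        self.trials = defaultdict(int)

    def record(self, task_type, model, verified_ok):
        """Log one verified outcome for a (task type, model) pair."""
        key = (task_type, model)
        self.trials[key] += 1
        self.wins[key] += int(verified_ok)

    def estimate(self, task_type, model):
        """Smoothed success rate from verified outcomes so far."""
        key = (task_type, model)
        return (self.wins[key] + self.prior_wins) / (self.trials[key] + self.prior_trials)

    def route(self, task_type):
        # Prefer the highest estimated success; the cheaper model breaks ties.
        best = max(
            self.costs,
            key=lambda m: (self.estimate(task_type, m), -self.costs[m]),
        )
        if self.estimate(task_type, best) < self.min_success:
            return "human"
        return best

# Hypothetical models and prices.
router = EfficacyRouter({"model_a": 1.0, "model_b": 3.0})
first_choice = router.route("rightsizing")   # no evidence yet, so a human handles it
for _ in range(3):
    router.record("rightsizing", "model_b", True)
    router.record("rightsizing", "model_a", False)
learned_choice = router.route("rightsizing")
```

Because the estimates update with every verified outcome, the route changes when a new model version starts winning, without anyone editing a config. A real system would also need some exploration so that models without recent evidence still get tried; this sketch leaves that out.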
What Intelligence Benchmarking Is Missing
Suresh Mathew (Founder & CEO)
Yes, economic value is the right proxy and it’s overdue.
Benchmarks like MMLU tell you a model is smart, but ALE asks if it’s worth paying for. That’s a much harder bar, and the 0% on the hardest tier shows why the intelligence vs. economic discourse matters.
For enterprises, the ROI reframe is simple: stop asking “Can AI do X?” and start asking “At what tier of X, and with what human backstop?” This compresses the time before an expert needs to step in rather than fully replacing them.
"The 0% on the hardest tier shows why the intelligence vs. economic discourse matters."
We see this at Sedai. Agents handling routine cloud optimization like rightsizing, cost anomalies, and scaling decisions deliver real economic value daily because we scope them to where they actually excel and drive results, not because they’re job-ready across the board.
Economic value tied to hours, expertise, and output is exactly how enterprises should be evaluating AI ROI right now, not capability in the abstract.
Stop guessing where to route your models. See how Sedai does that for you.