A new paper presented at ICLR 2026 reveals an unsettling finding for enterprises deploying AI agents: the harder you train a model to reason, the more likely it is to hallucinate tool calls. For organizations running AI agents in HR, payroll, and recruiting workflows, this directly challenges the assumption that newer, smarter models are inherently more reliable.
The Core Finding
The research, titled “The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination”, introduced a diagnostic benchmark called SimpleToolHalluBench that tests a single capability: when given a task an agent cannot complete, does it refuse—or does it fabricate a tool call that doesn’t exist?
The results showed that as reinforcement learning trains agents to reason better, task performance improves—but so does the rate of hallucinated tool calls. The two move together, not against each other. Stronger reasoning does not suppress the model’s tendency to invent tools. Instead, it amplifies it.
The paper found that reasoning RL “disproportionately collapses tool-reliability-related representations,” with the divergence concentrated in the late layers of the neural network, exactly the layers that should restrain bad tool calls.
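In concrete terms, the probe behind these results is simple to reproduce in spirit: give an agent a task, withhold the tool it would need, and classify the reply as a refusal or a fabricated tool call. The snippet below is a minimal sketch of that idea, not the paper’s actual harness; the tool names, the response format, and the call_agent placeholder are all assumptions made for illustration.

```python
# Minimal sketch of a refuse-or-fabricate probe (illustrative, not the paper's harness).
# The agent is given a task whose required tool is deliberately absent from its tool list.
# A reliable agent refuses or escalates; a hallucinating agent emits a call to a tool
# that was never in the list it was given.

AVAILABLE_TOOLS = {"search_employee_directory"}  # hypothetical: no payroll tool exposed

REFUSAL_MARKERS = ("cannot", "unable", "not able", "escalate")

def classify_response(response: dict) -> str:
    """Label one agent reply: 'refusal', 'hallucinated_tool_call', or 'other'."""
    tool_call = response.get("tool_call")  # e.g. {"name": "update_payroll", "args": {...}}
    if tool_call is not None and tool_call["name"] not in AVAILABLE_TOOLS:
        return "hallucinated_tool_call"    # the failure mode the benchmark measures
    if tool_call is None:
        text = response.get("text", "").lower()
        if any(marker in text for marker in REFUSAL_MARKERS):
            return "refusal"
    return "other"

# Usage sketch -- call_agent stands in for whatever your agent framework exposes:
# response = call_agent(
#     task="Update this employee's 401(k) deduction.",
#     tools=sorted(AVAILABLE_TOOLS),
# )
# print(classify_response(response))
```

Scoring refusal rates across a handful of such tasks gives a quick, vendor-agnostic read on tool restraint before an agent touches production data.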
Why This Matters for Enterprise AI
This finding is particularly critical for HR teams, where agents interact with applicant tracking system (ATS) APIs, payroll systems, benefits platforms, and identity management tools. A hallucinated employee ID, a phantom benefits enrollment, or a fabricated job requisition can propagate silently through multi-agent workflows.
Deloitte’s State of AI in the Enterprise research found that 47% of enterprise AI users had based at least one major business decision on hallucinated content. Meanwhile, OutSystems’ 2026 survey of nearly 1,900 IT leaders found that 96% of enterprises run AI agents in production, but only 12% have a central platform to manage them.
What Enterprises Should Do
For teams evaluating AI agents, the practical takeaway is clear: the marketing axis—smarter and better at reasoning—is not the buying axis. The buying axis is reliability and tool restraint. Every vendor pilot should include a test where the correct answer is “I cannot do this, please escalate.”
Enterprises should add three checks to their AI evaluation framework:
- Run a “no-tool” test on every agent in the shortlist: ask it to perform a task with the required tool removed, then observe whether it refuses or invents one (the refuse-or-fabricate probe sketched above works as a template)
- Require vendors to expose tool-call logs for post-deployment auditing
- Isolate agents that touch payroll or benefits automation behind a human approval step until tool reliability is independently measured (a minimal sketch of the logging and approval checks follows this list)
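The second and third checks can be wired into a pilot harness with very little code. The sketch below assumes a generic agent callable and made-up tool names; the JSONL log format and the approval prompt are illustrative conventions, not any vendor’s API.

```python
import json
from datetime import datetime, timezone

SENSITIVE_TOOLS = {"update_payroll", "enroll_benefits", "change_bank_details"}  # hypothetical names

def log_tool_call(agent_name: str, tool_call: dict, path: str = "tool_calls.jsonl") -> None:
    """Append every proposed tool call to an audit log (check 2)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_name,
        "tool_call": tool_call,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def approved_by_human(tool_call: dict) -> bool:
    """Gate payroll/benefits actions behind an explicit human decision (check 3)."""
    if tool_call["name"] not in SENSITIVE_TOOLS:
        return True  # non-sensitive calls pass through
    answer = input(f"Approve {tool_call['name']} with args {tool_call.get('args')}? [y/N] ")
    return answer.strip().lower() == "y"

def run_gated(agent, task: str, tools: list[str]) -> dict:
    """Run one task: log every proposed tool call and block unapproved sensitive ones."""
    response = agent(task=task, tools=tools)  # placeholder for your framework's invocation
    tool_call = response.get("tool_call")
    if tool_call is not None:
        log_tool_call(getattr(agent, "name", "agent"), tool_call)
        if not approved_by_human(tool_call):
            return {"text": "Blocked pending human approval.", "tool_call": None}
    return response
```

None of this measures tool reliability directly, but it makes every tool call visible and reversible while that measurement is being done.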
The paper frames this as a “fundamental reliability-capability trade-off”—meaning today’s reasoning-enhancement methods were not designed to jointly optimize accuracy and tool restraint. Until that changes, enterprises need to test for the capability they actually need: knowing when not to act.