Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability: Understanding AI delegation risks

By Agustin Giovagnoli / May 15, 2026

The DELEGATE-52 research examines whether large language models can be trusted as autonomous delegates for long-horizon document editing and knowledge work. The work matters because delegation shifts the risk from giving bad advice to silently altering the source records that businesses rely on [1][2][3].

What the DELEGATE-52 study found: quick summary for leaders

DELEGATE-52 spans 52 professional domains and evaluates delegated workflows of up to 20 steps, using chained backtranslation to measure semantic preservation without human annotation [1][2][3]. The team tested 19 models and found that every system degraded documents over long runs. Even top frontier models corrupted roughly 25% of content by the end of extended interactions, and degradation worsened with longer horizons, larger documents, and the presence of distractor files [1][2][3].

How long-horizon reliability differs from single-step capability

Short, single-step capability metrics did not predict long-run performance in delegated editing. Models were close to lossless on Python and code workflows, yet proved unsuitable across most other professional domains. This gap indicates that code-heavy benchmarks can overestimate readiness for real-world delegated work where documents serve as systems of record [1][2].

The mechanics of document corruption: rare but severe silent failures

Most damage arose from sparse but severe errors. Roughly 80% of total degradation came from single steps that silently lost or corrupted at least 10% of a document. Stronger models tended to postpone rather than prevent these failures, which makes them hard to detect with standard short tasks or spot checks [1][2][3]. External commentary converges on a workflow integrity framing: the primary hazard is not visible hallucination in answers, but corruption of the underlying source records that workflows depend on [4][5].
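
To make the "sparse but severe" pattern concrete, the following is a minimal sketch of how an operator might attribute degradation to individual steps. It assumes you already have a per-step preservation score between 0 and 1; the function name `degradation_breakdown`, the 0.10 threshold on absolute score drop, and the sample scores are illustrative assumptions, not the paper's metric or data.

```python
from typing import List, Tuple


def degradation_breakdown(scores: List[float], threshold: float = 0.10) -> Tuple[float, float]:
    """Return (total_degradation, share_from_severe_steps).

    scores[0] is the score of the untouched document (normally 1.0) and
    scores[i] is the score after step i. Only drops are counted; any
    step where the score recovers contributes zero.
    """
    # Per-step loss: how much the score fell compared with the previous step.
    drops = [max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:])]
    total = sum(drops)
    # "Severe" steps are those whose single-step drop meets the threshold.
    severe = sum(d for d in drops if d >= threshold)
    share = severe / total if total > 0 else 0.0
    return total, share


# Made-up trajectory: four small drops and one large one at step 3.
example_scores = [1.00, 0.99, 0.98, 0.86, 0.85, 0.84]
total, share = degradation_breakdown(example_scores)
print(f"total degradation: {total:.2f}, share from severe steps: {share:.0%}")
```

A high share from a handful of steps suggests that averaging scores across a run would hide the failures that matter most.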

Why tools and agentic behavior don’t fix long-horizon corruption

Adding agentic behaviors and file-access tools did not substantially mitigate long-horizon corruption on DELEGATE-52. Tool-augmented agents still suffered from the same failure modes during extended, multi-step editing, underscoring that the core issue is workflow integrity rather than tool availability [1][2]. For an overview of the research context, see the Microsoft Research page [3].

AI delegation risks for businesses

The findings map cleanly to document automation risks. Code editing appears comparatively safe under this benchmark, while most non-code professional documents remain unreliable for autonomous delegation. In regulated or near-zero-tolerance contexts, organizations should keep system-of-record edits under deterministic control or require human verification before changes land in production repositories [1][2][7]. Summaries from practitioners reinforce that the risk profile centers on long-horizon reliability rather than one-off accuracy checks [4][5][6].

Practical mitigations and recommended workflow controls

Emerging practice recommendations emphasize containment and verification:

Constrain delegated tasks to narrow scopes and short horizons to limit cumulative risk [1][2][4].
Use LLMs for robust subtasks such as pairwise ranking instead of direct, multi-step source edits [1][2][6].
Keep humans or specialized systems in control of system-of-record operations, with explicit verification gates before commits [1][2][7].
Audit edits periodically and watch for single-step catastrophic losses, since these account for the bulk of damage [1][2][4].
Prefer privacy-preserving, controlled infrastructures for deployment and review [8].

These controls help reduce exposure to silent failures in LLM workflows while still capturing value from narrow, supervised use cases.
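
One way to picture the verification gate described above is a check that runs before a delegated edit is committed. This is a sketch under stated assumptions: the names `lost_fraction` and `gate`, the 2% loss budget, and the line-level comparison are hypothetical choices, and a lexical diff is only a rough proxy for semantic preservation.

```python
import difflib
from typing import Tuple


def lost_fraction(original: str, edited: str) -> float:
    """Fraction of original lines that have no in-order match in the edited text."""
    a = original.splitlines()
    b = edited.splitlines()
    if not a:
        return 0.0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Sum the sizes of all matching blocks (the final sentinel block has size 0).
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(a)


def gate(original: str, edited: str, max_loss: float = 0.02) -> Tuple[str, float]:
    """Route an edit to auto-commit or to human review based on how much was lost."""
    loss = lost_fraction(original, edited)
    decision = "commit" if loss <= max_loss else "needs_human_review"
    return decision, loss


record = "\n".join(f"row {i}" for i in range(10))
appended = record + "\nrow 10"                      # additive edit, nothing lost
truncated = "\n".join(record.splitlines()[:7])      # silently drops three rows
print(gate(record, appended))
print(gate(record, truncated)[0])
```

Because a legitimately rewritten line also counts as "lost" here, the budget would need to reflect the intended scope of the task; edits that exceed it go to a reviewer rather than being rejected outright.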

How to test your workflows: applying DELEGATE-52 lessons

Operators can adapt the benchmark’s approach to audit their own pipelines:

Simulate chained edits over longer horizons and track semantic preservation against the original source [1][2][3].
Introduce distractor files to surface error modes that only appear in more realistic file contexts [1][2][3].
Measure where and when content loss spikes, with special attention to single-step drops of 10% or more [1][2][3].
Compare behavior across domains to avoid overgeneralizing from code-heavy tests [1][2].
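
The audit steps above can be sketched as a small harness. It assumes each task is paired with an inverse instruction so that a faithful delegate returns the document to its original state, which is one way to approximate the round-trip idea without annotation. Everything named here is hypothetical: the `Delegate` signature, `toy_delegate` (a stand-in for a real model call that simulates loss when distractors are present), the file name `distractor.txt`, and the use of `difflib` as a crude lexical substitute for a semantic score.

```python
import difflib
from typing import Callable, List, Sequence, Tuple

# (document, instruction, context_files) -> new document
Delegate = Callable[[str, str, Sequence[str]], str]


def preservation(original: str, current: str) -> float:
    """Crude lexical stand-in for a semantic preservation score, in [0, 1]."""
    return difflib.SequenceMatcher(a=original, b=current, autojunk=False).ratio()


def run_round_trips(delegate: Delegate, original: str,
                    round_trips: Sequence[Tuple[str, str]],
                    context_files: Sequence[str] = ()) -> List[float]:
    """Apply each (forward, backward) pair and score the result against the original."""
    doc, scores = original, [1.0]
    for forward, backward in round_trips:
        doc = delegate(doc, forward, context_files)
        doc = delegate(doc, backward, context_files)
        scores.append(preservation(original, doc))
    return scores


def severe_steps(scores: Sequence[float], threshold: float = 0.10) -> List[int]:
    """Indices of round trips whose single-step score drop meets the threshold."""
    return [i for i in range(1, len(scores)) if scores[i - 1] - scores[i] >= threshold]


def toy_delegate(doc: str, instruction: str, context_files: Sequence[str]) -> str:
    """Toy stand-in for a model: drops the last line on 'lower' if distractors exist."""
    lines = doc.splitlines()
    if instruction == "upper":
        lines = [line.upper() for line in lines]
    elif instruction == "lower":
        lines = [line.lower() for line in lines]
        if context_files and len(lines) > 1:
            lines = lines[:-1]  # simulated silent loss
    return "\n".join(lines)


source = "alpha\nbeta\ngamma\ndelta"
trips = [("upper", "lower")] * 3
clean = run_round_trips(toy_delegate, source, trips)
noisy = run_round_trips(toy_delegate, source, trips, context_files=["distractor.txt"])
print("severe steps without distractors:", severe_steps(clean))
print("severe steps with distractors:", severe_steps(noisy))
```

In a real audit, `toy_delegate` would be replaced by the pipeline under test, and the same harness would be run per domain so that results from code-heavy documents are not generalized to other document types.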

Policy and governance alignment: institutional guidance

Broader institutional guidance aligns with assistive use, human review, and controlled infrastructures. University guidance for marketing and communications emphasizes responsible, privacy-preserving use and retaining human oversight, which mirrors the study’s recommendations for high-stakes content and system-of-record operations [8]. Practitioner guidance on building reliable LLM workflows and document automation similarly stresses scoped tasks, verification, and specialized systems where tolerance for error is low [6][7].

Sources

[1] LLMs Corrupt Your Documents When You Delegate – arXiv (PDF)
https://arxiv.org/pdf/2604.15597

[2] LLMs Corrupt Your Documents When You Delegate – arXiv (HTML)
https://arxiv.org/html/2604.15597v1

[3] LLMs Corrupt Your Documents When You Delegate – Microsoft Research
https://www.microsoft.com/en-us/research/publication/llms-corrupt-your-documents-when-you-delegate/

[4] AI Document Corruption: 7 Critical Lessons
https://www.progressiverobot.com/2026/05/13/ai-document-corruption/

[5] Microsoft Research finds AI model degradation is quietly corrupting your work documents
https://www.goml.io/blog/microsoft-research-finds-ai-model-degradation

[6] How to Build Reliable LLM-Based Workflows | CompatibL
https://www.compatibl.com/insights/how-to-build-reliable-llm-based-workflows/

[7] The Capabilities and Limitations of Large Language Models in Document Automation | Parseur®
https://parseur.com/blog/llms-document-automation-capabilities-limitations

[8] AI guidelines for marketing and communications | University Communications
https://ucomm.stanford.edu/policies-and-guidance/ai-guidelines-marketing-and-communications