AI models reportedly degrade long documents silently during complex tasks. Limiting the damage requires dividing the work into verifiable micro-tasks and adding human oversight, RAG and strict observability.
When an AI has to keep thousands of pages, entire codebases or hours of audio transcripts in mind, we speak of long context: the model must be able to process a massive amount of information in one go.
However, according to a recent Microsoft Research study, the LLMs on the market, when left to their own devices on long tasks, silently corrupt nearly 25% of company documents. By chaining 20 consecutive interactions on each document, the researchers measured how far the document degrades over time. Even cutting-edge models such as Gemini 3.1 Pro, Claude 4.6 Opus and GPT 5.4 corrupt an average of 25% of document content on long, delegated workflows. Other models fail even more severely: the average across all models is around 50%. Python fares well, reaching the 98% fidelity threshold.
The reliability of LLMs thus decreases significantly as tasks become longer and more complex. Fortunately, several measures can reduce the risk and make AI systems more reliable.
1. Break complex tasks into verifiable micro-tasks
According to the Microsoft Research study, each additional 1,000 tokens worsens the degradation by approximately 3.6% on average after 20 interactions. The AI can lose the thread, omit important details, or even invent information to fill the gaps.
To overcome this, it is recommended to break the process into small steps, each with a clear and verifiable objective. For example, for a task such as writing a report from multiple sources, the AI can be asked first to summarize each source individually, then to compare the summaries, and finally to synthesize the comparisons. A study entitled “An Approach for Systematic Decomposition of Complex LLM Tasks”, published in October 2025, showed that complexity-based task decomposition can improve the accuracy of language models by 9-40% compared to a standard “Chain-of-Thought” approach.
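The three-step report example can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the cited study: `build_report`, `check_summary` and `fake_llm` are hypothetical names, the model is passed in as a plain callable, and the verification rule (a non-empty summary shorter than its source) is only a placeholder for a real check.

```python
from typing import Callable, List


def check_summary(source: str, summary: str) -> bool:
    """Cheap verifiable check: the summary is non-empty and shorter than its source."""
    return 0 < len(summary.strip()) < len(source)


def build_report(sources: List[str], llm: Callable[[str], str]) -> str:
    """Summarize each source, compare the summaries, then synthesize the comparison."""
    summaries = []
    for i, source in enumerate(sources):
        # Micro-task 1: one small prompt per source, verified before moving on.
        summary = llm(f"Summarize this source in a few sentences:\n{source}")
        if not check_summary(source, summary):
            raise ValueError(f"Summary of source {i} failed verification")
        summaries.append(summary)

    # Micro-task 2: the model only sees the short summaries, not the full sources.
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(summaries))
    comparison = llm(f"Compare these summaries and list agreements and conflicts:\n{numbered}")
    if not comparison.strip():
        raise ValueError("Comparison step returned nothing")

    # Micro-task 3: final synthesis from the comparison alone.
    return llm(f"Write a short report from this comparison:\n{comparison}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical): echoes part of the prompt's last line."""
    return prompt.splitlines()[-1][:40]
```

Because each step receives a short input and is checked before the next one starts, a failure stops the pipeline at a known step instead of propagating silently into the final report.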
In multi-agent systems, each agent works on a small part of the problem, which reduces the risk of an overall error. For onboarding, for example, the inspecting agent checks that the administrative files are present in a folder. The “HR Secretary” agent then takes the inspector’s raw report and turns it into a friendly message.
Still in this example, when we launch the onboarding agents, the inspecting agent reports the missing document. The “HR Secretary” agent then writes an email that is ready to send: it welcomes the employee and asks for the mutual insurance certificate. This division of labor should give better control and greater precision.
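The division of labor between the two agents can be illustrated without any LLM. The sketch below is not the article's CrewAI code: the function names, the employee name and the list of required files are hypothetical, and each "agent" is reduced to a plain function so that the hand-off between them is easy to see.

```python
from pathlib import Path
from typing import List

# Hypothetical list of required administrative files (an assumption for illustration).
REQUIRED_FILES = ["contract.txt", "id_card.txt", "mutual.txt"]


def inspect_folder(folder: Path, required: List[str]) -> List[str]:
    """Inspector step: return the required files that are missing from the folder."""
    return [name for name in required if not (folder / name).is_file()]


def write_message(employee: str, missing: List[str]) -> str:
    """Secretary step: turn the inspector's raw list into a friendly message."""
    if not missing:
        return f"Welcome, {employee}! Your onboarding file is complete."
    items = ", ".join(missing)
    return f"Welcome, {employee}! To complete your file, could you please send us: {items}?"


if __name__ == "__main__":
    missing_files = inspect_folder(Path("onboarding_jean"), REQUIRED_FILES)
    print(write_message("Jean", missing_files))
```

The inspector's output is a small, checkable value (a list of file names), so a human or a test can verify it before the second step writes anything.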
2. Implement human-in-the-loop feedback and verification loops
Document corruption in long contexts is insidious: the AI does not report its own errors. Verification mechanisms can be used to examine the consistency and accuracy of responses at different stages. The study “Applications, challenges, and future directions of human-in-the-loop learning” highlights that human-in-the-loop systems can achieve high accuracy in reliability detection (0.95).
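An automated check can decide when a human needs to look. The sketch below is an assumption-laden illustration, not the metric used in the Microsoft Research study: it uses Python's standard `difflib` to compare a document before and after an AI pass, and the function names and the default `threshold` are hypothetical. It suits tasks in which most of the document is supposed to survive unchanged.

```python
import difflib
from typing import List


def fidelity(original: str, edited: str) -> float:
    """Line-level similarity in [0, 1] between two versions (1.0 means identical)."""
    matcher = difflib.SequenceMatcher(None, original.splitlines(), edited.splitlines())
    return matcher.ratio()


def lost_lines(original: str, edited: str) -> List[str]:
    """Non-blank lines of the original that no longer appear anywhere in the edited version."""
    kept = set(edited.splitlines())
    return [line for line in original.splitlines() if line.strip() and line not in kept]


def needs_human_review(original: str, edited: str, threshold: float = 0.98) -> bool:
    """Flag the result when similarity drops below the threshold or any line was lost."""
    return fidelity(original, edited) < threshold or bool(lost_lines(original, edited))
```

Running this after every interaction, rather than once at the end, makes it possible to see at which step content started to disappear.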
When asking for a document, you can for example have the AI produce a draft that requires proofreading and validation at different stages. In an agentic system, we can insert “checkpoints” into the Python code so that human review safeguards the quality of the response.
For example, staying with onboarding, we can insert code that adds a human confirmation step after the inspecting agent's result. To do this, we add the following code at the end of the first agent’s task in the verif_onboarding.py file, which is in a way the brain of the system:

    # --- HUMAN CONFIRMATION STEP ---
    # Run only the Inspector first so that its report can be validated before continuing.
    # `inspecteur` and `mission_verif` are the agent and task defined earlier in this file,
    # and `Crew` is assumed to be imported from crewai at the top of the file.
    equipe_verif = Crew(
        agents=[inspecteur],
        tasks=[mission_verif]
    )
    rapport_inspecteur = equipe_verif.kickoff()
    print("\n--- INSPECTOR'S RAW REPORT ---")
    print(rapport_inspecteur)
    validation = input("\nIs this report accurate? Do you want to send the message to the employee? (yes/no): ")
    if validation.strip().lower() != 'yes':
        # Anything other than an explicit "yes" stops the pipeline here.
        print("Action cancelled. No message was sent to the employee.")
        print("Tip: check the 'onboarding_jean/' folder manually and run the script again.")
    else:
        ...  # the Secretary is launched here; this part is not shown in the source

After triggering the system, we see that the Inspector agent has correctly identified that mutual.txt is missing from the folder. The code has paused execution and is waiting for human validation before launching the Secretary. This is the expected behavior.
3. Use RAG (Retrieval Augmented Generation)
If their training data is outdated or incomplete, LLMs can invent answers. The Microsoft Research study even shows that LLMs can forget key information as the context lengthens.
In this context, supplying recent and reliable information from your own databases is a sound approach. With RAG, only the relevant information is passed to the LLM as context for generating its response, which significantly reduces hallucinations. The study “Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG” suggests that RAG can complement long-context LLMs by giving them a more focused and reliable information retrieval mechanism.
For example, in our RAG chatbot built on a text about the Eiffel Tower, we replace that text with a file describing the shipping and delivery policy for fictitious products, white shirts in this case:
“Preparation of the order: All orders are prepared and dispatched within 24 to 48 working hours (excluding weekends and public holidays). Delivery methods and times: Standard Delivery: 3 to 5 working days to your home. Express Delivery: 24 to 48 hours to your home (for any order placed before noon).”
We replace the contents of `data/document.txt`, then run `python app.py` in PowerShell. When a user questions the chatbot, it answers appropriately using the text provided.
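The retrieval step at the heart of RAG can be sketched in a few lines. This is not the chatbot's actual `app.py`, which the article does not show: the function names are hypothetical, and simple word overlap stands in for the embedding search that a real RAG system would typically use. The point is only that the model receives the most relevant passage rather than the whole document.

```python
import re
from typing import List, Set


def split_into_chunks(text: str) -> List[str]:
    """Split the document into sentence-sized chunks."""
    return [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]


def tokenize(text: str) -> Set[str]:
    """Lowercase word set used for the overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


document = (
    "All orders are prepared and dispatched within 24 to 48 working hours. "
    "Standard Delivery: 3 to 5 working days to your home. "
    "Express Delivery: 24 to 48 hours to your home."
)
question = "How long does express delivery take?"
prompt = build_prompt(question, retrieve(question, split_into_chunks(document), k=1))
```

Here only the express delivery sentence ends up in `prompt`, so the model has little room to confuse it with the standard delivery times.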
4. Set up detailed logging of AI interactions
Identifying when and how silent corruption occurs may require careful monitoring. Recording the requests sent, the responses received, the human validations and the corrections made allows AI performance to be audited. According to an analysis of best practices for AI debugging in 2025, AI-based debugging tools saw their problem resolution rates improve from 4.4% in 2023 to 69.1% in 2025. Although this figure cannot be attributed to logging alone, detailed logging is what allows these tools to work effectively.
We can use an observability platform such as Datadog LLM Observability, LangSmith or Arize Phoenix. These make it possible to monitor, understand and resolve problems with AI systems throughout their lifecycle. Mastering them lets you treat an LLM pipeline rigorously.
We can also record interactions with the AI by integrating a logging system into the application. Every call to the AI, every piece of data processed and every human validation is time-stamped and recorded. This allows performance to be audited, error patterns to be identified, and the system to be continually improved.
For example, for the onboarding agents, we configure a logging system in the verif_onboarding.py file. Once it is in place, we run `Get-Content onboarding.log` in PowerShell to view the entire file. Here the log reveals a connection error: the AI tries to contact OpenAI (Failed to connect to OpenAI API) and CrewAI’s telemetry server, but fails several times. This clears the AI's “memory” of blame and points instead to an infrastructure problem, such as the internet connection or the OpenAI servers.
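A logging setup of this kind can be sketched with Python's standard `logging` module. The article does not show the configuration added to verif_onboarding.py, so everything below is an assumption: `make_logger` and `logged_call` are hypothetical helpers, and the model is again represented by a plain callable.

```python
import logging
import time
from typing import Callable


def make_logger(path: str = "onboarding.log") -> logging.Logger:
    """Create a logger that writes time-stamped records to a file."""
    logger = logging.getLogger("onboarding")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def logged_call(logger: logging.Logger, step: str, llm: Callable[[str], str], prompt: str) -> str:
    """Call the model and record the request, the response, the duration and any failure."""
    logger.info("step=%s request=%r", step, prompt)
    start = time.perf_counter()
    try:
        response = llm(prompt)
    except Exception:
        # logger.exception also writes the traceback, e.g. a connection error.
        logger.exception("step=%s failed after %.2fs", step, time.perf_counter() - start)
        raise
    logger.info("step=%s response=%r duration=%.2fs", step, response, time.perf_counter() - start)
    return response
```

Human decisions can be recorded through the same logger, for instance with `logger.info("human_validation=%s", validation)`, so that requests, responses and approvals end up in one time-ordered file.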




