Most RAG Problems Are R(etrieval) Problems
Most RAG blog posts read like product brochures. After building a few systems over the last few months and reading way too many production post-mortems, I’m pretty convinced the LLM is usually not the thing that breaks first.
Especially not in EU mid-market deployments.
A few failure modes I see again and again:
1. Retrieval quality falls apart somewhere between 10k and 40k docs
The demo with 500 PDFs looks amazing.
Then the first real pilot starts, somebody uploads 30k documents from SharePoint and suddenly top-3 retrieval becomes semi-random.
Typical example:
The query is “Lieferantenbewertung 2024” (supplier evaluation 2024).
What comes back:
- a supplier evaluation form from 2019
- three meeting notes because they contain the word “Lieferant” (supplier)
- the actually correct document maybe at rank 4 or 5
This problem is way more common than most tutorials mention.
What people in production seem to converge on:
- hybrid retrieval (BM25 + dense)
- reciprocal rank fusion
- reranker on top (Cohere if budget exists, BGE reranker otherwise)
- separate indexes per document type
Honestly, adding a reranker solved more quality issues for us than changing the LLM ever did.
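For reference, reciprocal rank fusion itself is only a few lines. This is a minimal sketch, assuming you already have one ranked list of document IDs from BM25 and one from the dense retriever. The document IDs are made up for illustration, and k=60 is just the commonly used default, not something we tuned.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID to stay deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "Lieferantenbewertung 2024".
bm25_hits = ["eval_2019", "eval_2024", "notes_a", "notes_b"]
dense_hits = ["eval_2024", "notes_c", "eval_2019", "notes_a"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly float to the top, which is the point: neither retriever alone has to be right. The reranker then only needs to look at the first handful of fused results.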
2. German enterprise PDFs are completely cursed
Most demos run on clean PDFs.
Real document stores are:
- scanned contracts from 1998
- supplier manuals with 3-column layouts
- rotated tables
- faxed quality reports
- old encodings destroying umlauts
pypdf turns many of these into complete garbage text.
Things I saw multiple times already:
- ü becoming weird symbols
- tables flattened into unreadable prose
- footnotes injected into random sentences
- OCR artifacts treated as actual content
The stack that currently works reasonably well:
- Marker for most docs
- Docling as fallback
- VLM pass for ugly tables
This preprocessing layer is very unsexy work, but it is probably 30% of the actual implementation effort.
And if you skip it, any RAG quality you measure later is fake-good.
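A cheap way to decide when to fall back from one extractor to the next is a garbage check on the extracted text. The sketch below is a heuristic under stated assumptions: the thresholds are guesses, the regex only catches the most common broken-umlaut pattern, and the extractors are plain callables standing in for whatever wraps Marker, Docling or a VLM pass in your pipeline (no real library API is shown here).

```python
import re

# Typical debris when UTF-8 text was decoded as Latin-1/cp1252
# ("ü" turns into "Ã¼"), plus the Unicode replacement character.
MOJIBAKE = re.compile(r"Ã[\u0080-\u00bf]|\ufffd")


def looks_garbled(text, max_bad_ratio=0.01, min_alpha_ratio=0.5):
    """Heuristic: too much encoding debris or too few letters overall."""
    if not text.strip():
        return True
    bad = len(MOJIBAKE.findall(text))
    alpha = sum(ch.isalpha() for ch in text)
    return bad / len(text) > max_bad_ratio or alpha / len(text) < min_alpha_ratio


def extract_with_fallback(path, extractors):
    """Try each (name, callable) in order and return the first clean result."""
    for name, extract in extractors:
        text = extract(path)
        if not looks_garbled(text):
            return name, text
    return None, ""  # nothing usable: send to manual review or a VLM pass


# Stub extractors for illustration only (hypothetical stand-ins).
extractors = [
    ("primary", lambda path: "PrÃ¼fbericht"),
    ("fallback", lambda path: "Prüfbericht"),
]
result = extract_with_fallback("report.pdf", extractors)
print(result)
```

It will not catch flattened tables or misplaced footnotes, but it keeps the most obviously broken text out of the index.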
3. Hallucinations are not the real production problem
Every stakeholder asks: “What about hallucinations?”
Almost nobody asks: “What if the source itself is outdated?”
From what I’ve seen, this kills more pilots than hallucinations do.
The model gives a perfectly grounded answer. It cites the right document. The document is just no longer valid.
Or worse: two valid documents disagree and the system confidently picks one.
What seems to work:
- recency decay in retrieval scoring
- contradiction checks across retrieved chunks
- confidence thresholds + human handoff
A lot of “hallucination problems” are actually retrieval problems wearing a fake mustache.
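Recency decay can be as simple as multiplying the retrieval score by an exponential factor. A minimal sketch, assuming each document carries a reliable date in its metadata; the one-year half-life and the example scores are illustrative assumptions, and in practice the half-life probably differs per document type.

```python
from datetime import date


def recency_weighted(score, doc_date, today, half_life_days=365.0):
    """Halve the retrieval score for every half_life_days of document age."""
    age_days = max((today - doc_date).days, 0)
    return score * 0.5 ** (age_days / half_life_days)


today = date(2025, 1, 1)
# (doc_id, raw similarity score, document date); values are made up.
hits = [
    ("eval_2019", 0.90, date(2019, 3, 1)),
    ("eval_2024", 0.80, date(2024, 3, 1)),
]
ranked = sorted(
    ((doc_id, recency_weighted(score, doc_date, today)) for doc_id, score, doc_date in hits),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

Here the 2019 form has the higher raw score, but after decay the 2024 document ranks first. This only helps with outdated sources; two current documents that disagree still need the contradiction check.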
4. Permissions become a disaster very fast
This one appears in basically every internal rollout thread.
The assistant accidentally answers something using an HR spreadsheet or salary export the user should never have seen.
Technically the solution is easy: permission filtering before semantic retrieval.
In reality:
- SharePoint permissions are ancient
- metadata missing
- nobody knows document ownership anymore
- legal says ask IT
- IT says ask department head
- department head left in 2021
In EU environments this gets even more annoying, because GDPR turns it from an “oops” into a potentially reportable incident.
Honestly, I would no longer start a pilot before the customer can explain who should access what.
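The "easy" technical part, filtering by permission before any semantic ranking happens, could look roughly like this. It is a sketch under assumptions: the Chunk type, the group names and the toy overlap score are all hypothetical, and a real system would push the filter into the vector store's metadata query instead of filtering in Python.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset = field(default_factory=frozenset)


def visible_chunks(chunks, user_groups):
    """Keep only chunks the user may see. Missing ACL metadata means no access."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


def search(query, chunks, user_groups, score_fn, top_k=3):
    """Filter by permission first, then rank only what is left."""
    candidates = visible_chunks(chunks, user_groups)
    ranked = sorted(candidates, key=lambda c: score_fn(query, c.text), reverse=True)
    return ranked[:top_k]


def overlap_score(query, text):
    """Toy relevance score (shared words), standing in for a real retriever."""
    return len(set(query.lower().split()) & set(text.lower().split()))


chunks = [
    Chunk("hr_salaries", "salary export 2024 all employees", frozenset({"hr"})),
    Chunk("supplier_eval", "supplier evaluation 2024", frozenset({"purchasing", "quality"})),
    Chunk("orphan", "salary bands draft"),  # no ACL metadata at all
]
results = search("salary 2024", chunks, {"purchasing"}, overlap_score)
print([c.doc_id for c in results])
```

The important design choice is the default: a document without permission metadata is invisible to everyone, not visible to everyone. That makes the missing-metadata problem loud instead of dangerous.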
5. Re-embedding costs are massively underestimated
Everybody budgets the first embedding run.
Almost nobody budgets:
- daily delta updates
- re-embedding after model upgrades
- vector storage growth
- multi-vector indexing
Embedding APIs look cheap until somebody realizes the SharePoint dump contains 800 million tokens.
What seems to become the default setup now:
- local embedding models after ~10k docs
- incremental indexing pipelines from day one
- embedding model versioning in metadata
Otherwise migrations become painful very quickly.
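Incremental indexing and model versioning fit together nicely if every indexed document stores a content hash and the name of the embedding model that produced its vectors. A minimal sketch, assuming documents are available as plain text keyed by ID; the model identifier and the metadata layout are hypothetical.

```python
import hashlib

EMBEDDING_MODEL = "local-embedder-v1"  # hypothetical model identifier


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, index_meta, model=EMBEDDING_MODEL):
    """Return (doc IDs to embed, doc IDs to delete).

    documents:  {doc_id: text} as currently present in the source system
    index_meta: {doc_id: {"hash": str, "model": str}} stored next to the vectors
    """
    to_embed = []
    for doc_id, text in documents.items():
        meta = index_meta.get(doc_id)
        # New document, changed content, or vectors from an older model.
        if meta is None or meta["hash"] != content_hash(text) or meta["model"] != model:
            to_embed.append(doc_id)
    # Documents that disappeared from the source must leave the index too.
    to_delete = [doc_id for doc_id in index_meta if doc_id not in documents]
    return to_embed, to_delete


documents = {"a": "unchanged", "b": "changed text", "c": "brand new"}
index_meta = {
    "a": {"hash": content_hash("unchanged"), "model": EMBEDDING_MODEL},
    "b": {"hash": content_hash("old text"), "model": EMBEDDING_MODEL},
    "d": {"hash": content_hash("removed"), "model": EMBEDDING_MODEL},
}
to_embed, to_delete = plan_reindex(documents, index_meta)
print(to_embed, to_delete)
```

The daily delta then only pays for "b" and "c", and a model upgrade is the same code path: change the model name and everything shows up as needing re-embedding, which at least makes the cost visible before you run it.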
The EU / German Mittelstand angle
This changes the architecture more than many US blog posts suggest.
On-premise is usually the default ask now.
GDPR plus the Art. 28 data processing agreements eliminate half the providers immediately. Most legal departments only accept a very small shortlist without months of discussions.
Also: right-to-erasure with vector DBs is more annoying than many teams expect. If embeddings are derived from customer documents, you need to know exactly where they are.
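Knowing "exactly where they are" mostly means keeping a mapping from source document to every vector derived from it, across all indexes. A sketch of that bookkeeping, assuming a hypothetical in-memory registry and a delete callback standing in for the vector DB's delete call; in production this mapping would live in a durable store.

```python
from collections import defaultdict


class VectorRegistry:
    """Tracks which vectors in which index were derived from which document."""

    def __init__(self):
        self._by_doc = defaultdict(set)

    def record(self, doc_id, index_name, vector_id):
        self._by_doc[doc_id].add((index_name, vector_id))

    def erase(self, doc_id, delete_fn):
        """Delete every vector derived from doc_id; return how many were removed."""
        locations = self._by_doc.pop(doc_id, set())
        for index_name, vector_id in sorted(locations):
            delete_fn(index_name, vector_id)
        return len(locations)


registry = VectorRegistry()
registry.record("contract_17", "contracts", "v1")
registry.record("contract_17", "contracts", "v2")
registry.record("contract_17", "summaries", "v9")
registry.record("manual_3", "manuals", "v5")

deleted = []  # stands in for real delete calls against the vector DB
count = registry.erase("contract_17", lambda index, vid: deleted.append((index, vid)))
print(count, deleted)
```

Multi-vector indexing and per-type indexes are exactly what makes this necessary: one contract can end up as chunks in one index and as a summary embedding in another.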
It still feels like many teams underestimate how much “boring infrastructure work” goes into production RAG systems.
The LLM part is honestly often the easiest component.
If you want a longer version with concrete vendor breakdowns and cost ranges, we wrote one up here: RAG mit eigenen Daten (in German). The broader take on agentic AI in EU-regulated environments: KI-Agenten im Mittelstand 2026.