Your LLM passes demos. It fails in production.
We built the layer that catches what every other framework ships.
Python · TypeScript · Rust · MIT · local-first · daily-refreshed leaderboard
We ran 70 adversarial tests against 6 popular RAG frameworks. Same LLM, same embedder, same retrieval config — only the framework changes. The result:
| Rank | Framework | Overall | Injection | Contradiction |
|---|---|---|---|---|
| 🥇 | Wauldo | 97 % | 88 % | 100 % |
| 🥈 | Vanilla LLM | 86 % | 68 % | 100 % |
| 🥉 | CrewAI | 71 % | 48 % | 58 % |
| 4 | Haystack | 60 % | 36 % | 33 % |
| 4 | LangChain | 60 % | 36 % | 25 % |
| 6 | LlamaIndex | 46 % | 48 % | 8 % |
Adding a RAG framework often makes things worse. The second-best finisher is no framework at all — just stuffing sources into a prompt beats LangChain, LlamaIndex and Haystack on adversarial robustness.
→ See the full leaderboard · → Run it yourself · → Read the methodology
- Adversarial bench. 6 RAG frameworks. 70 tests. Daily refresh. Open-source, MIT.
- Fast local RAG in Rust. BM25 + FTS5 + sentence chunking. Optional trust verification.
- Curated papers, tools and benchmarks on RAG hallucination, prompt injection, and verified generation.
```shell
# Leaderboard — 30 s smoke test, no API key needed
git clone https://github.com/wauldo/wauldo-leaderboard.git && cd wauldo-leaderboard
make build && make smoke
```
```shell
# ragrs — local RAG CLI, index + query + optional trust verification
cargo install ragrs
ragrs index ./docs && ragrs query "your question here" --verify
```

```python
from wauldo import guard

# Wrap any LangChain / LlamaIndex / Haystack output with a trust score
result = guard(answer=llm_output, sources=retrieved_sources)
# result.trust_score → 0.0 … 1.0
# result.verdict → "SAFE" | "CONFLICT" | "UNVERIFIED" | "BLOCK"
# result.reason → "contradiction between src[1] and src[2]"
```

Repos: Python · TypeScript · Rust
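One way your app might act on the score and verdict — a minimal sketch; the thresholds and function name are illustrative, not part of the SDK:

```python
def route_answer(trust_score: float, verdict: str, answer: str) -> str:
    # Illustrative policy: thresholds are app-specific, not SDK defaults.
    if verdict == "BLOCK":
        return "Sorry, I can't answer that safely."
    if verdict == "CONFLICT" or trust_score < 0.5:
        # Surface the answer, but flag it so the UI can show a warning.
        return f"[low confidence] {answer}"
    return answer
```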
Three deterministic controls on top of any existing RAG pipeline — not another framework, a layer you plug into the output.
| 1. Pre-LLM source filter | 2. Post-LLM verify | 3. Numeric trust score |
|---|---|---|
| Every retrieved chunk is classified as data or instruction. Documents with forged ADMIN: markers, imperatives or hidden overrides are stripped before they reach the model. | The answer is fact-checked against the sources that actually reached the model. Deterministic token overlap + structural comparison. No LLM-as-judge, no randomness. | Every answer returns trust_score ∈ [0, 1] plus a verdict: SAFE, CONFLICT, UNVERIFIED, BLOCK. Your app decides what to do with low-trust responses. |
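To make the "deterministic token overlap" idea concrete, here is a rough sketch of what such a check can look like. This is illustrative only, not Wauldo's actual scorer (which also does structural comparison):

```python
def overlap_score(answer: str, sources: list[str]) -> float:
    """Fraction of answer tokens that appear in at least one source.

    Illustrative sketch of a deterministic overlap check: no model calls,
    no randomness, same input always yields the same score.
    """
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    source_tokens: set[str] = set()
    for src in sources:
        source_tokens.update(src.lower().split())
    supported = answer_tokens & source_tokens
    return len(supported) / len(answer_tokens)
```

An answer whose tokens are all grounded in the retrieved text scores 1.0; tokens with no support in any source drag the score down.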
| Metric | Value |
|---|---|
| Adversarial pass rate | 97 % (67 / 70) |
| Hallucination rate | 0 % across 100+ bench runs |
| Prompt injection resistance | 88 % (vs 36 % LangChain) |
| Contradiction detection | 100 % (vs 25 % LangChain) |
| Frameworks benchmarked | 6 — daily refresh |
| SDK registries | PyPI · npm · crates.io |
| License | MIT — dataset, scorer, SDKs, CLIs |
| Stack | Rust (17 crates), local-first |
wauldo.com · Leaderboard · Benchmarks · Guard · Docs · Demo · @wauldoAI
Model-agnostic pipeline: performance is driven by verification, not model size. Built by developers who got tired of watching their agent confidently ship wrong answers to real users.