Your LLM passes demos. It fails in production.
We built the layer that catches what every other framework ships.
Python · TypeScript · Rust · MIT · local-first · daily-refreshed leaderboard
We ran 70 adversarial tests against 6 popular RAG frameworks. Same LLM, same embedder, same retrieval config — only the framework changes. The result:
| Rank | Framework | Overall | Injection | Contradiction |
|---|---|---|---|---|
| 🥇 | Wauldo | 97 % | 88 % | 100 % |
| 🥈 | Vanilla LLM | 86 % | 68 % | 100 % |
| 🥉 | CrewAI | 71 % | 48 % | 58 % |
| 4 | Haystack | 60 % | 36 % | 33 % |
| 4 | LangChain | 60 % | 36 % | 25 % |
| 6 | LlamaIndex | 46 % | 48 % | 8 % |
Adding a RAG framework often makes things worse. The second-best finisher is no framework at all — just stuffing sources into a prompt beats LangChain, LlamaIndex and Haystack on adversarial robustness.
→ See the full leaderboard · → Run it yourself · → Read the methodology
- Adversarial bench. 6 RAG frameworks. 70 tests. Daily refresh. Open-source, MIT.
- Fast local RAG in Rust. BM25 + FTS5 + sentence chunking. Optional trust verification.
- Curated papers, tools and benchmarks on RAG hallucination, prompt injection, and verified generation.
```shell
# Leaderboard — 30 s smoke test, no API key needed
git clone https://github.com/wauldo/wauldo-leaderboard.git && cd wauldo-leaderboard
make build && make smoke
```
```shell
# ragrs — local RAG CLI, index + query + optional trust verification
cargo install ragrs
ragrs index ./docs && ragrs query "your question here" --verify
```

```python
from wauldo import guard

# Wrap any LangChain / LlamaIndex / Haystack output with a trust score
result = guard(answer=llm_output, sources=retrieved_sources)
# result.trust_score → 0.0 … 1.0
# result.verdict → "SAFE" | "CONFLICT" | "UNVERIFIED" | "BLOCK"
# result.reason → "contradiction between src[1] and src[2]"
```

Repos: Python · TypeScript · Rust
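One way your app might act on the score and verdict — a minimal sketch; the thresholds and function name are illustrative, not part of the SDK:

```python
def route_answer(trust_score: float, verdict: str, answer: str) -> str:
    # Illustrative policy: thresholds are app-specific, not SDK defaults.
    if verdict == "BLOCK":
        return "Sorry, I can't answer that safely."
    if verdict == "CONFLICT" or trust_score < 0.5:
        # Surface the answer, but flag it so the UI can show a warning.
        return f"[low confidence] {answer}"
    return answer
```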
Three deterministic controls on top of any existing RAG pipeline — not another framework, a layer you plug into the output.
| 1. Pre-LLM source filter | 2. Post-LLM verify | 3. Numeric trust score |
|---|---|---|
| Every retrieved chunk is classified as data or instruction. Documents with forged ADMIN: markers, imperatives or hidden overrides are stripped before they reach the model. | The answer is fact-checked against the sources that actually reached the model. Deterministic token overlap + structural comparison. No LLM-as-judge, no randomness. | Every answer returns trust_score ∈ [0, 1] plus a verdict: SAFE, CONFLICT, UNVERIFIED, BLOCK. Your app decides what to do with low-trust responses. |
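To make the "deterministic token overlap" idea concrete, here is a rough sketch of what such a check can look like. This is illustrative only, not Wauldo's actual scorer (which also does structural comparison):

```python
def overlap_score(answer: str, sources: list[str]) -> float:
    """Fraction of answer tokens that appear in at least one source.

    Illustrative sketch of a deterministic overlap check: no model calls,
    no randomness, same input always yields the same score.
    """
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    source_tokens: set[str] = set()
    for src in sources:
        source_tokens.update(src.lower().split())
    supported = answer_tokens & source_tokens
    return len(supported) / len(answer_tokens)
```

An answer whose tokens are all grounded in the retrieved text scores 1.0; tokens with no support in any source drag the score down.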
| Metric | Value |
|---|---|
| Adversarial pass rate | 97 % (67 / 70) |
| Hallucination rate | 0 % across 100+ bench runs |
| Prompt injection resistance | 88 % (vs 36 % LangChain) |
| Contradiction detection | 100 % (vs 25 % LangChain) |
| Frameworks benchmarked | 6 — daily refresh |
| SDK registries | PyPI · npm · crates.io |
| License | MIT — dataset, scorer, SDKs, CLIs |
| Stack | Rust (17 crates), local-first |
wauldo.com · Leaderboard · Benchmarks · Guard · Docs · Demo · @wauldoAI
Model-agnostic pipeline: performance is driven by verification, not model size. Built by developers who got tired of watching their agent confidently ship wrong answers to real users.