Engram is a runtime benchmark for evaluating long-term memory in AI agents. It measures memory behavior after the original context window is gone: whether an agent can recover grounded, specific knowledge from prior sessions, fall back to only partial recollection, abstain safely, or hallucinate.
Unlike static QA or retrieval-only tests, Engram runs inside the agent runtime itself: it seeds multi-turn conversation histories, waits for memory processing to settle, then probes recall in a fresh session with no haystack in context. Whatever memory architecture the agent actually uses is what gets measured.
Engram is intended to be benchmark-first and system-neutral. The benchmark defines the task format, runtime protocol, scoring rubric, and artifact requirements; systems such as OpenClaw, OpenClaw memory-plugin variants, or any third-party agent are evaluated against the same procedure.
The benchmark is system-neutral, but this repository currently focuses on a practical OpenClaw evaluation lineup built around one baseline runtime and four memory-system integrations:
| Benchmark track | Runtime setup | Upstream project | Notes |
|---|---|---|---|
| `baseline` | OpenClaw reference agent with the standard benchmark workspace and no additional benchmark-specific memory augmentation | OpenClaw | Use this as the no-augmentation control row. In practice, teams often keep only the standard OpenClaw memory-core or session-memory path enabled here. |
| `mem0` | OpenClaw with the Mem0-backed memory plugin enabled | serenichron/openclaw-memory-mem0 | Mem0-backed semantic memory via a self-hosted Mem0 REST API. |
| `clawvault` | OpenClaw with ClawVault installed and wired into the runtime | Versatly/clawvault | Structured, local-first memory with markdown storage, graph-aware retrieval, and session lifecycle primitives. |
| `lossless-claw` | OpenClaw with the Lossless-Claw context engine enabled | @martian-engineering/lossless-claw | DAG-based lossless context compaction and expansion under the same OpenClaw runtime family. |
| `cortex` | OpenClaw with the Cortex memory plugin enabled | Ubundi/openclaw-cortex | Knowledge-graph-oriented long-term memory with auto-recall, auto-capture, and direct memory tools. |
Important: Engram does not install, enable, or configure these memory systems for you. The benchmark measures whatever runtime state your agent already exposes. The --condition flag labels the evaluated configuration and enables benchmark-side behavior where implemented, such as Cortex preflight/date handling and condition-aware settle defaults.
Engram is designed to answer three benchmark questions:
- Can an agent retrieve grounded project details from prior sessions?
- Can it preserve rationale, evolution, and cross-session synthesis rather than only isolated facts?
- How does it trade off grounded recall, abstention, and hallucination under a fixed runtime protocol?
Engram is not trying to reduce memory evaluation to binary QA accuracy. The benchmark intentionally distinguishes four outcomes: grounded recall, generic but underspecified recall, abstention, and hallucinated specificity.
Official benchmark artifacts for the current public release:
| Artifact | Value |
|---|---|
| Benchmark release | Engram v3.0 (engram-v3.0) |
| Tasks | 498 |
| Question types | 9 |
| Primary metric | Mean memory-quality judge score (0-3) |
| Secondary metrics | Grounded rate, hallucination rate, abstention rate, per-category scores |
| Official protocol | Seed -> Settle -> Probe -> Judge (engram-runtime-v1) |
Engram v3.0 has a frozen official public setting for benchmark-comparable runs:
- Split: `v3`
- Evaluated answer model: must be disclosed and recorded for every run
- Judge model: `gpt-4.1-mini`
- Judge passes: `3`
- Judge temperature: `0.3`
- Required artifacts: `metrics.json`, `run_metadata.json`, `predictions.jsonl`, `seed_turns.jsonl`, `probes.jsonl`, `judgments.jsonl`
See docs/benchmark_release_v3.md for the full release policy, including condition-specific settle defaults and reporting requirements.
Engram v3 contains 498 tasks spanning 9 question types, targeting failure modes that commonly stress long-term agent memory systems:
| Category | Count | What it tests |
|---|---|---|
| `multi-session` | 79 | Facts requiring information from multiple separate conversations |
| `temporal-reasoning` | 78 | Ordering and recency: distinguishing current from historical facts |
| `cross-agent-memory` | 71 | Knowledge shared or referenced across different agent instances |
| `multi-hop-reasoning` | 68 | Connecting facts via intermediate entities across the session corpus |
| `recurring-pattern` | 54 | Conventions and patterns established repeatedly across sessions |
| `knowledge-update` | 53 | Tracking how facts evolved: decisions reversed or revised over time |
| `single-session-user` | 45 | Direct recall of specifics stated by the user in a single session |
| `single-session-assistant` | 32 | Recall of specifics stated by the assistant in a single session |
| `fact-recall` | 18 | Direct retrieval of a single grounded specific fact |
The Engram v3 dataset is hosted on HuggingFace and fetched automatically on first run.
```python
from benchmark.tasks.hf import fetch_engram_dataset

path = fetch_engram_dataset()  # downloads and caches locally
```

| Property | Value |
|---|---|
| Tasks | 498 |
| Avg haystack sessions per task | 3.0 |
| Avg haystack turns per task | 30.1 |
| Question types | 9 |
| Format | JSON |
| HuggingFace | matthewschramm/engram-v3 |
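Because the split is plain JSON, quick sanity checks are easy to script. A minimal sketch that tallies tasks per category, assuming each record exposes a `question_type` field (that field name is an assumption for illustration; the real schema helpers live in `benchmark/tasks/`):

```python
from collections import Counter

# Hypothetical task records for illustration only; the actual schema is
# defined under benchmark/tasks/, and "question_type" is an assumed name.
tasks = [
    {"id": "t1", "question_type": "multi-session"},
    {"id": "t2", "question_type": "temporal-reasoning"},
    {"id": "t3", "question_type": "multi-session"},
]

def category_counts(records):
    """Tally tasks per question type, mirroring the category table above."""
    return Counter(r["question_type"] for r in records)

print(category_counts(tasks))
```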
The dataset is public; no authentication is required.
- Python 3.10+
- `uv` for environment management and repeatable installs
- `openclaw` on `PATH`, only if you plan to use `--agent openclaw`
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"
uv sync --dev
```

This creates a local virtualenv and installs the package plus development dependencies. After that, run commands with `uv run ...`. The package also exposes a console entry point named `benchmark-run`.
If you prefer an activated virtualenv workflow:
```bash
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
```

Smoke-test the setup:

```bash
uv run benchmark-run --agent local_stub --dry-run --max-tasks 5
```

Useful local helpers:

- `make run` runs a local-stub benchmark into `outputs/`
- `scripts/run_dev.sh` runs the lightweight `dev` split
- `scripts/run_test.sh` runs a tiny `test`-split smoke test
Start your agent server or CLI-backed runtime, then point the benchmark at it:
```bash
JUDGE_API_KEY="<key>" uv run benchmark-run --agent http://localhost:8080
```

Engram seeds memory sessions into the agent, waits for memory processing to settle, probes recall in a fresh session, and judges responses with an LLM. If `JUDGE_API_KEY` is omitted, the run still executes but LLM judging is skipped.
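For a custom HTTP agent, the authoritative request/response contract is specified in docs/integration_guide.md. Purely as an illustration of the shape such a server takes, here is a stdlib sketch; the `/message` path and the `text`/`reply` JSON fields are invented for this example and are not the real contract:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_message(payload: dict) -> dict:
    """Illustrative handler: a real agent would route this into its
    runtime and memory system instead of echoing."""
    text = payload.get("text", "")
    return {"reply": f"echo: {text}"}

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, dispatch, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(handle_message(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8080), AgentHandler).serve_forever()
```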
Supported adapter entry points today:
- `local_stub` for deterministic offline smoke tests
- `openclaw` for the OpenClaw CLI adapter
- `http://...` or `https://...` for a custom agent server
- `codex` and `openai` exist as scaffold stubs and are not benchmark-ready yet
See docs/integration_guide.md for the HTTP server contract, the OpenClaw CLI adapter, and a custom Python adapter option.
```bash
make format
make lint
make test
make check
make fetch
make fetch-test
```

Validate a finished run before submitting:

```bash
uv run python scripts/validate_submission.py outputs/<run-id>
```

This validates the required leaderboard artifacts and checks the official-release metadata fields.
On a fresh Ubuntu EC2 instance, the `systemd --user` daemon is not started by default. The OpenClaw installer runs as a user-level systemd service, and its final health check will crash if the daemon isn't initialized, even though the binaries installed correctly.
Before running the OpenClaw installer, run:
```bash
sudo loginctl enable-linger $USER
```

Then disconnect and reconnect your SSH session so PAM generates the correct D-Bus environment variables. After reconnecting, run the OpenClaw installer as normal.
Already installed and it crashed?
If the installer failed with a systemctl is-enabled unavailable error, you don't need to wipe the server:
```bash
# Add OpenClaw to PATH
export PATH="/home/ubuntu/.npm-global/bin:$PATH"

# Repair the missing service files
openclaw doctor --repair

# Reload and start the gateway
systemctl --user daemon-reload
openclaw gateway restart
```

After installing OpenClaw, the first run opens an interactive TUI where the agent asks you to define its identity. To ensure every benchmark instance starts from the same baseline, use these answers:
| Prompt | Answer |
|---|---|
| Onboarding mode | QuickStart |
| Model | anthropic/claude-sonnet-4-6 |
| Channel | Skip for now |
| Configure skills? | Yes (skip all API key prompts) |
| Enable hooks? | boot-md, session-memory |
| How do you want to hatch? | Hatch in TUI |
For the canonical OpenClaw reference track used in this repo, keep this answer model fixed across compared conditions and record it in the benchmark run with --answer-model anthropic/claude-sonnet-4-6.
Once the TUI opens and the agent says "Who am I?", send these messages in order:
Message 1 (Identity):

Your name is Benchmark. You are a memory evaluation agent. Your emoji is π. Your vibe is neutral and precise: no personality flourishes, just clear and direct responses. Call me Operator.
Message 2 (Purpose):
You will be used to benchmark long-term memory recall. Conversations will be seeded into you, and then you'll be asked questions about them in a fresh session. Answer questions directly from what you remember. If you don't remember, say so honestly. Do not guess or hallucinate.
Message 3 (Finalize):
Update IDENTITY.md and USER.md now. Delete BOOTSTRAP.md when done. Don't modify SOUL.md.
Wait for the agent to confirm it has written the files, then exit the TUI (Ctrl+C).
Alternative: copy template files directly (skips TUI hatching)
If you prefer to skip the interactive hatching entirely, copy the benchmark workspace templates into the OpenClaw workspace:
```bash
cp engram-benchmark/workspace-templates/IDENTITY.md ~/.openclaw/workspace/IDENTITY.md
cp engram-benchmark/workspace-templates/USER.md ~/.openclaw/workspace/USER.md
rm -f ~/.openclaw/workspace/BOOTSTRAP.md
```

Install the benchmark on the evaluation machine:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"
git clone https://github.com/Ubundi/engram-benchmark.git && cd engram-benchmark
uv sync --dev
```

Confirm everything is wired up before starting a real run:
```bash
uv run benchmark-run --agent local_stub --dry-run --max-tasks 3
```

Benchmark runs take hours. Always run inside tmux so a disconnected SSH session doesn't kill the process:

```bash
tmux new -s benchmark
```

If you get disconnected, reconnect with `tmux attach -t benchmark`.
Use a distinct OpenClaw agent or workspace per benchmark condition, or perform a verified full memory reset between runs. Reusing the same agent state across conditions will contaminate comparisons.
Suggested pattern:
- `baseline-agent-id`
- `cortex-agent-id`
- `mem0-agent-id`
- `clawvault-agent-id`
- `lossless-claw-agent-id`
Reference baseline:
```bash
JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <baseline-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition baseline \
  --output-dir outputs/baseline
```

Cortex:
```bash
JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <cortex-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition cortex \
  --output-dir outputs/cortex
```

Mem0:
```bash
JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <mem0-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition mem0 \
  --output-dir outputs/mem0
```

ClawVault:
```bash
JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <clawvault-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition clawvault \
  --output-dir outputs/clawvault
```

Lossless-Claw:
```bash
JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <lossless-claw-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition lossless-claw \
  --flush-sessions \
  --output-dir outputs/lossless-claw
```

Compare two runs offline:

```bash
uv run benchmark-run \
  --agent local_stub \
  --compare outputs/baseline/<run-id> outputs/cortex/<run-id>
```

The `JUDGE_API_KEY` is an OpenAI-compatible API key used by the LLM judge (defaults to `gpt-4.1-mini`). Run IDs are printed at the start of each run and visible as directory names under `outputs/`. Offline comparisons also write a Markdown report such as `comparison-baseline-vs-cortex.md` next to the compared runs.
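For quick scripting on top of finished runs, the per-run `metrics.json` files can also be diffed directly. This sketch assumes a top-level `"overall"` field, which may not match the actual schema; inspect your own `metrics.json` before relying on it:

```python
import json
from pathlib import Path

def load_overall(run_dir: str) -> float:
    """Read the overall score from a run directory's metrics.json.
    The "overall" key is an assumed field name, for illustration."""
    return json.loads(Path(run_dir, "metrics.json").read_text())["overall"]

def delta(baseline_dir: str, candidate_dir: str) -> float:
    """Positive delta means the candidate condition scored higher."""
    return load_overall(candidate_dir) - load_overall(baseline_dir)
```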
Useful flags:
| Flag | Default | Description |
|---|---|---|
| `--condition NAME` | – | Condition label (`baseline`, `mem0`, `clawvault`, `lossless-claw`, `cortex`). Records the evaluated memory configuration and enables any condition-specific benchmark behavior |
| `--agent-id ID` | – | OpenClaw agent ID (passed to `openclaw agent --agent`) |
| `--answer-model MODEL` | – | Evaluated model used to generate answers; keep fixed across controlled comparisons |
| `--settle-seconds N` | auto | Wait between seed and probe (cortex=180s, mem0=60s, lossless-claw=30s, baseline/clawvault=10s, other=120s) |
| `--judge-passes N` | 3 | LLM judge passes per response (scores averaged) |
| `--judge-concurrency N` | 4 | Parallel judge workers |
| `--flush-sessions` | – | Send `/new` after each seed session to trigger memory hooks |
| `--skip-seed` | – | Skip seeding; probe a pre-seeded agent only |
| `--max-tasks N` | – | Run a subset of N tasks |
| `--judge-model` | `gpt-4.1-mini` | Judge model name |
| `--openclaw-timeout N` | 120 | Timeout in seconds for `openclaw agent` CLI calls |
| `--compare DIR_A DIR_B` | – | Compare two run directories offline and write a Markdown comparison report (still requires `--agent` because of CLI parsing) |
Engram uses a four-phase pipeline:
Seed -> Settle -> Probe -> Judge

- Seed: replay haystack sessions into the agent turn by turn via the agent runtime
- Settle: wait for memory indexing and async processing to complete (cortex: 180s, baseline: 10s)
- Probe: ask evaluation questions in a fresh session with no haystack in context
- Judge: score responses 0-3 against ground truth using a multi-pass LLM judge
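The four phases above amount to one loop per task. The adapter and judge interfaces in this sketch are illustrative assumptions, not the actual classes under `benchmark/`:

```python
import time

def run_task(agent, judge, task, settle_seconds: float) -> float:
    """Sketch of Seed -> Settle -> Probe -> Judge for a single task.
    The agent/judge methods and task field names are assumed for
    illustration; see benchmark/adapters/ for the real interfaces."""
    # Seed: replay each haystack session into the agent turn by turn.
    for session in task["haystack_sessions"]:
        for turn in session:
            agent.send(turn)
    # Settle: give async memory indexing time to complete.
    time.sleep(settle_seconds)
    # Probe: ask the question in a fresh session, no haystack in context.
    agent.reset_session()
    answer = agent.send(task["question"])
    # Judge: score 0-3 against ground truth.
    return judge.score(task["answer"], answer)
```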
Scoring rubric:
| Score | Label | Description |
|---|---|---|
| 3 | Grounded correct | Cites the specific detail from the haystack |
| 2 | Generic correct | Right direction, missing the specific |
| 1 | Abstained | Honest "I don't have that context" |
| 0 | Hallucinated | Wrong specific stated with confidence |
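The secondary rates fall directly out of this rubric: score 3 counts as grounded, 1 as abstained, 0 as hallucinated. A minimal sketch, assuming scores have already been collapsed to integer rubric labels (the official pipeline averages multiple judge passes, so treat this as illustrative only):

```python
def secondary_metrics(scores: list[int]) -> dict:
    """Derive the mean score and secondary rates from per-task rubric
    labels, following the 0-3 scoring rubric above."""
    n = len(scores)
    return {
        "mean": sum(scores) / n,
        "grounded_rate": scores.count(3) / n,
        "abstention_rate": scores.count(1) / n,
        "hallucination_rate": scores.count(0) / n,
    }
```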
See docs/evaluation_protocol.md for full protocol specification.
The table below is a reference example showing how Engram reports results for one evaluated runtime family. It is not the definition of the benchmark, and it should not be read as the only intended use of Engram.
Reference run results on a live OpenClaw agent, reported on March 4, 2026. Scores are on a 0-3 scale.
| Condition | Overall | Rationale | Synthesis | Evolution | Temporal | Grounded | Abstained |
|---|---|---|---|---|---|---|---|
| Baseline (native memory only) | 1.10 | 1.93 | 0.54 | 1.07 | 0.62 | 4% | 64% |
| Memory-augmented | 1.95 | 3.00 | 2.00 | 2.10 | 0.67 | 48% | 12% |
| Δ | +0.85 | +1.07 | +1.46 | +1.03 | +0.05 | +44pp | −52pp |
Key findings:
- Rationale recall reaches 3.00: the reasoning behind decisions is fully preserved with memory augmentation
- Synthesis (facts spanning multiple sessions) improves from near-impossible (0.54) to reliable (2.00)
- Temporal reasoning (+0.05) is the hardest category: semantic retrieval surfaces historical and current facts without reliable recency ranking
- Memory value compounds across runs: a second seeding pass raised the overall score from 1.81 to 1.95
Future benchmark reports should include multiple systems or conditions under the same pinned settings. See docs/leaderboard.md for the submission and governance policy. If the evaluated answer model changes, treat that as a separate row or track rather than a controlled delta.
Each run produces outputs/<run_id>/ containing:
| File | Contents |
|---|---|
| `predictions.jsonl` | Per-task agent responses |
| `metrics.json` | Aggregate and per-category scores |
| `run_metadata.json` | Full run configuration, answer model, git commit, provenance |
| `seed_turns.jsonl` | Seeded conversation turns with latency |
| `probes.jsonl` | Probe session transcripts with latency |
| `judgments.jsonl` | Per-response judge scores, rationale, and pass scores |
| `report.md` | Human-readable Markdown report with full per-probe detail |
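The JSONL artifacts are convenient for ad-hoc triage. A sketch that surfaces the lowest-scoring probes from `judgments.jsonl`, where the `task_id` and `score` field names are assumptions; inspect one line of your own file to confirm them:

```python
import json
from pathlib import Path

def worst_probes(run_dir: str, k: int = 5) -> list[dict]:
    """Return the k lowest-scoring judgment records from a run directory.
    Field names ("score", "task_id") are assumed for illustration."""
    lines = Path(run_dir, "judgments.jsonl").read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return sorted(records, key=lambda r: r["score"])[:k]
```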
Offline comparisons write comparison-<condition-a>-vs-<condition-b>.md next to the compared run directories rather than inside an individual run folder.
```
engram-benchmark/
├── benchmark/       CLI, adapters, task loading, judging, reports
│   ├── tasks/       Split loading, HuggingFace fetch, schema helpers
│   ├── adapters/    Agent adapters: local_stub, http, openclaw, openai, codex
│   ├── evaluators/  QA, retrieval, and abstention metrics
│   └── reports/     Run artifact writers and Markdown reports
├── data/            Schemas, notes, and CI-safe sample splits
├── docs/            Benchmark spec, protocol, system matrix, integration guide
├── leaderboard/     Submission format and leaderboard policy
├── outputs/         Run artifacts (gitignored)
├── scripts/         Validation and helper entry points
└── tests/           Import, CLI, adapter, and schema tests
```
```bibtex
@software{engram2026,
  title  = {Engram: A Runtime Benchmark for Agent Long-Term Memory Recall},
  author = {Ubundi},
  year   = {2026},
  url    = {https://github.com/Ubundi/engram-benchmark},
}
```

MIT
Engram is an open-source project by Ubundi, a South African venture studio shaping human-centred AI. Based in Cape Town, Ubundi builds at the intersection of AI capability and African context.
Engram grew out of a need to rigorously measure what memory systems actually retain. Existing benchmarks often emphasize in-context recall; Engram is built to test what survives after the context window is gone. The result is a reproducible runtime evaluation intended for internal benchmarking, public comparison, and eventual community adoption.

