
Ubundi/engram-benchmark

Engram — Measuring what AI agents remember

License: MIT · Python 3.10+ · CI · Dataset on HF

Dataset · Benchmark Spec · Official Release · Evaluation Protocol · Related Work · Dataset Card · Leaderboard Policy · Integration Guide


Engram is a runtime benchmark for evaluating long-term memory in AI agents. It measures memory behavior after the original context window is gone: whether an agent can recover grounded, specific knowledge from prior sessions, fall back to only partial recollection, abstain safely, or hallucinate.

Unlike static QA or retrieval-only tests, Engram runs inside the agent runtime itself: it seeds multi-turn conversation histories, waits for memory processing to settle, then probes recall in a fresh session with no haystack in context. Whatever memory architecture the agent actually uses is what gets measured.

Engram is intended to be benchmark-first and system-neutral. The benchmark defines the task format, runtime protocol, scoring rubric, and artifact requirements; systems such as OpenClaw, OpenClaw memory-plugin variants, or any third-party agent are evaluated against the same procedure.

Evaluated Memory Systems

The benchmark is system-neutral, but this repository currently focuses on a practical OpenClaw evaluation lineup built around one baseline runtime and four memory-system integrations:

| Benchmark track | Runtime setup | Upstream project | Notes |
| --- | --- | --- | --- |
| `baseline` | OpenClaw reference agent with the standard benchmark workspace and no additional benchmark-specific memory augmentation | OpenClaw | Use this as the no-augmentation control row. In practice, teams often keep only the standard OpenClaw memory-core or session-memory path enabled here. |
| `mem0` | OpenClaw with the Mem0-backed memory plugin enabled | serenichron/openclaw-memory-mem0 | Mem0-backed semantic memory via a self-hosted Mem0 REST API. |
| `clawvault` | OpenClaw with ClawVault installed and wired into the runtime | Versatly/clawvault | Structured, local-first memory with markdown storage, graph-aware retrieval, and session lifecycle primitives. |
| `lossless-claw` | OpenClaw with the Lossless-Claw context engine enabled | @martian-engineering/lossless-claw | DAG-based lossless context compaction and expansion under the same OpenClaw runtime family. |
| `cortex` | OpenClaw with the Cortex memory plugin enabled | Ubundi/openclaw-cortex | Knowledge-graph-oriented long-term memory with auto-recall, auto-capture, and direct memory tools. |

Important: Engram does not install, enable, or configure these memory systems for you. The benchmark measures whatever runtime state your agent already exposes. The --condition flag labels the evaluated configuration and enables benchmark-side behavior where implemented, such as Cortex preflight/date handling and condition-aware settle defaults.

What Engram Measures

Engram is designed to answer three benchmark questions:

  • Can an agent retrieve grounded project details from prior sessions?
  • Can it preserve rationale, evolution, and cross-session synthesis rather than only isolated facts?
  • How does it trade off grounded recall, abstention, and hallucination under a fixed runtime protocol?

Engram is not trying to reduce memory evaluation to binary QA accuracy. The benchmark intentionally distinguishes four outcomes: grounded recall, generic but underspecified recall, abstention, and hallucinated specificity.

Official benchmark artifacts for the current public release:

| Artifact | Value |
| --- | --- |
| Benchmark release | Engram v3.0 (`engram-v3.0`) |
| Tasks | 498 |
| Question types | 9 |
| Primary metric | Mean memory-quality judge score (0–3) |
| Secondary metrics | Grounded rate, hallucination rate, abstention rate, per-category scores |
| Official protocol | Seed → Settle → Probe → Judge (`engram-runtime-v1`) |

Official Benchmark Setting

Engram v3.0 has a frozen official public setting for benchmark-comparable runs:

  • Split: v3
  • Evaluated answer model: must be disclosed and recorded for every run
  • Judge model: gpt-4.1-mini
  • Judge passes: 3
  • Judge temperature: 0.3
  • Required artifacts: metrics.json, run_metadata.json, predictions.jsonl, seed_turns.jsonl, probes.jsonl, judgments.jsonl

See docs/benchmark_release_v3.md for the full release policy, including condition-specific settle defaults and reporting requirements.
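Before submitting, it is worth confirming that all six required artifacts are present. A minimal sketch of that check (the authoritative validator is `scripts/validate_submission.py`; this only tests file presence, not the metadata fields):

```python
from pathlib import Path

# Required artifacts for an official Engram v3.0 run (from the release setting above).
REQUIRED_ARTIFACTS = [
    "metrics.json",
    "run_metadata.json",
    "predictions.jsonl",
    "seed_turns.jsonl",
    "probes.jsonl",
    "judgments.jsonl",
]

def missing_artifacts(run_dir: str) -> list[str]:
    """Return the required artifact filenames absent from a run directory."""
    root = Path(run_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).exists()]
```

Unlike the official validator, this does not inspect file contents, so treat an empty result as necessary but not sufficient for a valid submission.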


Task Categories

Engram task category examples

Engram v3 contains 498 tasks spanning 9 question types, targeting failure modes that commonly stress long-term agent memory systems:

| Category | Count | What it tests |
| --- | --- | --- |
| multi-session | 79 | Facts requiring information from multiple separate conversations |
| temporal-reasoning | 78 | Ordering and recency — distinguishing current from historical facts |
| cross-agent-memory | 71 | Knowledge shared or referenced across different agent instances |
| multi-hop-reasoning | 68 | Connecting facts via intermediate entities across the session corpus |
| recurring-pattern | 54 | Conventions and patterns established repeatedly across sessions |
| knowledge-update | 53 | Tracking how facts evolved — decisions reversed or revised over time |
| single-session-user | 45 | Direct recall of specifics stated by the user in a single session |
| single-session-assistant | 32 | Recall of specifics stated by the assistant in a single session |
| fact-recall | 18 | Direct retrieval of a single grounded specific fact |

Dataset

The Engram v3 dataset is hosted on HuggingFace and fetched automatically on first run.

from benchmark.tasks.hf import fetch_engram_dataset
path = fetch_engram_dataset()  # downloads and caches locally
| Property | Value |
| --- | --- |
| Tasks | 498 |
| Avg haystack sessions per task | 3.0 |
| Avg haystack turns per task | 30.1 |
| Question types | 9 |
| Format | JSON |
| HuggingFace | matthewschramm/engram-v3 |

The dataset is public — no authentication required.
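Once fetched, each task is a JSON record. A self-contained sketch of working with one; note the field names below (`task_id`, `question_type`, `haystack_sessions`, and so on) are illustrative assumptions, not the official schema, which lives with the dataset card and the `data/` schema files:

```python
import json

# Illustrative record only: these field names are assumptions, not the
# official Engram task schema (check the dataset card / data/ schemas).
sample_task = json.loads("""
{
  "task_id": "demo-001",
  "question_type": "fact-recall",
  "question": "Which database did the user choose for the billing service?",
  "haystack_sessions": [
    {"turns": [
      {"role": "user", "content": "Let's use Postgres for billing."},
      {"role": "assistant", "content": "Noted: Postgres for the billing service."}
    ]}
  ]
}
""")

# Count how much haystack material a task would seed into the agent.
n_turns = sum(len(s["turns"]) for s in sample_task["haystack_sessions"])
print(sample_task["question_type"], n_turns)  # fact-recall 2
```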


Installation

Prerequisites

  • Python 3.10+
  • uv for environment management and repeatable installs
  • openclaw on PATH only if you plan to use --agent openclaw

Recommended setup

curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"
uv sync --dev

This creates a local virtualenv and installs the package plus development dependencies. After that, run commands with uv run .... The package also exposes a console entry point named benchmark-run.

Alternative editable install

If you prefer an activated virtualenv workflow:

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Quickstart

1. Dry run (full pipeline smoke test, no external agent required)

uv run benchmark-run --agent local_stub --dry-run --max-tasks 5

Useful local helpers:

  • make run runs a local-stub benchmark into outputs/
  • scripts/run_dev.sh runs the lightweight dev split
  • scripts/run_test.sh runs a tiny test-split smoke test

2. Run against a live agent

Start your agent server or CLI-backed runtime, then point the benchmark at it:

JUDGE_API_KEY="<key>" uv run benchmark-run --agent http://localhost:8080

Engram seeds memory sessions into the agent, waits for memory processing to settle, probes recall in a fresh session, and judges responses with an LLM. If JUDGE_API_KEY is omitted, the run still executes but LLM judging is skipped.

Supported adapter entry points today:

  • local_stub for deterministic offline smoke tests
  • openclaw for the OpenClaw CLI adapter
  • http://... or https://... for a custom agent server
  • codex and openai exist as scaffold stubs and are not benchmark-ready yet

See docs/integration_guide.md for the HTTP server contract, the OpenClaw CLI adapter, and a custom Python adapter option.
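To get a feel for the HTTP path, here is a toy agent server the benchmark could be pointed at. The real request/response contract is defined in docs/integration_guide.md; the `/chat` path and payload shape below are assumptions for illustration only:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubAgentHandler(BaseHTTPRequestHandler):
    """Toy agent endpoint. The payload shape here is an illustrative
    assumption, not the official contract from docs/integration_guide.md."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # A real agent would route seed turns into memory and answer probes;
        # this stub always abstains, which the judge rubric scores as 1.
        reply = {"response": "I don't have that context.",
                 "echo": payload.get("message", "")}
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep benchmark logs quiet
        pass

def serve(port: int = 8080) -> HTTPServer:
    """Bind the stub agent on localhost; call .serve_forever() to run."""
    return HTTPServer(("127.0.0.1", port), StubAgentHandler)
```

Running `serve(8080).serve_forever()` and then `uv run benchmark-run --agent http://localhost:8080` should exercise the HTTP adapter end to end, with every probe scored as an abstention.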

3. Common developer commands

make format
make lint
make test
make check
make fetch
make fetch-test

4. Validate a finished run directory

uv run python scripts/validate_submission.py outputs/<run-id>

This validates the required leaderboard artifacts and checks the official-release metadata fields.

5. Reference runtime: OpenClaw on EC2

Prerequisite: enable systemd user services (fresh instances only)

On a fresh Ubuntu EC2 instance, the systemd --user daemon is not started by default. The OpenClaw installer runs as a user-level systemd service, and its final health check will crash if the daemon isn't initialized — even though the binaries installed correctly.

Before running the OpenClaw installer, run:

sudo loginctl enable-linger $USER

Then disconnect and reconnect your SSH session so PAM generates the correct D-Bus environment variables. After reconnecting, run the OpenClaw installer as normal.

Already installed and it crashed?

If the installer failed with a systemctl is-enabled unavailable error, you don't need to wipe the server:

# Add OpenClaw to PATH
export PATH="/home/ubuntu/.npm-global/bin:$PATH"

# Repair the missing service files
openclaw doctor --repair

# Reload and start the gateway
systemctl --user daemon-reload
openclaw gateway restart

Hatching: standardize the agent identity

After installing OpenClaw, the first run opens an interactive TUI where the agent asks you to define its identity. To ensure every benchmark instance starts from the same baseline, use these answers:

| Prompt | Answer |
| --- | --- |
| Onboarding mode | QuickStart |
| Model | anthropic/claude-sonnet-4-6 |
| Channel | Skip for now |
| Configure skills? | Yes (skip all API key prompts) |
| Enable hooks? | boot-md, session-memory |
| How do you want to hatch? | Hatch in TUI |

For the canonical OpenClaw reference track used in this repo, keep this answer model fixed across compared conditions and record it in the benchmark run with --answer-model anthropic/claude-sonnet-4-6.

Once the TUI opens and the agent says "Who am I?", send these messages in order:

Message 1 — Identity:

Your name is Benchmark. You are a memory evaluation agent. Your emoji is 📊. Your vibe is neutral and precise — no personality flourishes, just clear and direct responses. Call me Operator.

Message 2 — Purpose:

You will be used to benchmark long-term memory recall. Conversations will be seeded into you, and then you'll be asked questions about them in a fresh session. Answer questions directly from what you remember. If you don't remember, say so honestly. Do not guess or hallucinate.

Message 3 — Finalize:

Update IDENTITY.md and USER.md now. Delete BOOTSTRAP.md when done. Don't modify SOUL.md.

Wait for the agent to confirm it has written the files, then exit the TUI (Ctrl+C).

Alternative: copy template files directly (skips TUI hatching)

If you prefer to skip the interactive hatching entirely, copy the benchmark workspace templates into the OpenClaw workspace:

cp engram-benchmark/workspace-templates/IDENTITY.md ~/.openclaw/workspace/IDENTITY.md
cp engram-benchmark/workspace-templates/USER.md ~/.openclaw/workspace/USER.md
rm -f ~/.openclaw/workspace/BOOTSTRAP.md

Step 1: Install dependencies

curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"
git clone https://github.com/Ubundi/engram-benchmark.git && cd engram-benchmark
uv sync --dev

Step 2: Dry run

Confirm everything is wired up before starting a real run:

uv run benchmark-run --agent local_stub --dry-run --max-tasks 3

Step 3: Start a tmux session

Benchmark runs take hours. Always run inside tmux so a disconnected SSH session doesn't kill the process:

tmux new -s benchmark

If you get disconnected, reconnect with tmux attach -t benchmark.

Step 4: Prepare separate benchmark agents or clean resets

Use a distinct OpenClaw agent or workspace per benchmark condition, or perform a verified full memory reset between runs. Reusing the same agent state across conditions will contaminate comparisons.

Suggested pattern:

  • baseline-agent-id
  • cortex-agent-id
  • mem0-agent-id
  • clawvault-agent-id
  • lossless-claw-agent-id

Step 5: Run the benchmark tracks

Reference baseline:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <baseline-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition baseline \
  --output-dir outputs/baseline

Cortex:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <cortex-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition cortex \
  --output-dir outputs/cortex

Mem0:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <mem0-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition mem0 \
  --output-dir outputs/mem0

ClawVault:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <clawvault-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition clawvault \
  --output-dir outputs/clawvault

Lossless-Claw:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <lossless-claw-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition lossless-claw \
  --flush-sessions \
  --output-dir outputs/lossless-claw

Step 6: Compare results

uv run benchmark-run \
  --agent local_stub \
  --compare outputs/baseline/<run-id> outputs/cortex/<run-id>

The JUDGE_API_KEY is an OpenAI-compatible API key used by the LLM judge (defaults to gpt-4.1-mini). Run IDs are printed at the start of each run and visible as directory names under outputs/. Offline comparisons also write a Markdown report such as comparison-baseline-vs-cortex.md next to the compared runs.
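The core of such a comparison is just per-metric deltas between two runs' `metrics.json` files. A minimal sketch, assuming flat `{metric_name: score}` dicts (the real `metrics.json` layout is whatever `benchmark/reports/` writes, so adapt the key access):

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-metric deltas between two runs (candidate minus baseline).

    Assumes flat {metric_name: float} dicts; nested per-category
    structures in the real metrics.json would need flattening first.
    """
    shared = baseline.keys() & candidate.keys()  # only metrics both runs report
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}

print(metric_deltas({"overall": 1.10, "temporal": 0.62},
                    {"overall": 1.95, "temporal": 0.67}))
```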

Useful flags:

| Flag | Default | Description |
| --- | --- | --- |
| `--condition NAME` | — | Condition label (baseline, mem0, clawvault, lossless-claw, cortex). Records the evaluated memory configuration and enables any condition-specific benchmark behavior |
| `--agent-id ID` | — | OpenClaw agent ID (passed to `openclaw agent --agent`) |
| `--answer-model MODEL` | — | Evaluated model used to generate answers; keep fixed across controlled comparisons |
| `--settle-seconds N` | auto | Wait between seed and probe (cortex=180s, mem0=60s, lossless-claw=30s, baseline/clawvault=10s, other=120s) |
| `--judge-passes N` | 3 | LLM judge passes per response (scores averaged) |
| `--judge-concurrency N` | 4 | Parallel judge workers |
| `--flush-sessions` | — | Send `/new` after each seed session to trigger memory hooks |
| `--skip-seed` | — | Skip seeding; probe a pre-seeded agent only |
| `--max-tasks N` | — | Run a subset of N tasks |
| `--judge-model` | gpt-4.1-mini | Judge model name |
| `--openclaw-timeout N` | 120 | Timeout in seconds for `openclaw agent` CLI calls |
| `--compare DIR_A DIR_B` | — | Compare two run directories offline and write a Markdown comparison report (still requires `--agent` because of CLI parsing) |

Evaluation Protocol

Engram uses a four-phase pipeline:

Seed  →  Settle  →  Probe  →  Judge
  1. Seed — Replay haystack sessions into the agent turn-by-turn via the agent runtime
  2. Settle — Wait for memory indexing and async processing to complete (cortex: 180s, baseline: 10s)
  3. Probe — Ask evaluation questions in a fresh session with no haystack in context
  4. Judge — Score responses 0–3 against ground truth using a multi-pass LLM judge
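The four phases above can be sketched as a single driver loop. Everything agent- and judge-facing here is a stand-in: `send`, `reset`, and `score` are assumed method names for illustration, not the benchmark's real adapter interface, and the `answer` field is a hypothetical ground-truth key:

```python
import time

def run_task(agent, task, judge, settle_s: float = 10) -> list[float]:
    """One pass of the Seed -> Settle -> Probe -> Judge pipeline.

    `agent` and `judge` are illustrative stand-ins; their method names
    are assumptions, not the benchmark's actual adapter contract.
    """
    for session in task["haystack_sessions"]:            # 1. Seed
        for turn in session["turns"]:
            agent.send(turn["content"])
        agent.reset()                                    # session boundary
    time.sleep(settle_s)                                 # 2. Settle
    answer = agent.send(task["question"])                # 3. Probe, fresh session
    return [judge.score(answer, task["answer"])          # 4. Judge, multi-pass
            for _ in range(3)]
```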

Scoring rubric:

| Score | Label | Description |
| --- | --- | --- |
| 3 | Grounded correct | Cites the specific detail from the haystack |
| 2 | Generic correct | Right direction, missing the specific |
| 1 | Abstained | Honest "I don't have that context" |
| 0 | Hallucinated | Wrong specific stated with confidence |

See docs/evaluation_protocol.md for full protocol specification.
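The headline metrics follow directly from this rubric: the primary metric is the mean score, and the secondary rates count how often each outcome occurred. A simplified sketch (per-pass averaging and per-category grouping omitted; with averaged passes real scores can be fractional, so the equality checks here only cover integer-scored runs):

```python
def aggregate(scores: list[float]) -> dict:
    """Primary metric (mean 0-3 judge score) plus rate-style secondary
    metrics derived from the rubric: 3=grounded, 1=abstained, 0=hallucinated.
    Simplified: assumes integer scores; averaged judge passes would need
    thresholding instead of exact equality."""
    n = len(scores)
    return {
        "mean_score": sum(scores) / n,
        "grounded_rate": sum(s == 3 for s in scores) / n,
        "abstention_rate": sum(s == 1 for s in scores) / n,
        "hallucination_rate": sum(s == 0 for s in scores) / n,
    }
```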


Reference Results

The table below is a reference example showing how Engram reports results for one evaluated runtime family. It is not the definition of the benchmark, and it should not be read as the only intended use of Engram.

Reference run results on a live OpenClaw agent, reported on March 4, 2026. Scores are on a 0-3 scale.

| Condition | Overall | Rationale | Synthesis | Evolution | Temporal | Grounded | Abstained |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (native memory only) | 1.10 | 1.93 | 0.54 | 1.07 | 0.62 | 4% | 64% |
| Memory-augmented | 1.95 | 3.00 | 2.00 | 2.10 | 0.67 | 48% | 12% |
| Δ | +0.85 | +1.07 | +1.46 | +1.03 | +0.05 | +44pp | −52pp |

Key findings:

  • Rationale recall reaches 3.00 — the reasoning behind decisions is fully preserved with memory augmentation
  • Synthesis (facts spanning multiple sessions) improves from near-impossible (0.54) to reliable (2.00)
  • Temporal reasoning (+0.05) is the hardest category — semantic retrieval surfaces historical and current facts without reliable recency ranking
  • Memory value compounds across runs: a second seeding pass raised overall score from 1.81 to 1.95

Future benchmark reports should include multiple systems or conditions under the same pinned settings. See docs/leaderboard.md for the submission and governance policy. If the evaluated answer model changes, treat that as a separate row or track rather than a controlled delta.


Outputs

Each run produces outputs/<run_id>/ containing:

| File | Contents |
| --- | --- |
| `predictions.jsonl` | Per-task agent responses |
| `metrics.json` | Aggregate and per-category scores |
| `run_metadata.json` | Full run configuration, answer model, git commit, provenance |
| `seed_turns.jsonl` | Seeded conversation turns with latency |
| `probes.jsonl` | Probe session transcripts with latency |
| `judgments.jsonl` | Per-response judge scores, rationale, and pass scores |
| `report.md` | Human-readable Markdown report with full per-probe detail |

Offline comparisons write comparison-<condition-a>-vs-<condition-b>.md next to the compared run directories rather than inside an individual run folder.


Repository Structure

engram-benchmark/
β”œβ”€β”€ benchmark/           CLI, adapters, task loading, judging, reports
β”‚   β”œβ”€β”€ tasks/           Split loading, HuggingFace fetch, schema helpers
β”‚   β”œβ”€β”€ adapters/        Agent adapters: local_stub, http, openclaw, openai, codex
β”‚   β”œβ”€β”€ evaluators/      QA, retrieval, and abstention metrics
β”‚   └── reports/         Run artifact writers and Markdown reports
β”œβ”€β”€ data/                Schemas, notes, and CI-safe sample splits
β”œβ”€β”€ docs/                Benchmark spec, protocol, system matrix, integration guide
β”œβ”€β”€ leaderboard/         Submission format and leaderboard policy
β”œβ”€β”€ outputs/             Run artifacts (gitignored)
β”œβ”€β”€ scripts/             Validation and helper entry points
└── tests/               Import, CLI, adapter, and schema tests

Citation

@software{engram2026,
  title   = {Engram: A Runtime Benchmark for Agent Long-Term Memory Recall},
  author  = {Ubundi},
  year    = {2026},
  url     = {https://github.com/Ubundi/engram-benchmark},
}

License

MIT


Built by Ubundi

Ubundi

Engram is an open-source project by Ubundi — a South African venture studio shaping human-centred AI. Based in Cape Town, Ubundi builds at the intersection of AI capability and African context.

Engram grew out of a need to rigorously measure what memory systems actually retain. Existing benchmarks often emphasize in-context recall; Engram is built to test what survives after the context window is gone. The result is a reproducible runtime evaluation intended for internal benchmarking, public comparison, and eventual community adoption.

ubundi.com
