
Ubundi/engram-benchmark

Engram — Measuring what AI agents remember

License: MIT · Python 3.10+ · CI · Dataset on HF

Dataset · Benchmark Spec · Official Release · Evaluation Protocol · Related Work · Dataset Card · Leaderboard Policy · Integration Guide


Engram is a runtime benchmark for evaluating long-term memory in AI agents. It measures memory behavior after the original context window is gone: whether an agent can recover grounded, specific knowledge from prior sessions, fall back to only partial recollection, abstain safely, or hallucinate.

Unlike static QA or retrieval-only tests, Engram runs inside the agent runtime itself: it seeds multi-turn conversation histories, waits for memory processing to settle, then probes recall in a fresh session with no haystack in context. Whatever memory architecture the agent actually uses is what gets measured.

Engram is intended to be benchmark-first and system-neutral. The benchmark defines the task format, runtime protocol, scoring rubric, and artifact requirements; systems such as OpenClaw, OpenClaw memory-plugin variants, or any third-party agent are evaluated against the same procedure.

Evaluated Memory Systems

The benchmark is system-neutral, but this repository currently focuses on a practical OpenClaw evaluation lineup built around one baseline runtime and four memory-system integrations:

| Benchmark track | Runtime setup | Upstream project | Notes |
| --- | --- | --- | --- |
| `baseline` | OpenClaw reference agent with the standard benchmark workspace and no additional benchmark-specific memory augmentation | OpenClaw | Use this as the no-augmentation control row. In practice, teams often keep only the standard OpenClaw memory-core or session-memory path enabled here. |
| `mem0` | OpenClaw with the Mem0-backed memory plugin enabled | serenichron/openclaw-memory-mem0 | Mem0-backed semantic memory via a self-hosted Mem0 REST API. |
| `clawvault` | OpenClaw with ClawVault installed and wired into the runtime | Versatly/clawvault | Structured, local-first memory with markdown storage, graph-aware retrieval, and session lifecycle primitives. |
| `lossless-claw` | OpenClaw with the Lossless-Claw context engine enabled | @martian-engineering/lossless-claw | DAG-based lossless context compaction and expansion under the same OpenClaw runtime family. |
| `cortex` | OpenClaw with the Cortex memory plugin enabled | Ubundi/openclaw-cortex | Knowledge-graph-oriented long-term memory with auto-recall, auto-capture, and direct memory tools. |

Important: Engram does not install, enable, or configure these memory systems for you. The benchmark measures whatever runtime state your agent already exposes. The --condition flag labels the evaluated configuration and enables benchmark-side behavior where implemented, such as Cortex preflight/date handling and condition-aware settle defaults.

What Engram Measures

Engram is designed to answer three benchmark questions:

  • Can an agent retrieve grounded project details from prior sessions?
  • Can it preserve rationale, evolution, and cross-session synthesis rather than only isolated facts?
  • How does it trade off grounded recall, abstention, and hallucination under a fixed runtime protocol?

Engram is not trying to reduce memory evaluation to binary QA accuracy. The benchmark intentionally distinguishes four outcomes: grounded recall, generic but underspecified recall, abstention, and hallucinated specificity.

Official benchmark artifacts for the current public release:

| Artifact | Value |
| --- | --- |
| Benchmark release | Engram v3.0 (`engram-v3.0`) |
| Tasks | 498 |
| Question types | 9 |
| Primary metric | Mean memory-quality judge score (0–3) |
| Secondary metrics | Grounded rate, hallucination rate, abstention rate, per-category scores |
| Official protocol | Seed → Settle → Probe → Judge (`engram-runtime-v1`) |

Official Benchmark Setting

Engram v3.0 has a frozen official public setting for benchmark-comparable runs:

  • Split: v3
  • Evaluated answer model: must be disclosed and recorded for every run
  • Judge model: gpt-4.1-mini
  • Judge passes: 3
  • Judge temperature: 0.3
  • Required artifacts: metrics.json, run_metadata.json, predictions.jsonl, seed_turns.jsonl, probes.jsonl, judgments.jsonl

See docs/benchmark_release_v3.md for the full release policy, including condition-specific settle defaults and reporting requirements.
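Before submitting, it is worth confirming that all six required artifacts are present. A minimal sketch of that check (the authoritative validator is `scripts/validate_submission.py`; this only tests file presence, not the metadata fields):

```python
from pathlib import Path

# Required artifacts for an official Engram v3.0 run (from the release setting above).
REQUIRED_ARTIFACTS = [
    "metrics.json",
    "run_metadata.json",
    "predictions.jsonl",
    "seed_turns.jsonl",
    "probes.jsonl",
    "judgments.jsonl",
]

def missing_artifacts(run_dir: str) -> list[str]:
    """Return the required artifact filenames absent from a run directory."""
    root = Path(run_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).exists()]
```

Unlike the official validator, this does not inspect file contents, so treat an empty result as necessary but not sufficient for a valid submission.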


Task Categories

Engram task category examples

Engram v3 contains 498 tasks spanning 9 question types, targeting failure modes that commonly stress long-term agent memory systems:

| Category | Count | What it tests |
| --- | --- | --- |
| multi-session | 79 | Facts requiring information from multiple separate conversations |
| temporal-reasoning | 78 | Ordering and recency — distinguishing current from historical facts |
| cross-agent-memory | 71 | Knowledge shared or referenced across different agent instances |
| multi-hop-reasoning | 68 | Connecting facts via intermediate entities across the session corpus |
| recurring-pattern | 54 | Conventions and patterns established repeatedly across sessions |
| knowledge-update | 53 | Tracking how facts evolved — decisions reversed or revised over time |
| single-session-user | 45 | Direct recall of specifics stated by the user in a single session |
| single-session-assistant | 32 | Recall of specifics stated by the assistant in a single session |
| fact-recall | 18 | Direct retrieval of a single grounded specific fact |

Dataset

The Engram v3 dataset is hosted on HuggingFace and fetched automatically on first run.

from benchmark.tasks.hf import fetch_engram_dataset
path = fetch_engram_dataset()  # downloads and caches locally
| Property | Value |
| --- | --- |
| Tasks | 498 |
| Avg haystack sessions per task | 3.0 |
| Avg haystack turns per task | 30.1 |
| Question types | 9 |
| Format | JSON |
| HuggingFace | matthewschramm/engram-v3 |

The dataset is public — no authentication required.
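Once fetched, each task is a JSON record. A self-contained sketch of working with one; note the field names below (`task_id`, `question_type`, `haystack_sessions`, and so on) are illustrative assumptions, not the official schema, which lives with the dataset card and the `data/` schema files:

```python
import json

# Illustrative record only: these field names are assumptions, not the
# official Engram task schema (check the dataset card / data/ schemas).
sample_task = json.loads("""
{
  "task_id": "demo-001",
  "question_type": "fact-recall",
  "question": "Which database did the user choose for the billing service?",
  "haystack_sessions": [
    {"turns": [
      {"role": "user", "content": "Let's use Postgres for billing."},
      {"role": "assistant", "content": "Noted: Postgres for the billing service."}
    ]}
  ]
}
""")

# Count how much haystack material a task would seed into the agent.
n_turns = sum(len(s["turns"]) for s in sample_task["haystack_sessions"])
print(sample_task["question_type"], n_turns)  # fact-recall 2
```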


Installation

Prerequisites

  • Python 3.10+
  • uv for environment management and repeatable installs
  • openclaw on PATH only if you plan to use --agent openclaw

Recommended setup

curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"
uv sync --dev

This creates a local virtualenv and installs the package plus development dependencies. After that, run commands with uv run .... The package also exposes a console entry point named benchmark-run.

Alternative editable install

If you prefer an activated virtualenv workflow:

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Quickstart

1. Dry run (full pipeline smoke test, no external agent required)

uv run benchmark-run --agent local_stub --dry-run --max-tasks 5

Useful local helpers:

  • make run runs a local-stub benchmark into outputs/
  • scripts/run_dev.sh runs the lightweight dev split
  • scripts/run_test.sh runs a tiny test-split smoke test

2. Run against a live agent

Start your agent server or CLI-backed runtime, then point the benchmark at it:

JUDGE_API_KEY="<key>" uv run benchmark-run --agent http://localhost:8080

Engram seeds memory sessions into the agent, waits for memory processing to settle, probes recall in a fresh session, and judges responses with an LLM. If JUDGE_API_KEY is omitted, the run still executes but LLM judging is skipped.

Supported adapter entry points today:

  • local_stub for deterministic offline smoke tests
  • openclaw for the OpenClaw CLI adapter
  • http://... or https://... for a custom agent server
  • codex and openai exist as scaffold stubs and are not benchmark-ready yet

See docs/integration_guide.md for the HTTP server contract, the OpenClaw CLI adapter, and a custom Python adapter option.
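To get a feel for the HTTP path, here is a toy agent server the benchmark could be pointed at. The real request/response contract is defined in docs/integration_guide.md; the `/chat` path and payload shape below are assumptions for illustration only:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubAgentHandler(BaseHTTPRequestHandler):
    """Toy agent endpoint. The payload shape here is an illustrative
    assumption, not the official contract from docs/integration_guide.md."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # A real agent would route seed turns into memory and answer probes;
        # this stub always abstains, which the judge rubric scores as 1.
        reply = {"response": "I don't have that context.",
                 "echo": payload.get("message", "")}
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep benchmark logs quiet
        pass

def serve(port: int = 8080) -> HTTPServer:
    """Bind the stub agent on localhost; call .serve_forever() to run."""
    return HTTPServer(("127.0.0.1", port), StubAgentHandler)
```

Running `serve(8080).serve_forever()` and then `uv run benchmark-run --agent http://localhost:8080` should exercise the HTTP adapter end to end, with every probe scored as an abstention.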

3. Common developer commands

make format
make lint
make test
make check
make fetch
make fetch-test

4. Validate a finished run directory

uv run python scripts/validate_submission.py outputs/<run-id>

This validates the required leaderboard artifacts and checks the official-release metadata fields.

5. Reference runtime: OpenClaw on EC2

Prerequisite: enable systemd user services (fresh instances only)

On a fresh Ubuntu EC2 instance, the systemd --user daemon is not started by default. The OpenClaw installer runs as a user-level systemd service, and its final health check will crash if the daemon isn't initialized — even though the binaries installed correctly.

Before running the OpenClaw installer, run:

sudo loginctl enable-linger $USER

Then disconnect and reconnect your SSH session so PAM generates the correct D-Bus environment variables. After reconnecting, run the OpenClaw installer as normal.

Already installed and it crashed?

If the installer failed with a systemctl is-enabled unavailable error, you don't need to wipe the server:

# Add OpenClaw to PATH
export PATH="/home/ubuntu/.npm-global/bin:$PATH"

# Repair the missing service files
openclaw doctor --repair

# Reload and start the gateway
systemctl --user daemon-reload
openclaw gateway restart

Hatching: standardize the agent identity

After installing OpenClaw, the first run opens an interactive TUI where the agent asks you to define its identity. To ensure every benchmark instance starts from the same baseline, use these answers:

| Prompt | Answer |
| --- | --- |
| Onboarding mode | QuickStart |
| Model | anthropic/claude-sonnet-4-6 |
| Channel | Skip for now |
| Configure skills? | Yes (skip all API key prompts) |
| Enable hooks? | boot-md, session-memory |
| How do you want to hatch? | Hatch in TUI |

For the canonical OpenClaw reference track used in this repo, keep this answer model fixed across compared conditions and record it in the benchmark run with --answer-model anthropic/claude-sonnet-4-6.

Once the TUI opens and the agent says "Who am I?", send these messages in order:

Message 1 — Identity:

Your name is Benchmark. You are a memory evaluation agent. Your emoji is 📊. Your vibe is neutral and precise — no personality flourishes, just clear and direct responses. Call me Operator.

Message 2 — Purpose:

You will be used to benchmark long-term memory recall. Conversations will be seeded into you, and then you'll be asked questions about them in a fresh session. Answer questions directly from what you remember. If you don't remember, say so honestly. Do not guess or hallucinate.

Message 3 — Finalize:

Update IDENTITY.md and USER.md now. Delete BOOTSTRAP.md when done. Don't modify SOUL.md.

Wait for the agent to confirm it has written the files, then exit the TUI (Ctrl+C).

Alternative: copy template files directly (skips TUI hatching)

If you prefer to skip the interactive hatching entirely, copy the benchmark workspace templates into the OpenClaw workspace:

cp engram-benchmark/workspace-templates/IDENTITY.md ~/.openclaw/workspace/IDENTITY.md
cp engram-benchmark/workspace-templates/USER.md ~/.openclaw/workspace/USER.md
rm -f ~/.openclaw/workspace/BOOTSTRAP.md

Step 1: Install dependencies

curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"
git clone https://github.com/Ubundi/engram-benchmark.git && cd engram-benchmark
uv sync --dev

Step 2: Dry run

Confirm everything is wired up before starting a real run:

uv run benchmark-run --agent local_stub --dry-run --max-tasks 3

Step 3: Start a tmux session

Benchmark runs take hours. Always run inside tmux so a disconnected SSH session doesn't kill the process:

tmux new -s benchmark

If you get disconnected, reconnect with tmux attach -t benchmark.

Step 4: Prepare separate benchmark agents or clean resets

Use a distinct OpenClaw agent or workspace per benchmark condition, or perform a verified full memory reset between runs. Reusing the same agent state across conditions will contaminate comparisons.

Suggested pattern:

  • baseline-agent-id
  • cortex-agent-id
  • mem0-agent-id
  • clawvault-agent-id
  • lossless-claw-agent-id

Step 5: Run the benchmark tracks

Reference baseline:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <baseline-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition baseline \
  --output-dir outputs/baseline

Cortex:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <cortex-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition cortex \
  --output-dir outputs/cortex

Mem0:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <mem0-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition mem0 \
  --output-dir outputs/mem0

ClawVault:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <clawvault-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition clawvault \
  --output-dir outputs/clawvault

Lossless-Claw:

JUDGE_API_KEY="<your-openai-key>" uv run benchmark-run \
  --agent openclaw \
  --agent-id <lossless-claw-agent-id> \
  --answer-model anthropic/claude-sonnet-4-6 \
  --condition lossless-claw \
  --flush-sessions \
  --output-dir outputs/lossless-claw

Step 6: Compare results

uv run benchmark-run \
  --agent local_stub \
  --compare outputs/baseline/<run-id> outputs/cortex/<run-id>

The JUDGE_API_KEY is an OpenAI-compatible API key used by the LLM judge (defaults to gpt-4.1-mini). Run IDs are printed at the start of each run and visible as directory names under outputs/. Offline comparisons also write a Markdown report such as comparison-baseline-vs-cortex.md next to the compared runs.
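The core of such a comparison is just per-metric deltas between two runs' `metrics.json` files. A minimal sketch, assuming flat `{metric_name: score}` dicts (the real `metrics.json` layout is whatever `benchmark/reports/` writes, so adapt the key access):

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-metric deltas between two runs (candidate minus baseline).

    Assumes flat {metric_name: float} dicts; nested per-category
    structures in the real metrics.json would need flattening first.
    """
    shared = baseline.keys() & candidate.keys()  # only metrics both runs report
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}

print(metric_deltas({"overall": 1.10, "temporal": 0.62},
                    {"overall": 1.95, "temporal": 0.67}))
```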

Useful flags:

| Flag | Default | Description |
| --- | --- | --- |
| `--condition NAME` | — | Condition label (baseline, mem0, clawvault, lossless-claw, cortex). Records the evaluated memory configuration and enables any condition-specific benchmark behavior |
| `--agent-id ID` | — | OpenClaw agent ID (passed to `openclaw agent --agent`) |
| `--answer-model MODEL` | — | Evaluated model used to generate answers; keep fixed across controlled comparisons |
| `--settle-seconds N` | auto | Wait between seed and probe (cortex=180s, mem0=60s, lossless-claw=30s, baseline/clawvault=10s, other=120s) |
| `--judge-passes N` | 3 | LLM judge passes per response (scores averaged) |
| `--judge-concurrency N` | 4 | Parallel judge workers |
| `--flush-sessions` | — | Send `/new` after each seed session to trigger memory hooks |
| `--skip-seed` | — | Skip seeding; probe a pre-seeded agent only |
| `--max-tasks N` | — | Run a subset of N tasks |
| `--judge-model` | gpt-4.1-mini | Judge model name |
| `--openclaw-timeout N` | 120 | Timeout in seconds for `openclaw agent` CLI calls |
| `--compare DIR_A DIR_B` | — | Compare two run directories offline and write a Markdown comparison report (still requires `--agent` because of CLI parsing) |

Evaluation Protocol

Engram uses a four-phase pipeline:

Seed  →  Settle  →  Probe  →  Judge
  1. Seed — Replay haystack sessions into the agent turn-by-turn via the agent runtime
  2. Settle — Wait for memory indexing and async processing to complete (cortex: 180s, baseline: 10s)
  3. Probe — Ask evaluation questions in a fresh session with no haystack in context
  4. Judge — Score responses 0–3 against ground truth using a multi-pass LLM judge
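The four phases above can be sketched as a single driver loop. Everything agent- and judge-facing here is a stand-in: `send`, `reset`, and `score` are assumed method names for illustration, not the benchmark's real adapter interface, and the `answer` field is a hypothetical ground-truth key:

```python
import time

def run_task(agent, task, judge, settle_s: float = 10) -> list[float]:
    """One pass of the Seed -> Settle -> Probe -> Judge pipeline.

    `agent` and `judge` are illustrative stand-ins; their method names
    are assumptions, not the benchmark's actual adapter contract.
    """
    for session in task["haystack_sessions"]:            # 1. Seed
        for turn in session["turns"]:
            agent.send(turn["content"])
        agent.reset()                                    # session boundary
    time.sleep(settle_s)                                 # 2. Settle
    answer = agent.send(task["question"])                # 3. Probe, fresh session
    return [judge.score(answer, task["answer"])          # 4. Judge, multi-pass
            for _ in range(3)]
```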

Scoring rubric:

| Score | Label | Description |
| --- | --- | --- |
| 3 | Grounded correct | Cites the specific detail from the haystack |
| 2 | Generic correct | Right direction, missing the specific |
| 1 | Abstained | Honest "I don't have that context" |
| 0 | Hallucinated | Wrong specific stated with confidence |

See docs/evaluation_protocol.md for full protocol specification.
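The headline metrics follow directly from this rubric: the primary metric is the mean score, and the secondary rates count how often each outcome occurred. A simplified sketch (per-pass averaging and per-category grouping omitted; with averaged passes real scores can be fractional, so the equality checks here only cover integer-scored runs):

```python
def aggregate(scores: list[float]) -> dict:
    """Primary metric (mean 0-3 judge score) plus rate-style secondary
    metrics derived from the rubric: 3=grounded, 1=abstained, 0=hallucinated.
    Simplified: assumes integer scores; averaged judge passes would need
    thresholding instead of exact equality."""
    n = len(scores)
    return {
        "mean_score": sum(scores) / n,
        "grounded_rate": sum(s == 3 for s in scores) / n,
        "abstention_rate": sum(s == 1 for s in scores) / n,
        "hallucination_rate": sum(s == 0 for s in scores) / n,
    }
```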


Reference Results

The table below is a reference example showing how Engram reports results for one evaluated runtime family. It is not the definition of the benchmark, and it should not be read as the only intended use of Engram.

Reference run results on a live OpenClaw agent, reported on March 4, 2026. Scores are on a 0-3 scale.

| Condition | Overall | Rationale | Synthesis | Evolution | Temporal | Grounded | Abstained |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (native memory only) | 1.10 | 1.93 | 0.54 | 1.07 | 0.62 | 4% | 64% |
| Memory-augmented | 1.95 | 3.00 | 2.00 | 2.10 | 0.67 | 48% | 12% |
| Δ | +0.85 | +1.07 | +1.46 | +1.03 | +0.05 | +44pp | −52pp |

Key findings:

  • Rationale recall reaches 3.00 — the reasoning behind decisions is fully preserved with memory augmentation
  • Synthesis (facts spanning multiple sessions) improves from near-impossible (0.54) to reliable (2.00)
  • Temporal reasoning (+0.05) is the hardest category — semantic retrieval surfaces historical and current facts without reliable recency ranking
  • Memory value compounds across runs: a second seeding pass raised overall score from 1.81 to 1.95

Future benchmark reports should include multiple systems or conditions under the same pinned settings. See docs/leaderboard.md for the submission and governance policy. If the evaluated answer model changes, treat that as a separate row or track rather than a controlled delta.


Outputs

Each run produces outputs/<run_id>/ containing:

| File | Contents |
| --- | --- |
| `predictions.jsonl` | Per-task agent responses |
| `metrics.json` | Aggregate and per-category scores |
| `run_metadata.json` | Full run configuration, answer model, git commit, provenance |
| `seed_turns.jsonl` | Seeded conversation turns with latency |
| `probes.jsonl` | Probe session transcripts with latency |
| `judgments.jsonl` | Per-response judge scores, rationale, and pass scores |
| `report.md` | Human-readable Markdown report with full per-probe detail |

Offline comparisons write comparison-<condition-a>-vs-<condition-b>.md next to the compared run directories rather than inside an individual run folder.


Repository Structure

engram-benchmark/
β”œβ”€β”€ benchmark/           CLI, adapters, task loading, judging, reports
β”‚   β”œβ”€β”€ tasks/           Split loading, HuggingFace fetch, schema helpers
β”‚   β”œβ”€β”€ adapters/        Agent adapters: local_stub, http, openclaw, openai, codex
β”‚   β”œβ”€β”€ evaluators/      QA, retrieval, and abstention metrics
β”‚   └── reports/         Run artifact writers and Markdown reports
β”œβ”€β”€ data/                Schemas, notes, and CI-safe sample splits
β”œβ”€β”€ docs/                Benchmark spec, protocol, system matrix, integration guide
β”œβ”€β”€ leaderboard/         Submission format and leaderboard policy
β”œβ”€β”€ outputs/             Run artifacts (gitignored)
β”œβ”€β”€ scripts/             Validation and helper entry points
└── tests/               Import, CLI, adapter, and schema tests

Citation

@software{engram2026,
  title   = {Engram: A Runtime Benchmark for Agent Long-Term Memory Recall},
  author  = {Ubundi},
  year    = {2026},
  url     = {https://github.com/Ubundi/engram-benchmark},
}

License

MIT


Built by Ubundi

Ubundi

Engram is an open-source project by Ubundi — a South African venture studio shaping human-centred AI. Based in Cape Town, Ubundi builds at the intersection of AI capability and African context.

Engram grew out of a need to rigorously measure what memory systems actually retain. Existing benchmarks often emphasize in-context recall; Engram is built to test what survives after the context window is gone. The result is a reproducible runtime evaluation intended for internal benchmarking, public comparison, and eventual community adoption.

ubundi.com
