
English | 中文

ATM-Bench: Long-Term Personalized Referential Memory QA


Official code for ATM-Bench: a benchmark for long-term multimodal personalized AI memory QA and retrieval.

ATM-Bench is the first benchmark for multimodal, multi-source personalized referential memory QA over long time horizons (~4 years) with evidence-grounded retrieval and answering.

Paper: According to Me: Long-Term Personalized Referential Memory QA
Project Page: https://atmbench.github.io/


πŸ—“οΈ Timeline

  • 2026-03-03: arXiv paper release (2603.01990)
  • 2026-03-04: Initial codebase release, including baseline implementations for MMRAG, Oracle, NIAH, and four ported third-party baselines (A-Mem, HippoRAG2, mem0, MemoryOS).
  • 2026-03-12: Initial General-Purpose Agent benchmark results release for Claude Code, Codex, and OpenCode.
  • 2026-03-12: ATM-Bench data release on Hugging Face (ATM-Bench).
  • 2026-03-13: Fixed OpenCode token accounting and updated OpenClaw results.
  • Coming soon: General-Purpose Agents benchmarking support, including OpenClaw.

🤖 General-Purpose Agent Results

Initial General-Purpose Agent results on ATM-Bench-Hard are summarized below. The QS score here uses gpt-5-mini as the primary judge. Tokens/QS shows the token cost per percentage point of QS, so lower is more efficient.

| Agent | Model | QS | Total Tokens | Tokens/QS |
|---|---|---|---|---|
| Claude Code | Claude Opus 4.6 | 33.80% | 4.93M | 0.146M |
| Codex | GPT-5.2 | 39.70% | 15.46M | 0.389M |
| Codex | GPT-5.4* | 29.60% | 14.29M | 0.483M |
| OpenCode | GLM-5 | 27.00% | 16.89M | 0.626M |
| OpenCode | Qwen3.5-397B-A17B | 24.50% | 12.06M | 0.492M |
| OpenCode | Kimi K2.5 | 30.30% | 8.46M | 0.279M |
| OpenCode | MiniMax M2.5 | 22.90% | 14.5M | 0.633M |
| OpenCode | MiniMax M2.7 | 27.80% | 13.48M | 0.485M |
| OpenClaw 🦞 | Kimi K2.5 | 25.40% | 9.63M | 0.379M |

\*GPT-5.4 results may be unreliable because the Codex service was unstable during evaluation.

The coding agents still struggle on ATM-Bench-Hard, although they perform much better than various agentic memory baselines.
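The Tokens/QS column can be reproduced directly from the other two columns; a minimal sketch of the computation:

```python
# Tokens/QS as used in the table above: total tokens (in millions)
# divided by QS (in percentage points), so lower means cheaper per
# point of answer quality.

def tokens_per_qs(total_tokens_millions: float, qs_percent: float) -> float:
    """Token cost (in millions) per percentage point of QS."""
    return total_tokens_millions / qs_percent

# Example row (Claude Code / Claude Opus 4.6): 4.93M tokens at QS 33.80%
print(f"{tokens_per_qs(4.93, 33.80):.3f}M")  # prints 0.146M, matching the table
```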

📊 Oracle and NIAH Results

Oracle on ATM-Bench-Hard

QS is reported with gpt-5-mini as the primary judge.

| Model | Setting | QS |
|---|---|---|
| GPT-5 | Raw | 72.12% |
| Qwen3-VL-8B-Instruct | Raw | 40.14% |
| Qwen3-VL-8B-Instruct | SGM | 27.98% |
| Qwen3-VL-8B-Instruct | D | 21.69% |

NIAH on ATM-Bench-Hard

For NIAH, we compare the Qwen3-VL-8B-Instruct SGM and Raw settings at different haystack sizes.

| Model | Setting | QS | Avg. Context Tokens |
|---|---|---|---|
| Qwen3-VL-8B-Instruct | Raw, Oracle | 40.14% | 5.7k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-25 | 25.43% | 15.9k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-50 | 24.87% | 29.0k |
| Qwen3-VL-8B-Instruct | Raw, NIAH-100 | 10.90% | 56.0k |
| Qwen3-VL-8B-Instruct | SGM, Oracle | 27.98% | 4.6k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-25 | 16.33% | 12.5k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-50 | 15.77% | 23.9k |
| Qwen3-VL-8B-Instruct | SGM, NIAH-100 | 12.66% | 45.8k |

📋 Overview

Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. ATM-Bench addresses this gap with:

  • πŸ–ΌοΈ Multimodal and multi-source data: Images, videos, emails
  • πŸ“… Long-term horizon: ~4 years of personal memory
  • 🎯 Referential queries: Resolving personalized references (e.g., "Show me the moments where Grace was trying to be sneaky...")
  • πŸ” Evidence-grounded: Human-annotated QA pairs with ground-truth memory evidence
  • 🧩 Multi-evidence reasoning: Queries requiring evidence from multiple sources
  • ⚑ Conflicting evidence: Handling contradictory information

ATM-Bench Overview

Memory Ingestion

Memory Ingestion is decomposed into:

  1. Memory preprocessing (how each memory item is represented)
  2. Memory organization (how items are structured/linked)

ATM Method

Memory Preprocessing

We compare two preprocessing representations:

  • Descriptive Memory (DM): each memory item is represented as one natural-language description.
  • Schema-Guided Memory (SGM): each memory item is represented with fixed text-based key-value fields under a schema.

In SGM, schema fields are modality-aware. For example:

  • Image/Video memory: time, location, entities, ocr, tags
  • Email memory: time, summary, body

DM and SGM contain the same underlying information but use different formats.

In this codebase, DM is implemented as caption/description-style text, while SGM is implemented as schema-based key-value text fields.
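As a purely illustrative sketch of the two formats (the field values below are invented, and the codebase's actual serialization may differ), the same image memory could be rendered as:

```python
# Schema-Guided Memory (SGM): fixed key-value fields, following the
# modality-aware schema listed above for image/video memory.
# All values here are hypothetical.
sgm_item = {
    "time": "2024-07-14 18:32",
    "location": "Hyde Park, London",
    "entities": ["Grace", "picnic blanket", "dog"],
    "ocr": "",
    "tags": ["outdoors", "evening", "friends"],
}

# Descriptive Memory (DM): the same underlying information expressed
# as one free-form natural-language description.
dm_item = (
    "On the evening of 14 July 2024 at Hyde Park in London, "
    "Grace sits on a picnic blanket with a dog nearby."
)

def sgm_to_text(item: dict) -> str:
    """Flatten SGM key-value fields into a single text record."""
    return "\n".join(f"{k}: {v}" for k, v in item.items())

print(sgm_to_text(sgm_item))
```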

Memory Organization

For organization of the memory store:

  • Piled Memory: items are stored without explicit links.
  • Linked Memory: items are linked with inferred relations (graph structure); agentic systems can additionally update existing items during organization.
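A minimal sketch of the distinction between the two modes, using illustrative data structures rather than the repo's actual types:

```python
from collections import defaultdict

# Piled Memory: a flat collection, no explicit links between items
piled = ["mem_001", "mem_002", "mem_003"]

# Linked Memory: the same items plus inferred relations forming a graph
links = defaultdict(list)

def add_link(src: str, dst: str, relation: str) -> None:
    """Record an inferred relation between two memory items."""
    links[src].append((dst, relation))

add_link("mem_001", "mem_002", "same_event")   # relation labels are hypothetical
add_link("mem_002", "mem_003", "same_person")

# Outgoing edges of mem_002 in the graph
print(links["mem_002"])  # [('mem_003', 'same_person')]
```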

NIAH Evaluation Setup

In addition to end-to-end retrieval + generation evaluation, we provide NIAH (Needle In A Haystack):

  • Each question is paired with a fixed evidence pool (niah_evidence_ids) that contains all ground-truth items.
  • The rest of the pool is filled with realistic distractors.
  • This isolates answer generation/reasoning quality from retrieval quality.
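A hedged sketch of how such a pool could be assembled from the pieces described above (the function and sampling details are illustrative, not the repo's implementation):

```python
import random

def build_niah_pool(evidence_ids, all_ids, pool_size, seed=0):
    """Return a shuffled pool containing all ground-truth evidence plus distractors."""
    rng = random.Random(seed)
    evidence = set(evidence_ids)
    distractors = [i for i in all_ids if i not in evidence]
    # Fill the remainder of the fixed-size pool with sampled distractors
    pool = list(evidence_ids) + rng.sample(distractors, pool_size - len(evidence_ids))
    rng.shuffle(pool)
    return pool

# 25-item haystack: 2 evidence items + 23 sampled distractors
pool = build_niah_pool(["e1", "e2"], [f"m{i}" for i in range(100)], 25)
assert {"e1", "e2"} <= set(pool) and len(pool) == 25
```

Because every ground-truth item is guaranteed to be in the pool, any remaining errors are attributable to answer generation rather than retrieval.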


🚀 Quick Start

Download Dataset

ATM-Bench is hosted on Hugging Face at Jingbiao/ATM-Bench. A one-shot script downloads the full released dataset and stages the files where the evaluation scripts expect them.

Full download (~3.3 GB) includes QA, NIAH pools, preprocessed memory, emails, raw images, raw videos, and the GPS reverse-geocoding cache:

bash scripts/download_data.sh

This populates:

data/atm-bench/atm-bench.json
data/atm-bench/atm-bench-hard.json
data/atm-bench/niah/...
data/raw_memory/email/emails.json                   # emails
data/raw_memory/image/...                           # raw images
data/raw_memory/video/...                           # raw videos
data/raw_memory/geocoding_cache/...                 # GPS reverse-geocoding cache
output/image/qwen3vl2b/batch_results.json           # preprocessed image memory
output/video/qwen3vl2b/batch_results.json           # preprocessed video memory

The HF files data/processed_memory/{image,video}_batch_results.json are automatically renamed/copied into output/image/qwen3vl2b/batch_results.json and output/video/qwen3vl2b/batch_results.json by the script.

The script uses the huggingface_hub Python package (installed automatically if missing). If the dataset is private, run huggingface-cli login first.

Installation

conda create -n atmbench python=3.11 -y
conda activate atmbench
pip install -r requirements.txt
pip install -e .

API Keys

Set via environment variables:

export OPENAI_API_KEY="your-key"
export VLLM_API_KEY="your-key"

Or use local key files (gitignored):

  • api_keys/.openai_key
  • api_keys/.vllm_key

Prepare Memory Files

Before running baselines, the image/video batch_results.json files must exist under output/{image,video}/qwen3vl2b/. You have two options:

Option A (recommended): download the preprocessed memory from Hugging Face.

If you already ran bash scripts/download_data.sh above, the preprocessed memory files are already staged at:

  • output/image/qwen3vl2b/batch_results.json
  • output/video/qwen3vl2b/batch_results.json

Nothing more to do: you can skip straight to the Quick commands.

Option B: regenerate the memory files from raw images/videos.

Only needed if you want to re-run preprocessing (for example, to try a different VLM or your own raw memory). Requires raw images under data/raw_memory/image/ and videos under data/raw_memory/video/:

# Optional but recommended: preload reverse-geocoding cache
# Cache files are keyed by media filename stem, so the cache bundle must match
# the current image/video filenames.
bash scripts/memory_processor/image/copy_gps_cache.sh output/image/qwen3vl2b/cache
bash scripts/memory_processor/video/copy_gps_cache.sh output/video/qwen3vl2b/cache

# Generate memory itemization results
bash scripts/memory_processor/image/memory_itemize/run_qwen3vl2b.sh
bash scripts/memory_processor/video/memory_itemize/run_qwen3vl2b.sh

Quick commands (MMRAG + Oracle)

# MMRAG (runs both ATM-bench and ATM-bench-hard)
#   Needs: `bash scripts/download_data.sh`
#        + a running vLLM endpoint at http://127.0.0.1:8000/v1/chat/completions
#          serving Qwen/Qwen3-VL-8B-Instruct-FP8 (override with VLLM_ENDPOINT /
#          ANSWERER_MODEL env vars).
bash scripts/QA_Agent/MMRAG/run.sh

# Oracle with Qwen3-VL-8B on raw images/videos (local upper bound)
#   Needs: `bash scripts/download_data.sh`
#        + a running vLLM endpoint serving Qwen/Qwen3-VL-8B-Instruct-FP8.
bash scripts/QA_Agent/Oracle/run_oracle_qwen3vl8b_raw.sh

# Oracle with GPT-5 on raw images/videos (no local GPU / vLLM)
#   Needs: `bash scripts/download_data.sh`
#        + OPENAI_API_KEY set in the environment or api_keys/.openai_key.
bash scripts/QA_Agent/Oracle/run_oracle_gpt5.sh

Baseline Compatibility and Environments

  • Core baselines (MMRAG, Oracle, NIAH) are tested in the main atmbench environment.
  • Third-party memory-system baselines in this repo include:
    • A-Mem
    • HippoRAG2
    • mem0
    • MemoryOS
  • Running MemoryOS in a separate conda environment is strongly recommended.
  • A-Mem, HippoRAG2, and mem0 have been tested as compatible with the core baseline environment, but separate environments are still safer for reproducibility and dependency isolation.
  • Setup references for these baselines are under third_party/:
    • third_party/A-mem/
    • third_party/HippoRAG/
    • third_party/mem0/
    • third_party/MemoryOS/
  • OpenClaw support is planned; we will shortly release the evaluation setup for all General-Purpose Agents (Claude Code, Codex, OpenCode, OpenClaw) on ATM-Bench.

For detailed setup, data layout, and reproducibility settings, see the documentation under docs/.

πŸ“ Repository Structure

ATMBench/
├── memqa/              # Core memory QA implementation
├── scripts/            # Experiment scripts
├── docs/               # Documentation
├── data/               # Data directory (user-provided)
├── third_party/        # Vendored agentic memory systems
└── output/             # Experiment outputs (gitignored)


📖 Citation

If you use ATM-Bench in your research, please cite:

@article{mei2026atm,
  title={According to Me: Long-Term Personalized Referential Memory QA},
  author={Mei, Jingbiao and Chen, Jinghong and Yang, Guangyu and Hou, Xinyu and Li, Margaret and Byrne, Bill},
  journal={arXiv preprint arXiv:2603.01990},
  year={2026},
  url={https://arxiv.org/abs/2603.01990},
  doi={10.48550/arXiv.2603.01990}
}


πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.
