HealthFlow is a research framework for self-evolving task execution with a four-stage Meta -> Executor -> Evaluator -> Reflector loop. The core runtime is organized around planning, CodeAct-style execution, structured evaluation, per-task runtime artifacts, and long-term reflective memory. Dataset preparation and benchmark evaluation workflows can still live in the repository under data/, but they are intentionally decoupled from the healthflow/ runtime package.
- structured Meta planning with EHR-adaptive memory retrieval
- Executor as a CodeAct runtime over external executor backends
- Evaluator-driven retry and failure diagnosis
- Reflector writeback from both successful and failed trajectories
- inspectable workspace artifacts and run telemetry
HealthFlow compares external coding agents through a shared executor abstraction. The maintained built-in backends are claude_code, codex, opencode, and pi, with opencode as the default.
HealthFlow currently ships three user-facing interfaces:
- non-interactive CLI: `healthflow run ...`
- interactive CLI: `healthflow interactive`
- web UI: `healthflow web`
HealthFlow runs a lean Meta -> Executor -> Evaluator -> Reflector loop.
- Meta: retrieve relevant safeguards, dataset anchors, workflows, and code snippets, then emit a structured execution plan.
- Executor: interpret the plan as a CodeAct brief and act through code, commands, and workspace artifacts using whatever tools are already configured in the outer executor.
- Evaluator: review the execution trace and produced artifacts, classify the outcome as `success`, `needs_retry`, or `failed`, and provide repair instructions for the next attempt.
- Reflector: synthesize reusable safeguards, workflows, dataset anchors, or code snippets from the full trajectory after the task session ends.
The task-level self-correction budget is controlled by `system.max_attempts`. The value is the total number of full passes through the loop, so `max_attempts = 3` allows at most three attempts (two retries); it is not a retry count.
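In pseudocode, that attempt accounting looks roughly like this. This is a minimal Python sketch: the stage callables and the `Evaluation` dataclass are illustrative stand-ins, not the actual `healthflow/system.py` API.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    outcome: str          # "success", "needs_retry", or "failed"
    repair_notes: str = ""


def run_task(task, plan_fn, execute_fn, evaluate_fn, reflect_fn, max_attempts=3):
    """Meta -> Executor -> Evaluator loop; max_attempts counts total attempts."""
    trajectory = []
    evaluation = Evaluation(outcome="failed")
    for attempt in range(1, max_attempts + 1):
        plan = plan_fn(task, trajectory)       # Meta: plan with retrieved memory
        trace = execute_fn(plan)               # Executor: CodeAct-style execution
        evaluation = evaluate_fn(task, trace)  # Evaluator: structured verdict
        trajectory.append((plan, trace, evaluation))
        if evaluation.outcome != "needs_retry":
            break
    reflect_fn(task, trajectory)               # Reflector: writeback after the session
    return evaluation, len(trajectory)
```

With `max_attempts = 3`, a task that needs two repairs before succeeding consumes all three attempts; the Reflector always sees the full trajectory, including failed attempts.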
- MERF core runtime: the framework is defined by the four-stage Meta, Executor, Evaluator, Reflector (MERF) loop rather than by an outer benchmark-evaluation pipeline.
- Lean execution contract: HealthFlow defines workspace rules, execution-environment defaults, and workflow recommendations without becoming a tool-hosting framework.
- Inspectable memory: safeguards, workflows, dataset anchors, and code snippets are stored in JSONL, retrieved through bounded adaptive per-type ranges, and exposed through a saved retrieval audit.
- Evaluator-centered recovery: retries are driven by structured failure diagnosis and repair instructions instead of a single scalar score alone.
- Reproducibility contract: every task workspace writes structured runtime artifacts instead of only human-readable logs.
- Executor telemetry: run artifacts capture executor metadata, backend versions when available, LLM usage, executor usage, and stage-level estimated cost summaries.
- Role-specific runtime models: planner, evaluator, reflector, and executor can be configured against different model entries to reduce single-model coupling.
Runtime state lives under workspace/ by default:
- shared app log: `workspace/healthflow.log`
- task artifacts: `workspace/tasks/<task_id>/`
- long-term memory: `workspace/memory/experience.jsonl`
Dataset preparation and benchmark evaluation assets remain under data/; they are outside the healthflow/ package boundary.
Each task creates a workspace under workspace/tasks/<task_id>/ and writes:
- `sandbox/`: executor-visible inputs and produced deliverables only
  - Pi runs also materialize `.healthflow_pi_agent/` here when that backend is active
- `runtime/index.json`
- `runtime/events.jsonl`
- `runtime/run/summary.json`
- `runtime/run/trajectory.json`
- `runtime/run/costs.json`
- `runtime/run/final_evaluation.json`
- `runtime/attempts/attempt_*/`
  - `planner/`: input messages, raw output, parsed output, call metadata, repair trace, plan markdown
  - `executor/`: prompt, command, stdout, stderr, combined log, telemetry, usage, artifact index
  - `evaluator/`: input messages, raw output, parsed output, call metadata, repair trace
When `healthflow run ... --report` is enabled, the same workspace also writes `runtime/report.md`.
These files are the main source of truth for rebuttal-oriented inspection.
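As an illustration of how these artifacts can be consumed, the sketch below loads `runtime/run/summary.json` and the `runtime/events.jsonl` event stream for one task workspace. The helper name and return shape are hypothetical conveniences, not a HealthFlow API; only the file paths come from the layout above.

```python
import json
from pathlib import Path


def load_run(task_dir: str) -> dict:
    """Collect the structured run artifacts for one task workspace.

    Reads runtime/run/summary.json plus the runtime/events.jsonl event
    stream; missing files are tolerated so partial runs stay inspectable.
    """
    root = Path(task_dir) / "runtime"
    run = {"summary": None, "events": []}
    summary_path = root / "run" / "summary.json"
    if summary_path.exists():
        run["summary"] = json.loads(summary_path.read_text())
    events_path = root / "events.jsonl"
    if events_path.exists():
        run["events"] = [
            json.loads(line)
            for line in events_path.read_text().splitlines()
            if line.strip()
        ]
    return run
```

The same pattern extends naturally to `trajectory.json`, `costs.json`, and the per-attempt `planner/`, `executor/`, and `evaluator/` directories.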
runtime/report.md is a HealthFlow-generated markdown report designed for end users such as Health Data Scientists. It renders the run as a short-paper-style narrative with sections such as Abstract, Problem, HealthFlow Analysis, Execution, Results, and Conclusion, while keeping runtime JSON/log links in a compact audit appendix.
- Core runtime: the MERF loop in `healthflow/system.py`.
- Domain specialization: EHR-specific helpers under `healthflow/ehr/`.
- Dataset prep and benchmark evaluation: repository-level workflows under `data/`, intentionally decoupled from `healthflow/`.
The framework package is focused on taking a task, executing it, improving task success rate across attempts, and writing inspectable artifacts and reports for each task run.
HealthFlow uses four first-class memory classes:
- `safeguard`
- `workflow`
- `dataset_anchor`
- `code_snippet`
Retrieval is inspectable:
- retrieval is conditioned on task family, dataset signature, schema tags, and EHR risk tags
- retrieval uses bounded per-type ranges instead of a fixed top-k lane layout
- safeguards are retrieved only when their `risk_tags` match current actionable EHR risks
- dataset anchors are retrieved only under exact dataset match
- workflows and code snippets are retrieved by task-family, schema, category, and implementation relevance
- contradictions are mitigated by suppressing only same-kind memories that share the same category and scope
- complementary memories across kinds can coexist, with safeguards taking precedence only if execution behavior conflicts
- the retrieval audit is saved per attempt under `runtime/attempts/attempt_*/memory/retrieval_result.json`
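The per-kind gating rules can be sketched as a simple filter. This is illustrative only: the flat dict shape and field names such as `task_family` are assumptions, not the actual memory schema, and the real retrieval also applies bounded per-type ranges and suppression.

```python
def retrieve(memories, task):
    """Illustrative per-kind retrieval gate over memory records.

    `memories` are dicts with a "kind" plus kind-specific fields; `task`
    carries the current dataset id, actionable EHR risk tags, and family.
    """
    selected = []
    for m in memories:
        if m["kind"] == "safeguard":
            # safeguards only when their risk_tags match current risks
            if set(m.get("risk_tags", [])) & set(task["risk_tags"]):
                selected.append(m)
        elif m["kind"] == "dataset_anchor":
            # dataset anchors only under exact dataset match
            if m.get("dataset") == task["dataset"]:
                selected.append(m)
        elif m["kind"] in ("workflow", "code_snippet"):
            # workflows/snippets by task-family relevance (simplified here)
            if m.get("task_family") == task["task_family"]:
                selected.append(m)
    return selected
```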
Writeback behavior:
- successful tasks can write `workflow`, `dataset_anchor`, and `code_snippet` memory
- recovered tasks can write one `safeguard`, plus one corrected reusable `workflow` or `code_snippet`
- failed tasks write `safeguard` memory only
- safeguard writeback is reserved for EHR risk-prevention knowledge such as cohort definition, temporal leakage, patient linkage, identifier misuse, unsafe missingness handling, clinically implausible aggregation, and analysis-contract violations
- retrieved memories can be explicitly `validated` or `retired` based on later trajectories
- memories are not retired automatically by retrieval-time competition; retirement requires explicit later evidence
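A compact way to read the writeback rules is as a mapping from task outcome to the memory kinds that attempt may write. This is an illustrative sketch: the `recovered` label (success after at least one evaluator-driven retry) and the function name are not HealthFlow identifiers, and the real policy also caps counts per kind.

```python
def allowed_writeback(outcome: str) -> set[str]:
    """Map a task outcome to the memory kinds it may write."""
    if outcome == "success":
        # clean successes contribute reusable knowledge, never safeguards
        return {"workflow", "dataset_anchor", "code_snippet"}
    if outcome == "recovered":
        # one safeguard plus one corrected reusable workflow or snippet
        return {"safeguard", "workflow", "code_snippet"}
    if outcome == "failed":
        # failures only contribute risk-prevention knowledge
        return {"safeguard"}
    raise ValueError(f"unknown outcome: {outcome}")
```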
HealthFlow keeps the executor layer backend-agnostic, but the public surface is intentionally small:
- `opencode` (default)
- `claude_code`
- `codex`
- `pi`
You can still define additional CLI backends in config.toml, but the harness logic stays in HealthFlow rather than being baked into one external backend.
Executor-specific repository instruction files are intentionally avoided at the repo root so backend comparisons use the same injected prompt guidance.
HealthFlow does not implement an internal MCP registry, plugin framework, or large CLI catalog. Tool availability belongs to the outer executor layer such as Claude Code, OpenCode, Pi, or Codex.
HealthFlow only supplies:
- a lightweight execution-environment contract
- small workflow recommendations
- documentation recipes for selected external CLIs
When external CLIs are part of the supported workflow, prefer declaring them in this project's pyproject.toml and installing them into the shared repo .venv. Executor backends should use that same project environment rather than ad hoc global tool installs.
Executor defaults are configured for normal text output. HealthFlow does not require external backends to finish in JSON. Structured event streams remain optional backend-specific telemetry modes.
run_benchmark.py always forces memory.write_policy = "freeze" so benchmark evaluation remains decoupled from the framework's self-evolving writeback behavior.
- Python 3.12+
- `uv`
- one execution backend available in `PATH`
  - default: `opencode`
  - alternatives: `claude`, `codex`, `pi`
```shell
uv sync
source .venv/bin/activate
export ZENMUX_API_KEY="your_zenmux_key_here"
export DEEPSEEK_API_KEY="your_deepseek_key_here"
```

The repo already ships a ready-to-edit `config.toml`. Update that file with the model entries you want to expose to HealthFlow. If you prefer to write your own from scratch, use the same shape and keep secrets in `api_key_env`:
```toml
[llm."deepseek/deepseek-chat"]
api_key_env = "DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com"
model_name = "deepseek-chat"
executor_model_name = "deepseek-chat"
executor_provider = "deepseek"
executor_provider_base_url = "https://api.deepseek.com/anthropic"
executor_provider_api = "anthropic-messages"
executor_provider_api_key_env = "DEEPSEEK_API_KEY"
input_cost_per_million_tokens = 0.28
output_cost_per_million_tokens = 0.43

[llm."deepseek/deepseek-reasoner"]
api_key_env = "DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com"
model_name = "deepseek-reasoner"
reasoning_effort = "high"
executor_model_name = "deepseek-reasoner"
executor_provider = "deepseek"
executor_provider_base_url = "https://api.deepseek.com/anthropic"
executor_provider_api = "anthropic-messages"
executor_provider_api_key_env = "DEEPSEEK_API_KEY"

[llm."openai/gpt-5.4"]
api_key_env = "ZENMUX_API_KEY"
base_url = "https://zenmux.ai/api/v1"
model_name = "openai/gpt-5.4"
input_cost_per_million_tokens = 2.50
output_cost_per_million_tokens = 15.00

[llm."google/gemini-3-flash-preview"]
api_key_env = "ZENMUX_API_KEY"
base_url = "https://zenmux.ai/api/v1"
model_name = "google/gemini-3-flash-preview"
input_cost_per_million_tokens = 0.50
output_cost_per_million_tokens = 3.00

[runtime]
planner_llm = "deepseek/deepseek-chat"
evaluator_llm = "openai/gpt-5.4"
reflector_llm = "google/gemini-3-flash-preview"
executor_llm = "deepseek/deepseek-chat"
```

`api_key` still works for inline secrets, but `api_key_env` is the recommended path. Use quoted TOML table names for model keys that contain `/`.
If you want estimated LLM cost summaries in run artifacts, set input_cost_per_million_tokens and output_cost_per_million_tokens for any model entry used by the planner, evaluator, or reflector in config.toml. If those fields are omitted, HealthFlow skips cost estimation for that model. opencode executor runs also record per-step executor token usage and estimated executor cost when the CLI returns structured telemetry.
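The estimation itself is simple arithmetic over token counts and the per-million-token rates. Below is a sketch of the presumed formula, not HealthFlow's exact accounting code:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_cost_per_million: float,
                      output_cost_per_million: float) -> float:
    """Estimated per-call cost in USD from token counts and per-million rates."""
    return (prompt_tokens * input_cost_per_million
            + completion_tokens * output_cost_per_million) / 1_000_000
```

With the `deepseek/deepseek-chat` rates above (0.28 input / 0.43 output), a call consuming 120k prompt tokens and 40k completion tokens comes to roughly $0.0508.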
By default, the active executor inherits the same model_name as the selected runtime.executor_llm, except for codex, which is pinned to openai/gpt-5.4 in the repo defaults because that is the only Codex model/provider path currently verified in this setup. Override the executor-side model only if you explicitly want the planner/evaluator model and the backend model to diverge for an experiment.
For official DeepSeek models, HealthFlow also inherits executor-specific routing fields. opencode uses its builtin deepseek provider directly, so no custom provider override is required if your DeepSeek credential is already configured in opencode. pi and claude_code inherit the same DeepSeek model but route through DeepSeek's Anthropic-compatible endpoint.
The built-in executor defaults also enable reasoning-oriented modes out of the box:
- `opencode`: `--variant high --format json`
- `codex`: `model_reasoning_effort="high"` and `model_reasoning_summary="detailed"`
- `pi`: `--thinking high`
- `claude_code`: `--effort high`
These are still ordinary backend settings in config.toml, so you can override them per executor for large experiment sweeps.
Example executor configuration with ZenMux-backed defaults:
```toml
[executor.backends.opencode]
binary = "opencode"
args = ["run", "--variant", "$reasoning_effort", "--format", "json"]
reasoning_effort = "high"
model_flag = "-m"
model_template = "$provider/$model"
provider = "zenmux"

[executor.backends.codex]
binary = "codex"
args = ["exec", "--skip-git-repo-check", "--color", "never", "--dangerously-bypass-approvals-and-sandbox"]
arg_templates = ["-c", "model_provider=\"$provider\"", "-c", "model_providers.$provider={name=\"ZenMux\", base_url=\"$provider_base_url\", env_key=\"$provider_api_key_env\", wire_api=\"responses\"}", "-c", "model_reasoning_effort=\"$reasoning_effort\"", "-c", "model_reasoning_summary=\"detailed\""]
reasoning_effort = "high"
model = "openai/gpt-5.4"
model_flag = "-m"
inherit_executor_llm = false
provider = "zenmux"
provider_base_url = "https://zenmux.ai/api/v1"
provider_api_key_env = "ZENMUX_API_KEY"

[executor.backends.pi]
binary = "pi"
args = ["--print", "--thinking", "$reasoning_effort"]
reasoning_effort = "high"
provider_flag = "--provider"
model_flag = "--model"
provider = "zenmux"
provider_base_url = "https://zenmux.ai/api/v1"
provider_api = "openai-completions"
provider_api_key_env = "ZENMUX_API_KEY"

[executor.backends.claude_code]
binary = "claude"
args = ["--bare", "--setting-sources", "local", "--dangerously-skip-permissions", "--print", "--output-format", "text", "--effort", "$reasoning_effort"]
reasoning_effort = "high"
env = { CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1" }
model_flag = "--model"
provider = "zenmux"
provider_base_url = "https://zenmux.ai/api/anthropic"
provider_api = "anthropic-messages"
provider_api_key_env = "ZENMUX_API_KEY"
```

HealthFlow also exposes a small execution-environment contract:
```toml
[environment]
python_version = "3.12"
package_manager = "uv"
install_command = "uv add"
run_prefix = "uv run"
```

To select explicit runtime models, set:
```toml
[runtime]
planner_llm = "deepseek/deepseek-chat"
evaluator_llm = "openai/gpt-5.4"
reflector_llm = "google/gemini-3-flash-preview"
executor_llm = "deepseek/deepseek-chat"
```

Any model named in `[runtime]` must also be declared under `[llm]`.
CLI flags override config.toml for the matching role:
- `--planner-llm`
- `--evaluator-llm`
- `--reflector-llm`
- `--executor-llm`
Legacy tool-registration sections are intentionally unsupported. If you previously configured CLI or MCP tools inside HealthFlow, move that setup into the outer executor and keep only the environment defaults above in HealthFlow.
HealthFlow may surface selected external CLIs when they are available in the project environment, but it does not install, register, or invoke them directly.
In orchestrated runs, the planner and executor prompts receive the applicable local CLI contracts that HealthFlow can resolve from the project environment today, such as oneehr or tu / tooluniverse.
Applicability still matters: oneehr is mainly useful for EHR workflows, while ToolUniverse is mainly useful for biomedical tool lookup and execution.
ToolUniverse CLI examples:
```shell
uv run tu list
uv run tu find "pathway analysis"
uv run tu info <tool-name>
uv run tu run <tool-name> --help
uv run tu status
uv run tu serve
```

ToolUniverse also supports a local `.tooluniverse/profile.yaml` workspace and can launch its own MCP server with `tu serve`, but HealthFlow does not manage that MCP surface.
OneEHR CLI examples:
```shell
uv run oneehr preprocess --help
uv run oneehr train --help
uv run oneehr test --help
uv run oneehr analyze --help
uv run oneehr plot --help
uv run oneehr convert --help
```

You can use either invocation style throughout this README:
- packaged CLI: `uv run healthflow ...`
- direct script: `python run_healthflow.py ...`
Use this mode for one-shot runs, scripts, and CI. Each invocation creates a fresh task workspace.
```shell
uv run healthflow run \
  "Analyze the uploaded sales.csv and summarize the top 3 drivers of revenue decline." \
  --active-executor opencode \
  --report
```

```shell
python run_healthflow.py run \
  "Analyze the uploaded sales.csv and summarize the top 3 drivers of revenue decline." \
  --active-executor opencode \
  --report
```

The same CLI can also run EHR-focused prompts used in the paper and arbitrary external-CLI-driven workflows.
When --report is enabled, HealthFlow writes workspace/tasks/<task_id>/runtime/report.md after the run finishes, even for failed runs, so a reviewer can inspect the task outcome from a single paper-style markdown artifact before exporting it to PDF or other formats.
To override the configured runtime models from the CLI, pass any subset of:
--planner-llm, --evaluator-llm, --reflector-llm, --executor-llm.
Use this mode when you want a terminal chat workflow. Follow-up prompts stay on the same task until you use /new.
```shell
uv run healthflow interactive \
  --active-executor opencode
```

```shell
python run_healthflow.py interactive \
  --active-executor opencode
```

Interactive mode now supports a command-aware shell:
- `/help`: show commands and keyboard hints
- `/clear`: clear the terminal and redraw the session banner
- `/new`: start a fresh local session while preserving `workspace/memory/experience.jsonl`
- `/exit`: exit interactive mode
- `exit` / `quit`: aliases for `/exit`
- Type `/` in column 1 to open slash-command suggestions
- `Tab`: complete slash commands
- `ESC ESC`: cancel the current run without leaving the shell
Use this mode when you want a browser-based task session with uploads, trace streaming, and artifact download links. Follow-up messages stay on the same task until you click New Task, and refreshing the page restores that task session.
If you have not installed the web dependency yet, run:
```shell
uv sync --extra web
```

```shell
uv run healthflow web
```

```shell
python run_healthflow.py web
```

Optional flags:

- `--server-name` to change the bind address
- `--server-port` to change the port
- `--share` to request a temporary Gradio share link
- `--root-path` to serve the Gradio UI behind a proxy prefix such as `/app`
For subpath deployments, you can also set GRADIO_ROOT_PATH=/app (or HEALTHFLOW_WEB_ROOT_PATH=/app) before launching healthflow web.
Training data must be JSONL with `qid`, `task`, and `answer` fields.

```shell
python run_training.py data/train_set.jsonl ehrflow_train \
  --active-executor opencode
```

Benchmarking is just batch task execution over the same JSONL task shape used elsewhere in the runtime.
Dataset construction, benchmark-specific preparation, and benchmark-side evaluation are not part of the healthflow/ package and should be handled under data/ or other repo-level tooling.
```shell
python run_benchmark.py path/to/tasks.jsonl experiment_name \
  --active-executor opencode
```

Results are written under `benchmark_results/<dataset>/<executor>/<runtime_selection>/` with per-task copies of the workspace artifacts and dataset-level summary JSON.
For a minimal executor smoke test, use executor_smoke.jsonl with any built-in backend.
- EHRFlowBench is a paper-derived proxy benchmark. The canonical source of truth is the locally rebuilt task prompt plus `processed/expected/<qid>/`, not the original paper metric table.
- `data/ehrflowbench/processed/paper_map.csv` is a local rebuild artifact that records provenance, proxy linkage mode, source-task eligibility, and review status for every canonical task.
- MedAgentBoard is a deterministic workflow benchmark grounded in local TJH and MIMIC demo data prepared under `data/medagentboard/`.
Main config sections:
- `[llm.*]`: model registry entries, with either `api_key` or `api_key_env`
- `[runtime]`: planner/evaluator/reflector/executor model selection
- `[executor]`: default backend and CLI backend definitions
- `[environment]`: lightweight runtime defaults such as preferred Python version and `uv` command prefixes
- `[memory]`: runtime write policy only (`append`, `freeze`, or `reset_before_run`)
- `[evaluation]`: evaluator success threshold
- `[system]`: workspace and task-attempt settings (`workspace_dir`, `max_attempts`)
- `[logging]`: log level and log file
By default, [system].workspace_dir points to workspace/tasks, relative [logging].log_file values resolve under the workspace root (so healthflow.log becomes workspace/healthflow.log), and CLI entrypoints use workspace/memory/experience.jsonl for shared long-term memory unless overridden.
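That resolution rule can be sketched as follows. This is illustrative: the function name is not part of HealthFlow, and it only models the default case where the workspace root is the parent of `workspace_dir`.

```python
from pathlib import Path


def resolve_log_file(workspace_dir: str, log_file: str) -> Path:
    """Resolve a [logging].log_file value against the workspace root.

    Relative paths land under the workspace root (the parent of the
    tasks directory); absolute paths are kept as-is.
    """
    workspace_root = Path(workspace_dir).parent  # workspace/tasks -> workspace/
    log_path = Path(log_file)
    return log_path if log_path.is_absolute() else workspace_root / log_path
```

Under the defaults, `healthflow.log` therefore resolves to `workspace/healthflow.log`.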
- `run_healthflow.py`: non-interactive CLI, interactive CLI, and web UI entrypoint
- `run_training.py`: dataset-style batch runner over task JSONL files
- `run_benchmark.py`: batch task runner over task JSONL files
- `healthflow/system.py`: orchestration loop
- `healthflow/execution/`: executor layer
- `healthflow/ehr/`: optional EHR specialization helpers kept outside the core loop
- `healthflow/experience/`: EHR-adaptive memory and retrieval audit