HealthFlow is a research framework for self-evolving task execution with a four-stage Meta -> Executor -> Evaluator -> Reflector loop. The core runtime is organized around planning, CodeAct-style execution, structured evaluation, per-task runtime artifacts, and long-term reflective memory. Dataset preparation and benchmark evaluation workflows can still live in the repository under data/, but they are intentionally decoupled from the healthflow/ runtime package.
- structured Meta planning with EHR-adaptive memory retrieval
- Executor as a CodeAct runtime over external executor backends
- Evaluator-driven retry and failure diagnosis
- Reflector writeback from both successful and failed trajectories
- inspectable workspace artifacts and run telemetry
HealthFlow compares external coding agents through a shared executor abstraction. The maintained built-in backends are claude_code, codex, opencode, and pi, with opencode as the default.
HealthFlow currently ships three user-facing interfaces:
- non-interactive CLI: `healthflow run ...`
- interactive CLI: `healthflow interactive`
- web UI: `healthflow web`
HealthFlow runs a lean Meta -> Executor -> Evaluator -> Reflector loop.
- Meta: retrieve relevant safeguards, dataset anchors, workflows, and code snippets, then emit a structured execution plan.
- Executor: interpret the plan as a CodeAct brief and act through code, commands, and workspace artifacts using whatever tools are already configured in the outer executor.
- Evaluator: review the execution trace and produced artifacts, classify the outcome as `success`, `needs_retry`, or `failed`, and provide repair instructions for the next attempt.
- Reflector: synthesize reusable safeguards, workflows, dataset anchors, or code snippets from the full trajectory after the task session ends.
The task-level self-correction budget is controlled by `system.max_attempts`. The value is the total number of full passes through the loop, so `max_attempts = 3` allows at most three attempts (two retries); it is not a retry count.
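In pseudocode, that attempt accounting looks roughly like this. This is a minimal Python sketch: the stage callables and the `Evaluation` dataclass are illustrative stand-ins, not the actual `healthflow/system.py` API.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    outcome: str          # "success", "needs_retry", or "failed"
    repair_notes: str = ""


def run_task(task, plan_fn, execute_fn, evaluate_fn, reflect_fn, max_attempts=3):
    """Meta -> Executor -> Evaluator loop; max_attempts counts total attempts."""
    trajectory = []
    evaluation = Evaluation(outcome="failed")
    for attempt in range(1, max_attempts + 1):
        plan = plan_fn(task, trajectory)       # Meta: plan with retrieved memory
        trace = execute_fn(plan)               # Executor: CodeAct-style execution
        evaluation = evaluate_fn(task, trace)  # Evaluator: structured verdict
        trajectory.append((plan, trace, evaluation))
        if evaluation.outcome != "needs_retry":
            break
    reflect_fn(task, trajectory)               # Reflector: writeback after the session
    return evaluation, len(trajectory)
```

With `max_attempts = 3`, a task that needs two repairs before succeeding consumes all three attempts; the Reflector always sees the full trajectory, including failed attempts.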
- MERF core runtime: the framework is defined by the four-stage Meta, Executor, Evaluator, Reflector (MERF) loop rather than by an outer benchmark-evaluation pipeline.
- Lean execution contract: HealthFlow defines workspace rules, execution-environment defaults, and workflow recommendations without becoming a tool-hosting framework.
- Inspectable memory: safeguards, workflows, dataset anchors, and code snippets are stored in JSONL, retrieved through bounded adaptive per-type ranges, and exposed through a saved retrieval audit.
- Evaluator-centered recovery: retries are driven by structured failure diagnosis and repair instructions instead of a single scalar score alone.
- Reproducibility contract: every task workspace writes structured runtime artifacts instead of only human-readable logs.
- Executor telemetry: run artifacts capture executor metadata, backend versions when available, LLM usage, executor usage, and stage-level estimated cost summaries.
- Role-specific runtime models: planner, evaluator, reflector, and executor can be configured against different model entries to reduce single-model coupling.
Runtime state lives under workspace/ by default:
- shared app log: `workspace/healthflow.log`
- task artifacts: `workspace/tasks/<task_id>/`
- long-term memory: `workspace/memory/experience.jsonl`
Dataset preparation and benchmark evaluation assets remain under data/; they are outside the healthflow/ package boundary.
Each task creates a workspace under workspace/tasks/<task_id>/ and writes:
- `sandbox/`: executor-visible inputs and produced deliverables only
  - Pi runs also materialize `.healthflow_pi_agent/` here when that backend is active
- `runtime/index.json`
- `runtime/events.jsonl`
- `runtime/run/summary.json`
- `runtime/run/trajectory.json`
- `runtime/run/costs.json`
- `runtime/run/final_evaluation.json`
- `runtime/attempts/attempt_*/`
  - `planner/`: input messages, raw output, parsed output, call metadata, repair trace, plan markdown
  - `executor/`: prompt, command, stdout, stderr, combined log, telemetry, usage, artifact index
  - `evaluator/`: input messages, raw output, parsed output, call metadata, repair trace
When `healthflow run ... --report` is enabled, the same workspace also writes `runtime/report.md`.
These files are the main source of truth for rebuttal-oriented inspection.
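As an illustration of how these artifacts can be consumed, the sketch below loads `runtime/run/summary.json` and the `runtime/events.jsonl` event stream for one task workspace. The helper name and return shape are hypothetical conveniences, not a HealthFlow API; only the file paths come from the layout above.

```python
import json
from pathlib import Path


def load_run(task_dir: str) -> dict:
    """Collect the structured run artifacts for one task workspace.

    Reads runtime/run/summary.json plus the runtime/events.jsonl event
    stream; missing files are tolerated so partial runs stay inspectable.
    """
    root = Path(task_dir) / "runtime"
    run = {"summary": None, "events": []}
    summary_path = root / "run" / "summary.json"
    if summary_path.exists():
        run["summary"] = json.loads(summary_path.read_text())
    events_path = root / "events.jsonl"
    if events_path.exists():
        run["events"] = [
            json.loads(line)
            for line in events_path.read_text().splitlines()
            if line.strip()
        ]
    return run
```

The same pattern extends naturally to `trajectory.json`, `costs.json`, and the per-attempt `planner/`, `executor/`, and `evaluator/` directories.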
runtime/report.md is a HealthFlow-generated markdown report designed for end users such as Health Data Scientists. It renders the run as a short-paper-style narrative with sections such as Abstract, Problem, HealthFlow Analysis, Execution, Results, and Conclusion, while keeping runtime JSON/log links in a compact audit appendix.
- Core runtime: the MERF loop in `healthflow/system.py`.
- Domain specialization: EHR-specific helpers under `healthflow/ehr/`.
- Dataset prep and benchmark evaluation: repository-level workflows under `data/`, intentionally decoupled from `healthflow/`.
The framework package is focused on taking a task, executing it, improving task success rate across attempts, and writing inspectable artifacts and reports for each task run.
HealthFlow uses four first-class memory classes:
- `safeguard`
- `workflow`
- `dataset_anchor`
- `code_snippet`
Retrieval is inspectable:
- retrieval is conditioned on task family, dataset signature, schema tags, and EHR risk tags
- retrieval uses bounded per-type ranges instead of a fixed top-k lane layout
- safeguards are retrieved only when their `risk_tags` match current actionable EHR risks
- dataset anchors are retrieved only under exact dataset match
- workflows and code snippets are retrieved by task-family, schema, category, and implementation relevance
- contradictions are mitigated by suppressing only same-kind memories that share the same category and scope
- complementary memories across kinds can coexist, with safeguards taking precedence only if execution behavior conflicts
- the retrieval audit is saved per attempt under `runtime/attempts/attempt_*/memory/retrieval_result.json`
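The per-kind gating rules can be sketched as a simple filter. This is illustrative only: the flat dict shape and field names such as `task_family` are assumptions, not the actual memory schema, and the real retrieval also applies bounded per-type ranges and suppression.

```python
def retrieve(memories, task):
    """Illustrative per-kind retrieval gate over memory records.

    `memories` are dicts with a "kind" plus kind-specific fields; `task`
    carries the current dataset id, actionable EHR risk tags, and family.
    """
    selected = []
    for m in memories:
        if m["kind"] == "safeguard":
            # safeguards only when their risk_tags match current risks
            if set(m.get("risk_tags", [])) & set(task["risk_tags"]):
                selected.append(m)
        elif m["kind"] == "dataset_anchor":
            # dataset anchors only under exact dataset match
            if m.get("dataset") == task["dataset"]:
                selected.append(m)
        elif m["kind"] in ("workflow", "code_snippet"):
            # workflows/snippets by task-family relevance (simplified here)
            if m.get("task_family") == task["task_family"]:
                selected.append(m)
    return selected
```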
Writeback behavior:
- successful tasks can write `workflow`, `dataset_anchor`, and `code_snippet` memory
- recovered tasks can write one `safeguard`, plus one corrected reusable `workflow` or `code_snippet`
- failed tasks write `safeguard` memory only
- safeguard writeback is reserved for EHR risk-prevention knowledge such as cohort definition, temporal leakage, patient linkage, identifier misuse, unsafe missingness handling, clinically implausible aggregation, and analysis-contract violations
- retrieved memories can be explicitly `validated` or `retired` based on later trajectories
- memories are not retired automatically by retrieval-time competition; retirement requires explicit later evidence
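A compact way to read the writeback rules is as a mapping from task outcome to the memory kinds that attempt may write. This is an illustrative sketch: the `recovered` label (success after at least one evaluator-driven retry) and the function name are not HealthFlow identifiers, and the real policy also caps counts per kind.

```python
def allowed_writeback(outcome: str) -> set[str]:
    """Map a task outcome to the memory kinds it may write."""
    if outcome == "success":
        # clean successes contribute reusable knowledge, never safeguards
        return {"workflow", "dataset_anchor", "code_snippet"}
    if outcome == "recovered":
        # one safeguard plus one corrected reusable workflow or snippet
        return {"safeguard", "workflow", "code_snippet"}
    if outcome == "failed":
        # failures only contribute risk-prevention knowledge
        return {"safeguard"}
    raise ValueError(f"unknown outcome: {outcome}")
```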
HealthFlow keeps the executor layer backend-agnostic, but the public surface is intentionally small:
- `opencode` (default)
- `claude_code`
- `codex`
- `pi`
You can still define additional CLI backends in config.toml, but the harness logic stays in HealthFlow rather than being baked into one external backend.
Executor-specific repository instruction files are intentionally avoided at the repo root so backend comparisons use the same injected prompt guidance.
HealthFlow does not implement an internal MCP registry, plugin framework, or large CLI catalog. Tool availability belongs to the outer executor layer such as Claude Code, OpenCode, Pi, or Codex.
HealthFlow only supplies:
- a lightweight execution-environment contract
- small workflow recommendations
- documentation recipes for selected external CLIs
When external CLIs are part of the supported workflow, prefer declaring them in this project's pyproject.toml and installing them into the shared repo .venv. Executor backends should use that same project environment rather than ad hoc global tool installs.
Executor defaults are configured for normal text output. HealthFlow does not require external backends to finish in JSON. Structured event streams remain optional backend-specific telemetry modes.
run_benchmark.py always forces memory.write_policy = "freeze" so benchmark evaluation remains decoupled from the framework's self-evolving writeback behavior.
- Python 3.12+
- `uv`
- one execution backend available in `PATH`
  - default: `opencode`
  - alternatives: `claude`, `codex`, `pi`
```shell
uv sync
source .venv/bin/activate
export ZENMUX_API_KEY="your_zenmux_key_here"
export DEEPSEEK_API_KEY="your_deepseek_key_here"
```

The repo already ships a ready-to-edit `config.toml`. Update that file with the model entries you want to expose to HealthFlow. If you prefer to write your own from scratch, use the same shape and keep secrets in `api_key_env`:
```toml
[llm."deepseek/deepseek-chat"]
api_key_env = "DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com"
model_name = "deepseek-chat"
executor_model_name = "deepseek-chat"
executor_provider = "deepseek"
executor_provider_base_url = "https://api.deepseek.com/anthropic"
executor_provider_api = "anthropic-messages"
executor_provider_api_key_env = "DEEPSEEK_API_KEY"
input_cost_per_million_tokens = 0.28
output_cost_per_million_tokens = 0.43

[llm."deepseek/deepseek-reasoner"]
api_key_env = "DEEPSEEK_API_KEY"
base_url = "https://api.deepseek.com"
model_name = "deepseek-reasoner"
reasoning_effort = "high"
executor_model_name = "deepseek-reasoner"
executor_provider = "deepseek"
executor_provider_base_url = "https://api.deepseek.com/anthropic"
executor_provider_api = "anthropic-messages"
executor_provider_api_key_env = "DEEPSEEK_API_KEY"

[llm."openai/gpt-5.4"]
api_key_env = "ZENMUX_API_KEY"
base_url = "https://zenmux.ai/api/v1"
model_name = "openai/gpt-5.4"
input_cost_per_million_tokens = 2.50
output_cost_per_million_tokens = 15.00

[llm."google/gemini-3-flash-preview"]
api_key_env = "ZENMUX_API_KEY"
base_url = "https://zenmux.ai/api/v1"
model_name = "google/gemini-3-flash-preview"
input_cost_per_million_tokens = 0.50
output_cost_per_million_tokens = 3.00

[runtime]
planner_llm = "deepseek/deepseek-chat"
evaluator_llm = "openai/gpt-5.4"
reflector_llm = "google/gemini-3-flash-preview"
executor_llm = "deepseek/deepseek-chat"
```

`api_key` still works for inline secrets, but `api_key_env` is the recommended path. Use quoted TOML table names for model keys that contain `/`.
If you want estimated LLM cost summaries in run artifacts, set input_cost_per_million_tokens and output_cost_per_million_tokens for any model entry used by the planner, evaluator, or reflector in config.toml. If those fields are omitted, HealthFlow skips cost estimation for that model. opencode executor runs also record per-step executor token usage and estimated executor cost when the CLI returns structured telemetry.
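The estimation itself is simple arithmetic over token counts and the per-million-token rates. Below is a sketch of the presumed formula, not HealthFlow's exact accounting code:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_cost_per_million: float,
                      output_cost_per_million: float) -> float:
    """Estimated per-call cost in USD from token counts and per-million rates."""
    return (prompt_tokens * input_cost_per_million
            + completion_tokens * output_cost_per_million) / 1_000_000
```

With the `deepseek/deepseek-chat` rates above (0.28 input / 0.43 output), a call consuming 120k prompt tokens and 40k completion tokens comes to roughly $0.0508.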
By default, the active executor inherits the same model_name as the selected runtime.executor_llm, except for codex, which is pinned to openai/gpt-5.4 in the repo defaults because that is the only Codex model/provider path currently verified in this setup. Override the executor-side model only if you explicitly want the planner/evaluator model and the backend model to diverge for an experiment.
For official DeepSeek models, HealthFlow also inherits executor-specific routing fields. opencode uses its builtin deepseek provider directly, so no custom provider override is required if your DeepSeek credential is already configured in opencode. pi and claude_code inherit the same DeepSeek model but route through DeepSeek's Anthropic-compatible endpoint.
The built-in executor defaults also enable reasoning-oriented modes out of the box:
- `opencode`: `--variant high --format json`
- `codex`: `model_reasoning_effort="high"` and `model_reasoning_summary="detailed"`
- `pi`: `--thinking high`
- `claude_code`: `--effort high`
These are still ordinary backend settings in config.toml, so you can override them per executor for large experiment sweeps.
Example executor configuration with ZenMux-backed defaults:
```toml
[executor.backends.opencode]
binary = "opencode"
args = ["run", "--variant", "$reasoning_effort", "--format", "json"]
reasoning_effort = "high"
model_flag = "-m"
model_template = "$provider/$model"
provider = "zenmux"

[executor.backends.codex]
binary = "codex"
args = ["exec", "--skip-git-repo-check", "--color", "never", "--dangerously-bypass-approvals-and-sandbox"]
arg_templates = ["-c", "model_provider=\"$provider\"", "-c", "model_providers.$provider={name=\"ZenMux\", base_url=\"$provider_base_url\", env_key=\"$provider_api_key_env\", wire_api=\"responses\"}", "-c", "model_reasoning_effort=\"$reasoning_effort\"", "-c", "model_reasoning_summary=\"detailed\""]
reasoning_effort = "high"
model = "openai/gpt-5.4"
model_flag = "-m"
inherit_executor_llm = false
provider = "zenmux"
provider_base_url = "https://zenmux.ai/api/v1"
provider_api_key_env = "ZENMUX_API_KEY"

[executor.backends.pi]
binary = "pi"
args = ["--print", "--thinking", "$reasoning_effort"]
reasoning_effort = "high"
provider_flag = "--provider"
model_flag = "--model"
provider = "zenmux"
provider_base_url = "https://zenmux.ai/api/v1"
provider_api = "openai-completions"
provider_api_key_env = "ZENMUX_API_KEY"

[executor.backends.claude_code]
binary = "claude"
args = ["--bare", "--setting-sources", "local", "--dangerously-skip-permissions", "--print", "--output-format", "text", "--effort", "$reasoning_effort"]
reasoning_effort = "high"
env = { CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1" }
model_flag = "--model"
provider = "zenmux"
provider_base_url = "https://zenmux.ai/api/anthropic"
provider_api = "anthropic-messages"
provider_api_key_env = "ZENMUX_API_KEY"
```

HealthFlow also exposes a small execution-environment contract:
```toml
[environment]
python_version = "3.12"
package_manager = "uv"
install_command = "uv add"
run_prefix = "uv run"
```

To select explicit runtime models, set:
```toml
[runtime]
planner_llm = "deepseek/deepseek-chat"
evaluator_llm = "openai/gpt-5.4"
reflector_llm = "google/gemini-3-flash-preview"
executor_llm = "deepseek/deepseek-chat"
```

Any model named in `[runtime]` must also be declared under `[llm]`.
CLI flags override config.toml for the matching role:
- `--planner-llm`
- `--evaluator-llm`
- `--reflector-llm`
- `--executor-llm`
Legacy tool-registration sections are intentionally unsupported. If you previously configured CLI or MCP tools inside HealthFlow, move that setup into the outer executor and keep only the environment defaults above in HealthFlow.
HealthFlow may surface selected external CLIs when they are available in the project environment, but it does not install, register, or invoke them directly.
In orchestrated runs, the planner and executor prompts receive the applicable local CLI contracts that HealthFlow can resolve from the project environment today, such as oneehr or tu / tooluniverse.
Applicability still matters: oneehr is mainly useful for EHR workflows, while ToolUniverse is mainly useful for biomedical tool lookup and execution.
ToolUniverse CLI examples:
```shell
uv run tu list
uv run tu find "pathway analysis"
uv run tu info <tool-name>
uv run tu run <tool-name> --help
uv run tu status
uv run tu serve
```

ToolUniverse also supports a local `.tooluniverse/profile.yaml` workspace and can launch its own MCP server with `tu serve`, but HealthFlow does not manage that MCP surface.
OneEHR CLI examples:
```shell
uv run oneehr preprocess --help
uv run oneehr train --help
uv run oneehr test --help
uv run oneehr analyze --help
uv run oneehr plot --help
uv run oneehr convert --help
```

You can use either invocation style throughout this README:
- packaged CLI: `uv run healthflow ...`
- direct script: `python run_healthflow.py ...`
Use this mode for one-shot runs, scripts, and CI. Each invocation creates a fresh task workspace.
```shell
uv run healthflow run \
  "Analyze the uploaded sales.csv and summarize the top 3 drivers of revenue decline." \
  --active-executor opencode \
  --report
```

```shell
python run_healthflow.py run \
  "Analyze the uploaded sales.csv and summarize the top 3 drivers of revenue decline." \
  --active-executor opencode \
  --report
```

The same CLI can also run EHR-focused prompts used in the paper and arbitrary external-CLI-driven workflows.
When --report is enabled, HealthFlow writes workspace/tasks/<task_id>/runtime/report.md after the run finishes, even for failed runs, so a reviewer can inspect the task outcome from a single paper-style markdown artifact before exporting it to PDF or other formats.
To override the configured runtime models from the CLI, pass any subset of:
--planner-llm, --evaluator-llm, --reflector-llm, --executor-llm.
Use this mode when you want a terminal chat workflow. Follow-up prompts stay on the same task until you use /new.
```shell
uv run healthflow interactive \
  --active-executor opencode
```

```shell
python run_healthflow.py interactive \
  --active-executor opencode
```

Interactive mode now supports a command-aware shell:
- `/help`: show commands and keyboard hints
- `/clear`: clear the terminal and redraw the session banner
- `/new`: start a fresh local session while preserving `workspace/memory/experience.jsonl`
- `/exit`: exit interactive mode
- `exit` / `quit`: aliases for `/exit`
- Type `/` in column 1 to open slash-command suggestions
- `Tab`: complete slash commands
- `ESC ESC`: cancel the current run without leaving the shell
Use this mode when you want a browser-based task session with uploads, trace streaming, and artifact download links. Follow-up messages stay on the same task until you click New Task, and refreshing the page restores that task session.
If you have not installed the web dependency yet, run:
```shell
uv sync --extra web
```

```shell
uv run healthflow web
```

```shell
python run_healthflow.py web
```

Optional flags:

- `--server-name` to change the bind address
- `--server-port` to change the port
- `--share` to request a temporary Gradio share link
- `--root-path` to serve the Gradio UI behind a proxy prefix such as `/app`
For subpath deployments, you can also set GRADIO_ROOT_PATH=/app (or HEALTHFLOW_WEB_ROOT_PATH=/app) before launching healthflow web.
Training data must be JSONL with `qid`, `task`, and `answer` fields.

```shell
python run_training.py data/train_set.jsonl ehrflow_train \
  --active-executor opencode
```

Benchmarking is just batch task execution over the same JSONL task shape used elsewhere in the runtime.
Dataset construction, benchmark-specific preparation, and benchmark-side evaluation are not part of the healthflow/ package and should be handled under data/ or other repo-level tooling.
```shell
python run_benchmark.py path/to/tasks.jsonl experiment_name \
  --active-executor opencode
```

Results are written under `benchmark_results/<dataset>/<executor>/<runtime_selection>/` with per-task copies of the workspace artifacts and dataset-level summary JSON.
For a minimal executor smoke test, use executor_smoke.jsonl with any built-in backend.
- EHRFlowBench is a paper-derived proxy benchmark. The canonical source of truth is the locally rebuilt task prompt plus `processed/expected/<qid>/`, not the original paper metric table.
- `data/ehrflowbench/processed/paper_map.csv` is a local rebuild artifact that records provenance, proxy linkage mode, source-task eligibility, and review status for every canonical task.
- MedAgentBoard is a deterministic workflow benchmark grounded in local TJH and MIMIC demo data prepared under `data/medagentboard/`.
Main config sections:
- `[llm.*]`: model registry entries, with either `api_key` or `api_key_env`
- `[runtime]`: planner/evaluator/reflector/executor model selection
- `[executor]`: default backend and CLI backend definitions
- `[environment]`: lightweight runtime defaults such as preferred Python version and `uv` command prefixes
- `[memory]`: runtime write policy only (`append`, `freeze`, or `reset_before_run`)
- `[evaluation]`: evaluator success threshold
- `[system]`: workspace and task-attempt settings (`workspace_dir`, `max_attempts`)
- `[logging]`: log level and log file
By default, [system].workspace_dir points to workspace/tasks, relative [logging].log_file values resolve under the workspace root (so healthflow.log becomes workspace/healthflow.log), and CLI entrypoints use workspace/memory/experience.jsonl for shared long-term memory unless overridden.
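That resolution rule can be sketched as follows. This is illustrative: the function name is not part of HealthFlow, and it only models the default case where the workspace root is the parent of `workspace_dir`.

```python
from pathlib import Path


def resolve_log_file(workspace_dir: str, log_file: str) -> Path:
    """Resolve a [logging].log_file value against the workspace root.

    Relative paths land under the workspace root (the parent of the
    tasks directory); absolute paths are kept as-is.
    """
    workspace_root = Path(workspace_dir).parent  # workspace/tasks -> workspace/
    log_path = Path(log_file)
    return log_path if log_path.is_absolute() else workspace_root / log_path
```

Under the defaults, `healthflow.log` therefore resolves to `workspace/healthflow.log`.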
- `run_healthflow.py`: non-interactive CLI, interactive CLI, and web UI entrypoint
- `run_training.py`: dataset-style batch runner over task JSONL files
- `run_benchmark.py`: batch task runner over task JSONL files
- `healthflow/system.py`: orchestration loop
- `healthflow/execution/`: executor layer
- `healthflow/ehr/`: optional EHR specialization helpers kept outside the core loop
- `healthflow/experience/`: EHR-adaptive memory and retrieval audit