Idle agents trigger supervisor shutdown after ~30 minutes #904
Description
Problem
With 24 agents (16 manifest + 8 hands) and no active user messages, the daemon shuts itself down every ~30 minutes via SIGTERM from the internal supervisor:
INFO openfang_api::server: Received SIGTERM, shutting down...
INFO openfang_kernel::kernel: Shutting down OpenFang kernel...
INFO openfang_kernel::supervisor: Supervisor: initiating graceful shutdown
Before the shutdown, the heartbeat monitor logs warnings for every idle agent:
WARN openfang_kernel::heartbeat: Agent is unresponsive agent=cfo inactive_secs=210 timeout_secs=180
WARN openfang_kernel::heartbeat: Agent is unresponsive agent=researcher inactive_secs=210 timeout_secs=180
The 180s timeout and 30s heartbeat interval appear hardcoded — setting [timeouts] heartbeat = 86400 in config.toml is accepted by openfang config set but has no effect on the monitor.
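For anyone trying to reproduce this, here is a minimal sketch for pulling the idle figures out of the heartbeat warnings quoted above. It assumes the log line format matches those two samples exactly; the regex and the function name `parse_unresponsive` are illustrative and not part of OpenFang.

```python
import re

# Matches the heartbeat WARN lines quoted above. The field order
# (agent, inactive_secs, timeout_secs) is assumed from those samples.
HEARTBEAT_RE = re.compile(
    r"Agent is unresponsive agent=(?P<agent>\S+) "
    r"inactive_secs=(?P<inactive>\d+) timeout_secs=(?P<timeout>\d+)"
)


def parse_unresponsive(lines):
    """Return (agent, inactive_secs, timeout_secs) for each matching line."""
    found = []
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            found.append((m["agent"], int(m["inactive"]), int(m["timeout"])))
    return found


sample = [
    "WARN openfang_kernel::heartbeat: Agent is unresponsive agent=cfo inactive_secs=210 timeout_secs=180",
    "WARN openfang_kernel::heartbeat: Agent is unresponsive agent=researcher inactive_secs=210 timeout_secs=180",
    "INFO openfang_api::server: Received SIGTERM, shutting down...",
]

for agent, inactive, timeout in parse_unresponsive(sample):
    # How far past the timeout each agent was when it got flagged.
    print(f"{agent}: idle {inactive}s, {inactive - timeout}s past the {timeout}s timeout")
```

In both samples the agent is flagged 30s past the 180s timeout, which matches the 30s heartbeat interval mentioned above.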
Expected behavior
Idle agents should remain available indefinitely. A deployment with no active user traffic should not shut itself down.
Environment
- v0.5.4 and v0.5.5 on Linux x86_64 (both affected; v0.5.5 is actually worse: 7 shutdowns in 35 min vs ~1 per 30 min on v0.5.4)
- 24 agents, 8 hands (hand_interval=3600), 5 MCP servers
- Model routing via OpenAI-compatible proxy (not direct provider API)
- Docker with restart: unless-stopped
Related
Issue #766 was closed as "resolved by heartbeat fixes" in v0.5.4, but the problem persists. The heartbeat still flags idle agents as unresponsive after 180s and eventually the supervisor decides to shut down.
Workaround
External keepalive cron that sends a lightweight /ping message to agents every 2 minutes. This prevents the "unresponsive" warnings but doesn't fully prevent the supervisor SIGTERM (reduced from ~6/hour to ~2/hour).
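A rough sketch of the keepalive loop, for illustration only. How a message is delivered to an agent is not covered in this issue, so delivery is left to a caller-supplied `send(agent_name, message)` callable. That callable, along with `keepalive_round` and `run_keepalive`, is a hypothetical name and not an OpenFang API.

```python
import time


def keepalive_round(agents, send):
    """Send one "/ping" to every agent; return the names whose send failed."""
    failed = []
    for name in agents:
        try:
            # `send` is a placeholder for whatever transport the deployment uses.
            send(name, "/ping")
        except Exception:
            failed.append(name)
    return failed


def run_keepalive(agents, send, interval_secs=120, rounds=None, sleep=time.sleep):
    """Repeat keepalive_round every interval_secs; rounds=None runs forever."""
    done = 0
    while rounds is None or done < rounds:
        failed = keepalive_round(agents, send)
        if failed:
            print("keepalive failed for:", ", ".join(failed))
        done += 1
        # Sleep between rounds, but not after the final one.
        if rounds is None or done < rounds:
            sleep(interval_secs)
```

The 120s default matches the 2-minute interval above and stays under the 180s timeout seen in the heartbeat warnings. A failed send is recorded and the round continues, so one unreachable agent does not stop pings to the rest.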
Questions
- Is there a config key to increase or disable the heartbeat timeout? [timeouts] heartbeat doesn't seem to work.
- Is the supervisor shutdown triggered by a threshold of unresponsive agents, or something else?
- Should agents that haven't received user messages be considered "unresponsive"? They're available and ready — just idle.