Description
We've seen some long(er)-running sessions (~20m) that fail unexpectedly and become fully unresponsive, despite having autoRestart enabled.
I was able to get one of these to happen in a Docker container, so I had Opus dig around and leave some notes. Not sure how helpful this will be to y'all but leaving it here just in case:
FWIW: I think it's correct. Node's .on("exit") is generally for healthy / intentional exits.
Root Cause: The Copilot CLI subprocess (PID 181, MainThread) crashed and became a zombie process. The SDK's autoRestart feature should have reconnected, but there are issues:
- Zombie processes: The container has 15+ zombie processes (bash, git, MainThread), all children of the bun process (PID 13). This indicates Node.js isn't properly reaping child processes.
- Timeline of failure:
  - Last logged event: session.truncation at 18:21:20.501Z
  - Followed by assistant.turn_start for turn 14
  - Then silence: no more events logged
  - The VM restarted twice (seen in the log as duplicate NODE_ENV production lines)
  - But the old zombie processes remain from before the restarts
- Why autoRestart didn't work: The SDK's reconnect logic fires on the exit event:

      this.cliProcess.on("exit", (code) => {
        if (this.options.autoRestart && this.state === "connected") {
          void this.reconnect();
        }
      });

  But if the stdio pipes get corrupted or the process crashes hard (SIGSEGV), the exit handler might not fire correctly.
- Most likely culprit: The Copilot CLI uses native prebuilds (keytar.node, pty.node). A crash in native code (segfault) would explain:
  - Abrupt stop of events after turn 14 started
  - Zombie state (parent didn't get proper exit notification)
  - No error logged
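
To make a failure like this easier to diagnose, the child could be watched on more than just the exit event. The following is a minimal sketch, not the SDK's actual code: `watchChild` and `onGone` are hypothetical names, and it assumes the CLI subprocess is a standard Node `ChildProcess`. It reports the first of "exit", "close", or "error" exactly once, and keeps the signal name so a death by SIGSEGV would show up in logs.

```javascript
// Hypothetical helper (not part of the SDK): report a child's termination
// exactly once, whichever of "exit", "close", or "error" arrives first.
function watchChild(child, onGone) {
  let reported = false;

  const report = (event, code, signal, error) => {
    if (reported) return; // "close" normally follows "exit"; ignore repeats
    reported = true;
    onGone({ event, code, signal, error });
  };

  // When a child dies from a signal (e.g. SIGSEGV), Node passes
  // code === null together with the signal name.
  child.on("exit", (code, signal) => report("exit", code, signal, null));

  // "close" fires once the child's stdio streams have been closed.
  child.on("close", (code, signal) => report("close", code, signal, null));

  // "error" fires when the process could not be spawned or killed.
  child.on("error", (error) => report("error", null, null, error));
}
```

In the handler shown earlier, this might be wired up as `watchChild(this.cliProcess, (info) => { console.error("CLI gone", info); void this.reconnect(); })`, keeping the existing autoRestart and state checks. This only helps if Node emits at least one of these events; if none fires at all, some form of liveness check on the session would still be needed.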