zombie processes & `autoRestart` failures #71

We've seen some long(er)-running sessions (~20m) that fail unexpectedly and become fully unresponsive, despite having `autoRestart` enabled.

I was able to get one of these to happen in a Docker container, so I had Opus dig around and leave some notes. Not sure how helpful this will be to y'all but leaving it here just in case:

> FWIW: I think it's correct. Node's `.on("exit")` is generally for healthy / intentional exits.

---

Root Cause: The Copilot CLI subprocess (PID 181, `MainThread`) crashed and became a zombie process. The SDK's `autoRestart` feature should have reconnected, but there are issues:

1. Zombie processes: The container has 15+ zombie processes (bash, git, MainThread) — all children of the bun process (PID 13). This indicates Node.js isn't properly reaping child processes.

2. Timeline of failure:
   - Last logged event: `session.truncation` at 18:21:20.501Z
   - Followed by `assistant.turn_start` for turn 14
   - Then silence — no more events logged
   - The VM restarted twice (seen in log with duplicate `NODE_ENV production` lines)
   - But the old zombie processes remain from before the restarts
3. Why `autoRestart` didn't work: the SDK's reconnect logic fires on the `exit` event:
   ```ts
   this.cliProcess.on("exit", (code) => {
     if (this.options.autoRestart && this.state === "connected") {
       void this.reconnect();
     }
   });
   ```

   But if the stdio pipes get corrupted or the process crashes hard (SIGSEGV), the exit handler might not fire correctly.
4. Most likely culprit: The Copilot CLI uses native prebuilds (`keytar.node`, `pty.node`). A crash in native code (segfault) would explain:
   - Abrupt stop of events after turn 14 started
   - Zombie state (parent didn't get proper exit notification)
   - No error logged
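
For anyone picking this up, here is a rough sketch of what a more defensive handler could look like. It is not the SDK's actual code. Node passes both `code` and `signal` to `exit` and `close` listeners, and emits a separate `error` event when a child cannot be spawned or killed, so the idea is to listen on all three and report once. `ChildLike`, `Termination`, `classifyExit`, and `watchChild` are hypothetical names made up for illustration.

```ts
// Minimal structural type so the sketch does not depend on the SDK;
// Node's ChildProcess satisfies it.
interface ChildLike {
  once(event: string, listener: (...args: any[]) => void): unknown;
}

type Termination =
  | { kind: "clean" }
  | { kind: "error-exit"; code: number | null }
  | { kind: "signal"; signal: string }
  | { kind: "spawn-or-ipc-error"; message: string };

// Node passes (code, signal) to "exit" and "close" listeners; one of the two
// is always non-null.
function classifyExit(code: number | null, signal: string | null): Termination {
  if (signal !== null) return { kind: "signal", signal }; // e.g. SIGSEGV, SIGKILL
  if (code === 0) return { kind: "clean" };
  return { kind: "error-exit", code };
}

// Hypothetical helper: calls onGone at most once, whichever event arrives first.
function watchChild(child: ChildLike, onGone: (t: Termination) => void): void {
  let notified = false;
  const notifyOnce = (t: Termination): void => {
    if (notified) return;
    notified = true;
    onGone(t);
  };

  // "error" covers failures to spawn or kill; "exit" may never follow it.
  child.once("error", (err: Error) =>
    notifyOnce({ kind: "spawn-or-ipc-error", message: err.message }),
  );
  // "exit" carries the signal name when the child was killed by a signal.
  child.once("exit", (code: number | null, signal: string | null) =>
    notifyOnce(classifyExit(code, signal)),
  );
  // "close" fires after the stdio streams have closed; kept as a fallback.
  child.once("close", (code: number | null, signal: string | null) =>
    notifyOnce(classifyExit(code, signal)),
  );
}
```

In the handler quoted above, the `onGone` callback would be the place to log the `Termination` and then call `reconnect()`, so a signal-induced death at least leaves a trace in the logs. This does not by itself explain or fix the zombies; it only makes the failure mode visible.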
   
