zombie processes & autoRestart failures #71

@lukeed

We've seen some long(er)-running sessions (~20m) that fail unexpectedly and become fully unresponsive, despite having autoRestart enabled.

I was able to get one of these to happen in a Docker container, so I had Opus dig around and leave some notes. Not sure how helpful this will be to y'all but leaving it here just in case:

FWIW: I think it's correct. Node's .on("exit") is generally for healthy / intentional exits.


Root Cause: The Copilot CLI subprocess (PID 181, MainThread) crashed and became a zombie process. The SDK's autoRestart feature should have reconnected, but there are issues:

  1. Zombie processes: The container has 15+ zombie processes (bash, git, MainThread), all children of the bun process (PID 13). This indicates the parent runtime isn't properly reaping its child processes.

  2. Timeline of failure:

    • Last logged event: session.truncation at 18:21:20.501Z
    • Followed by assistant.turn_start for turn 14
    • Then silence — no more events logged
    • The VM restarted twice (seen in log with duplicate NODE_ENV production lines)
    • But the old zombie processes remain from before the restarts
  3. Why autoRestart didn't work: The SDK's reconnect logic fires on the exit event:

    this.cliProcess.on("exit", (code) => {
      if (this.options.autoRestart && this.state === "connected") {
        void this.reconnect();
      }
    });

    But if the stdio pipes get corrupted or the process crashes hard (SIGSEGV), the exit handler might not fire correctly.

  4. Most likely culprit: The Copilot CLI uses native prebuilds (keytar.node, pty.node). A crash in native code (segfault) would explain:

    • Abrupt stop of events after turn 14 started
    • Zombie state (parent didn't get proper exit notification)
    • No error logged
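
For anyone trying to reproduce this, here is a minimal sketch of a more defensive watcher, written against Node's documented child_process events rather than the SDK's actual internals. Per the Node docs, the exit listener receives both a code and a signal (one of the two is always non-null), and a separate error event fires when the process cannot be spawned or killed. The names ChildEnd, watchChild and isAbnormalEnd are hypothetical and only here for illustration. Whether Bun's child_process implementation behaves identically is an assumption I have not verified.

```typescript
import type { ChildProcess } from "node:child_process";

// Hypothetical summary of how a child ended; not part of any SDK.
export interface ChildEnd {
  code: number | null;            // exit code, or null if a signal ended the process
  signal: NodeJS.Signals | null;  // terminating signal (e.g. "SIGSEGV"), or null
  spawnError?: Error;             // set when the "error" event arrived first
}

// Resolves once with whichever of "error" or "exit" arrives first.
// After "error", the "exit" event may or may not fire, so both are watched.
export function watchChild(child: ChildProcess): Promise<ChildEnd> {
  return new Promise((resolve) => {
    child.once("error", (err) =>
      resolve({ code: null, signal: null, spawnError: err }),
    );
    child.once("exit", (code, signal) => resolve({ code, signal }));
  });
}

// True for anything other than a clean exit with code 0.
export function isAbnormalEnd(end: ChildEnd): boolean {
  if (end.spawnError !== undefined) return true;
  if (end.signal !== null) return true;
  return end.code !== 0;
}
```

Logging the result of watchChild(cliProcess) right before any reconnect attempt would at least show whether the handler ran and, if so, whether a signal such as SIGSEGV was reported. If nothing is logged at all while the child shows up as a zombie, that points at the parent never reaping it rather than at the restart condition.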
