shim: handle connection-closed errors during kill after live migration #2673

Closed
shreyanshjain7174 wants to merge 10 commits into microsoft:main from shreyanshjain7174:fix/lm-ttrpc-closed

Conversation

@shreyanshjain7174 shreyanshjain7174 commented Apr 10, 2026

Closing: the fix was in the wrong layer (the V1 shim instead of the V2 taskserver). The correct fix is in rawahars#14, against the live_migration_poc_4 branch.

@shreyanshjain7174 shreyanshjain7174 requested a review from a team as a code owner April 10, 2026 07:06
…wn race

After HCS live migration completes, FinalizeSourceLM calls
FinalizeSandbox(STOP) which calls LMKill to finalize the HCS system.
This causes the VM to exit, which the waitContainer goroutines detect
via c.Wait(). Those goroutines race to shut down the shim before
containerd can call Kill/Delete via the task ttrpc service, resulting
in 'ttrpc: closed' errors that surface as StopSourceVMFailure.

Fix: Cancel the waitContainer context before calling LMKill, following
the same pattern already used in TransferSandbox. This prevents the
goroutines from racing to shut down the shim. Also keep s.sandbox
alive (instead of setting it to nil) so that the subsequent Kill and
Delete calls from containerd succeed: Kill calls Terminate, which
returns nil for an already-stopped system, and Delete returns the
cached exit state.
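
The ordering described above can be illustrated with a minimal, self-contained sketch. It is not the hcsshim code: `fakeVM`, `fakeShim`, `waitContainer` (as written here), and `runScenario` are hypothetical stand-ins, and the real watcher waits on `c.Wait()` rather than on a channel. The sketch only shows why cancelling the watcher's context before the kill keeps the shim alive for later Kill/Delete calls.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// fakeVM is a hypothetical stand-in for the HCS system.
// Closing exited simulates the VM exiting after a kill.
type fakeVM struct {
	exited chan struct{}
	once   sync.Once
}

func newFakeVM() *fakeVM { return &fakeVM{exited: make(chan struct{})} }

// Kill is idempotent: a second call on a stopped VM is a no-op.
func (v *fakeVM) Kill() { v.once.Do(func() { close(v.exited) }) }

// fakeShim is a hypothetical stand-in for the shim's task service.
type fakeShim struct {
	mu   sync.Mutex
	down bool
}

func (s *fakeShim) Shutdown() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.down = true
}

func (s *fakeShim) IsDown() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.down
}

// waitContainer mimics the watcher goroutine: it shuts the shim down
// when the VM exits, unless its context is cancelled first.
func waitContainer(ctx context.Context, vm *fakeVM, s *fakeShim, done chan<- struct{}) {
	defer close(done)
	select {
	case <-ctx.Done():
		// Watcher was told to stand down; leave the shim running.
	case <-vm.exited:
		s.Shutdown()
	}
}

// runScenario reports whether the shim was shut down by the watcher.
// With cancelFirst set, the watcher is cancelled and drained before the kill.
func runScenario(cancelFirst bool) bool {
	vm, s := newFakeVM(), &fakeShim{}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	done := make(chan struct{})
	go waitContainer(ctx, vm, s, done)

	if cancelFirst {
		cancel()
		<-done // make sure the watcher has returned before killing
	}
	vm.Kill()
	<-done
	return s.IsDown()
}

func main() {
	fmt.Println("cancel before kill, shim shut down:", runScenario(true))
	fmt.Println("kill without cancel, shim shut down:", runScenario(false))
}
```

Waiting for the watcher to return after `cancel()` is a choice made for this sketch so the outcome is deterministic; whether the actual change waits for the goroutines to exit is not stated in this description.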

Fixes: AB#61773098
Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>

3 participants