
[FIX] multi-GPU quantization OOM by canonicalizing get_supported_kwargs cache keys#2815

Merged
Qubitium merged 1 commit into main from zx_fix_oom on Apr 23, 2026

Conversation

ZX-ModelCloud (Collaborator) commented Apr 23, 2026

Summary

During multi-GPU quantization, the parallel replay path inspects module.forward on per-device replicas.

When get_supported_kwargs caches the bound method object itself, the cache key keeps a strong reference (via __self__) to the owning module replica. That reference prevents the replicas and their device tensors from being reclaimed after replay, so VRAM usage grows across layers/subsets and eventually triggers an OOM.

Single-GPU runs are mostly unaffected because they usually stay on the serial path and do not repeatedly inspect per-replica bound forwards.

What Changed

  • canonicalize get_supported_kwargs cache keys to avoid retaining bound callables
  • cache Python bound methods by __func__ instead of the bound method object
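
The leak and the fix can be sketched as follows. This is a minimal, hypothetical illustration (the `Replica` class and cache names are invented for this sketch, not GPTQModel code): a dict keyed by a bound method keeps its module alive, while keying by the method's underlying `__func__` does not.

```python
import gc
import weakref


class Replica:
    """Stand-in for a per-device module replica holding large tensors."""

    def forward(self, x, mask=None):
        return x


# Leaky cache: the bound method used as the key holds a strong
# reference to the Replica instance through forward.__self__.
leaky_cache = {}

replica = Replica()
ref = weakref.ref(replica)
leaky_cache[replica.forward] = ("x", "mask")  # key retains `replica`

del replica
gc.collect()
print(ref() is not None)  # True: the cache key keeps the replica alive

# Canonicalized cache: key Python bound methods by the underlying
# function object (__func__), which is shared by every instance of
# the class and holds no reference to any particular replica.
canonical_cache = {}

replica2 = Replica()
ref2 = weakref.ref(replica2)
key = getattr(replica2.forward, "__func__", replica2.forward)
canonical_cache[key] = ("x", "mask")  # key is Replica.forward itself

del replica2
gc.collect()
print(ref2() is None)  # True: the replica can now be reclaimed
```

The `getattr(..., "__func__", ...)` fallback keeps the same code path working for plain functions and other callables that have no `__func__` attribute.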

fix #2805
fix #2810

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Qubitium merged commit f4827b6 into main on Apr 23, 2026
6 checks passed
Qubitium deleted the zx_fix_oom branch on April 23, 2026 at 09:29


Development

Successfully merging this pull request may close these issues.

  • Multi-GPU Quantization Always OOM
  • During the gptq quantization process, the gpu memory usage increases until the oom?
