
[FIX] multi-GPU quantization OOM by canonicalizing get_supported_kwargs cache keys#2815

Merged
Qubitium merged 1 commit into main from zx_fix_oom on Apr 23, 2026

Conversation

ZX-ModelCloud (Collaborator) commented Apr 23, 2026

Summary

During multi-GPU quantization, the parallel replay path inspects module.forward on per-device replicas.

When get_supported_kwargs caches the bound method object itself, the cache key keeps a strong reference (via __self__) to the owning module replica. That reference prevents the replicas and their device tensors from being reclaimed after replay, so VRAM usage grows across layers/subsets and eventually triggers an OOM.

Single-GPU runs are mostly unaffected because they usually stay on the serial path and do not repeatedly inspect per-replica bound forwards.

What Changed

  • canonicalize get_supported_kwargs cache keys to avoid retaining bound callables
  • cache Python bound methods by __func__ instead of the bound method object
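
The leak and the fix can be sketched as follows. This is a minimal, hypothetical illustration (the `Replica` class and cache names are invented for this sketch, not GPTQModel code): a dict keyed by a bound method keeps its module alive, while keying by the method's underlying `__func__` does not.

```python
import gc
import weakref


class Replica:
    """Stand-in for a per-device module replica holding large tensors."""

    def forward(self, x, mask=None):
        return x


# Leaky cache: the bound method used as the key holds a strong
# reference to the Replica instance through forward.__self__.
leaky_cache = {}

replica = Replica()
ref = weakref.ref(replica)
leaky_cache[replica.forward] = ("x", "mask")  # key retains `replica`

del replica
gc.collect()
print(ref() is not None)  # True: the cache key keeps the replica alive

# Canonicalized cache: key Python bound methods by the underlying
# function object (__func__), which is shared by every instance of
# the class and holds no reference to any particular replica.
canonical_cache = {}

replica2 = Replica()
ref2 = weakref.ref(replica2)
key = getattr(replica2.forward, "__func__", replica2.forward)
canonical_cache[key] = ("x", "mask")  # key is Replica.forward itself

del replica2
gc.collect()
print(ref2() is None)  # True: the replica can now be reclaimed
```

The `getattr(..., "__func__", ...)` fallback keeps the same code path working for plain functions and other callables that have no `__func__` attribute.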

fix #2805
fix #2810

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Qubitium merged commit f4827b6 into main on Apr 23, 2026
6 checks passed
Qubitium deleted the zx_fix_oom branch on April 23, 2026 at 09:29


Development

Successfully merging this pull request may close these issues.

  • Multi-GPU Quantization Always OOM
  • During the gptq quantization process, the gpu memory usage increases until the oom?
