[Perf] Zerocopy interop (migrated from Genesis repo) #450
hughperkins wants to merge 17 commits into main from
Conversation
Add a `copy` parameter (None/True/False) to all `to_torch()` and `to_numpy()` methods on Fields and Ndarrays. When `copy` is None (the default), the methods prefer zero-copy DLPack export with a cached torch.Tensor view, falling back to the existing kernel-based copy path when zero-copy is unavailable.

- New module `_interop.py` centralises DLPack → torch conversion with per-object caching and Metal sync handling.
- `to_pytorch_type` / `to_numpy_type` now use `lru_cache` to avoid repeated dict lookups on dtype conversion.
- `Ndarray.to_torch()` added (previously ndarrays had no to_torch).
- `Ndarray._reset()` clears the DLPack cache.
- `StructField.to_torch`/`to_numpy` pass `copy` through to children.
Covers ScalarField, MatrixField (vector + matrix), all three Ndarray types, StructField pass-through, caching behaviour, copy=True/False semantics, memory aliasing, and error paths (device mismatch, dtype conversion with copy=False).
- Reformat inline imports to multi-line style (ruff I001)
- Add `pylint disable=C0415` for lazy torch imports in `_interop.py`
- Use `torch.from_dlpack()` instead of `torch.utils.dlpack.from_dlpack()` to fix pyright reportAttributeAccessIssue
6e86a40 to a7cd921
…test

- Add dtype parameter to `can_zerocopy()`; only i32/i64/f32/f64/u1 are supported by the C++ DLPack implementation -- f16, u8, u16 etc. now correctly fall back to the kernel copy path
- Pass dtype through all `can_zerocopy()` call sites (field.py, matrix.py)
- Add `can_zerocopy()` gate to `Ndarray.to_torch()` with kernel copy fallback for unsupported dtypes/backends (was previously unconditional DLPack)
- Force copy=True for StructField children since struct SNodes use interleaved memory that DLPack cannot represent correctly
- Relax test_fused_kernels assertion: `to_torch()` may compile 0 kernels (zero-copy) or 1 kernel (copy path)
a7cd921 to c0b3bdc
Several docstrings were incorrectly wrapped at 80 chars. Restored original formatting and consolidated copy parameter docs to single lines.
Restore original backtick styles, pylint/TODO comments, and docstring wrapping that were unnecessarily changed alongside the zero-copy work.
```python
Provides cached, zero-copy conversion of Fields and Ndarrays to PyTorch tensors using the DLPack protocol.
The returned tensors directly alias the Quadrants device memory -- modifications to either side are visible
in the other (after ``qd.sync()`` on Apple Metal).
```
"(after `qd.sync()` on Apple Metal)" is inaccurate: `torch.mps.synchronize()` is also required in the copy case.
```python
@functools.lru_cache(maxsize=1)
```
LRU cache is slow. It should be profiled, but I would not be surprised if systematic recomputation is faster. I would recommend defining a global constant instead of a function.
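The suggestion above can be sketched as follows. The whitelist contents mirror the dtype list from the commit message (i32/i64/f32/f64/u1); the names `_DLPACK_SUPPORTED` and `can_zerocopy_dtype` are hypothetical, not the PR's actual identifiers:

```python
# Module-level constant computed once at import time, replacing an
# lru_cache-wrapped function. A set membership test avoids the lru_cache
# wrapper's overhead (argument hashing plus a dict lookup per call).
_DLPACK_SUPPORTED = frozenset({"i32", "i64", "f32", "f64", "u1"})


def can_zerocopy_dtype(dtype_name: str) -> bool:
    # Plain membership check; nothing to cache because the set never changes.
    return dtype_name in _DLPACK_SUPPORTED
```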
```python
    Args:
        is_field: ``True`` for SNode-backed Fields, ``False`` for Ndarrays.
        dtype: The Quadrants dtype (e.g. ``qd.f32``). Types not in the C++ DLPack whitelist (f16, u8, …) return False.
```
```python
    parts = torch.__version__.replace("+", ".").split(".")[:3]
    return tuple(map(int, parts)) > (2, 9, 1)
```
If you are already relying on the `packaging` package somewhere else, using it here would be more robust. If not, just keep this.
```python
def dlpack_to_torch(obj):
    """Return a cached zero-copy ``torch.Tensor`` view of a Field or Ndarray via DLPack.

    On the first call the DLPack capsule is created and wrapped into a ``torch.Tensor`` which is then stored on the
    source object as ``_qd_dlpack_tc``. Subsequent calls return the cached tensor directly (O(1)).

    On Apple Metal an explicit ``sync()`` is performed after the initial export to ensure data visibility.
    """
    try:
        return obj._qd_dlpack_tc
    except AttributeError:
        import torch  # pylint: disable=C0415

        tc = torch.from_dlpack(obj.to_dlpack())
        obj._qd_dlpack_tc = tc
        if impl.current_cfg().arch == _ARCH_METAL:
            impl.get_runtime().sync()
        return tc


def invalidate_zerocopy_cache(obj) -> None:
    """Remove the cached DLPack torch tensor, if any."""
    try:
        del obj._qd_dlpack_tc
    except AttributeError:
        pass
```
This caching mechanism is missing the transpose variant. This is important for efficiency.

This caching mechanism is missing the numpy view. This is important for efficiency.
NumPy >= 1.22 (Genesis requires >= 1.26) supports `np.from_dlpack()` natively. Using it would skip importing torch and the torch-tensor intermediary for numpy-only workloads. Rather than caching a `.numpy()` view derived from a torch tensor, an `np.from_dlpack()` view could be cached independently. It also avoids the torch import when the user never uses torch at all.
```python
from quadrants.lang._interop import (  # pylint: disable=C0415
    can_zerocopy,
    dlpack_to_torch,
)
```
I don't see any good reason to keep this import local. No obvious circular deps or direct dependency on torch. Maybe I'm missing something.
```python
    Uses zero-copy when possible (all backends except Vulkan, and when the dtype is supported by DLPack).

    Args:
        copy: ``None``/``False`` return a zero-copy view, ``True`` returns an independent copy.
```
That is not what None does.
```python
if copy is False:
    raise ValueError("Zero-copy not available for this backend/dtype combination")

import torch  # pylint: disable=C0415
```
Personally I prefer to move local imports to the top of their local scope.
```python
try:
    tc = dlpack_to_torch(self)
    if tc.device.type == "cpu":
        return tc.numpy()
```
This is very inefficient. Not acceptable.
```python
    if tc.device.type == "cpu":
        return tc.numpy()
except (ImportError, RuntimeError):
    pass
```
I'm not in favour of using try/except if the condition can be checked. What about storing a flag directly at the instance level to keep track of whether zero-copy is supported by this tensor?
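A minimal sketch of that flag-based approach, assuming the capability check is deterministic for a given instance; all names here are hypothetical stand-ins for the PR's actual classes:

```python
class NdarrayLike:
    """Cache zero-copy support as an instance flag instead of relying on
    try/except around the DLPack export path."""

    def __init__(self, dtype_name):
        self.dtype_name = dtype_name
        self._zerocopy_ok = None  # None means "not probed yet"

    def _probe_zerocopy(self):
        # Stand-in for the real backend + dtype whitelist check.
        return self.dtype_name in {"i32", "i64", "f32", "f64", "u1"}

    def supports_zerocopy(self):
        if self._zerocopy_ok is None:  # probe exactly once, then reuse the flag
            self._zerocopy_ok = self._probe_zerocopy()
        return self._zerocopy_ok
```

Control flow then branches on the flag rather than catching exceptions, which also makes the unsupported path explicit instead of silently falling through.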
```python
if copy is False:
    raise ValueError("Zero-copy to numpy requires a CPU backend")
elif copy is False:
    raise ValueError("Zero-copy not available for this backend/type combination")
```
It would be nice to be more specific.
```python
as_vector = self.m == 1 and not keep_dims
shape_ext = (self.n,) if as_vector else (self.n, self.m)
expected = self.shape + shape_ext
```
```python
    dlpack_to_torch,
)

if copy is not True and can_zerocopy(is_field=True, dtype=self.dtype, shape=self.shape):
```
This condition is wrong. You should prefer zero-copy even if a copy is requested; this is orthogonal. Zero-copy is always faster, even if you need to copy in the end.

You should use zerocopy systematically, and call `torch.Tensor.numpy` on the CPU backend.
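The control flow these comments ask for might look like the following sketch. `to_torch_sketch`, `FakeTensor`, and the injected callables are stand-ins for illustration, not the PR's actual API:

```python
# FakeTensor stands in for torch.Tensor; only clone() is needed here.
class FakeTensor:
    def clone(self):
        return FakeTensor()


def to_torch_sketch(obj, copy=None, *, can_zerocopy, dlpack_to_torch, kernel_copy):
    if can_zerocopy(obj):
        tc = dlpack_to_torch(obj)  # zero-copy view, always taken first
        # Clone afterwards only when the caller asked for an independent copy:
        # the copy request is orthogonal to how the data is fetched.
        return tc.clone() if copy is True else tc
    if copy is False:
        raise ValueError("Zero-copy not available for this backend/dtype")
    return kernel_copy(obj)  # kernel-based fallback copy path
```

Compared with the diff's `if copy is not True and can_zerocopy(...)` guard, `copy=True` here still benefits from the fast DLPack export and only pays for the final `clone()`.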
```python
if copy is not True and can_zerocopy(is_field=False, dtype=self.dtype):
    tc = dlpack_to_torch(self)
    return tc.clone() if copy is True else tc
```
`if copy is True` is a dead branch. But it should not be. You should just not guard with `copy is not True`.
```python
    dlpack_to_torch,
)

if copy is not True and can_zerocopy(is_field=False, dtype=self.dtype):
```
`can_zerocopy(is_field=False, dtype=self.dtype)` should be stored as a constant attribute of the instance for efficiency.
```python
    return tc


def invalidate_zerocopy_cache(obj) -> None:
```
What about cache invalidation for ScalarField and MatrixField, which have no corresponding `_reset` hook?
```python
        Dict[str, Union[numpy.ndarray, Dict]]: The result NumPy array.
    """
    return {k: v.to_numpy() for k, v in self._items}

# Struct children have interleaved SNode memory; DLPack cannot represent this, so always use kernel copy.
```
Struct members are separate SNode leaves, each with contiguous storage; individual members could be zero-copied via DLPack, from what I understand.
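If struct members really are contiguous SNode leaves, per-member zero-copy could be sketched like this. All names are hypothetical, and whether each leaf is individually DLPack-exportable is an assumption, not something the PR confirms:

```python
def struct_to_torch(items, copy=None):
    """Convert each struct member independently, passing `copy` through to the
    member's own zero-copy/fallback logic instead of forcing copy=True for the
    whole struct."""
    return {name: member.to_torch(copy=copy) for name, member in items}


# Demo member standing in for a scalar leaf with its own to_torch().
class FakeMember:
    def to_torch(self, copy=None):
        return ("tensor", copy)
```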
```python
tc = dlpack_to_torch(self)
return tc.clone() if copy is True else tc
```
You are missing a call to `torch.mps.synchronize()` before `clone` on the Metal backend.
```python
raise ValueError(
    f"copy=False is incompatible with device transfer (data on {tc.device}, requested {device})"
)
return tc.to(device)
```
Missing a call to `torch.mps.synchronize()` on the Metal backend.
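The required ordering can be sketched backend-agnostically. `safe_transfer` is a hypothetical helper, and the stubbed `synchronize` callable stands in for `torch.mps.synchronize()`:

```python
def safe_transfer(tensor, device, *, is_metal, synchronize):
    # On Metal, flush pending kernel writes before the host-visible read
    # performed by .to(); otherwise the transfer can observe stale data in
    # the aliased buffer. The same ordering applies before clone().
    if is_metal:
        synchronize()  # e.g. torch.mps.synchronize()
    return tensor.to(device)
```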
- Use `torch.utils.dlpack.from_dlpack` to satisfy pyright's reportPrivateImportUsage
- Convert quadrants DataTypeCxx to numpy dtype before `astype()` to fix reportArgumentType

Made-with: Cursor
When torch is not installed, ScalarField.to_numpy() and MatrixField.to_numpy() were raising ModuleNotFoundError because the new zero-copy DLPack shortcut tries to import torch unconditionally. Wrap the dlpack_to_torch() call in a try/except ImportError (mirroring the existing handling in Ndarray._ndarray_to_numpy) so we silently fall back to the original kernel-copy path. This fixes test_random.py on the Vulkan CI runner, which has no torch installed. Made-with: Cursor
The new zero-copy to_torch method was added to MatrixNdarray, Ndarray, ScalarNdarray, and VectorNdarray, but the user_api expected lists in test_api.py weren't updated, causing test_api[*-Ndarray] failures. Made-with: Cursor
On Apple Metal, to_torch() now returns an mps:0 tensor via DLPack zero-copy. Three tests assumed the result was always on CPU and failed when mixed with CPU tensors:

- test_zerocopy_interop._to_cpu: only moved to CPU on v520 amdgpu; now also moves whenever the tensor is not already on CPU (fixes test_matrix_field_to_torch_matches_copy).
- test_torch_io.test_io_simple: compare to_torch() result on CPU.
- test_torch_ad.test_torch_ad: ensure forward output and backward gradient match the input device (PyTorch autograd contract).

Made-with: Cursor
The cached zero-copy MPS tensor view aliases Quadrants device memory, but on Apple Metal the MPS backend does not see pending Quadrants kernel writes until qd.sync() is called. Previously sync() ran only on the first export, so subsequent kernel writes (e.g. test_torch_ad's repeated forward/backward inside torch.autograd.Function) returned stale data when read back via .to(). Fixes test_torch_ad[arch=metal-2] which was reading zero gradients on iterations after the first. Made-with: Cursor
On GPU backends a DLPack-wrapped torch tensor cannot yield a zero-copy numpy view anyway, and caching the tensor on the field outlives ``qd.reset()`` -- the underlying device memory is freed by the reset, so the cached tensor's deleter later dereferences a dangling pointer and segfaults (observed in test_random_seed_per_program on AMD GPU). Made-with: Cursor
… requested Zero-copy numpy arrays returned via DLPack alias the underlying C++ runtime memory, which is freed by qd.reset(). This causes a use-after-free when callers naturally expect numpy arrays to outlive runtime state. The previous "default to zero-copy when possible" semantic for to_numpy() broke test_random_seed_per_program on AMD GPU (the test re-inits to the default x64 backend mid-test, then calls to_numpy() then qd.reset(), then np.allclose() segfaults reading freed memory). Restore the safe default: to_numpy() with copy=None or copy=True always returns an independent kernel-copied numpy array. Zero-copy is opt-in via copy=False, with the caller responsible for buffer lifetime. to_torch() is unchanged -- torch tensors are routinely shared and the caching/alias behaviour is well understood there. Made-with: Cursor
The new zero-copy to_torch() path returns a tensor backed by the field's device memory. On the AMD GPU CI runner, follow-up host-side operations (indexing, ==, .all()) trigger HIP kernel launches that fail with hipErrorInvalidDeviceFunction. Mirror the fix already applied to test_io_simple by calling .cpu() on the to_torch() result before performing host-side comparisons. Made-with: Cursor
@duburcqa conversation with Opus about this AI. Thoughts?

User: this code was migrated from Genesis, by an AI. Can you compare this code with what is currently in Genesis, what is currently in quadrants main, and figure out why the code isn't simply automatically correct, from being ported from Genesis?

Assistant: Why it isn't automatically correct

Three structural reasons, in order of importance:

1. The original lives in the consumer, not the library.
2. The AI re-decided design choices instead of transcribing them. Genesis's design has clear invariants that are easy to miss when refactoring: "always try zerocopy first, then clone if requested" (not "guard zerocopy behind copy preference"); "sync MPS twice — once before reading, once after cloning"; "cache transpose and numpy variants too". The AI re-derived the algorithm from intent and got the boundaries wrong in
3. The Quadrants API is a different shape than the Genesis wrapper. Genesis exposes one entry point, so it gets one cache, one set of conditions, one sync sequence. Quadrants has four to six entry points (

A faithful port would have:
The current PR is roughly an interpretation of Genesis rather than a port, and most of duburcqa's review is asking for the original Genesis behaviour back.

User: it sounds like you are recommending we do the actions under 'a faithful port', is that a fair impression?

Assistant: Yes, that's a fair impression — but with one caveat I should make explicit. I was describing what a faithful port would have looked like, and that aligns closely with what duburcqa is asking for in the review. So if you accept the reviewer's framing, then yes, "a faithful port" is roughly the recommended path.

The caveat: there's one place where Genesis itself may not be the right target. The use-after-free risk on

Two other things worth flagging before you commit to that direction:
So my honest recommendation is: yes, do the faithful-port refactor as the structural backbone (one central helper, full cache shape, both syncs, correct
Add a "Zero-copy interop via DLPack" section to the interop user guide, covering the copy=None/True/False semantics, support matrix (backends, dtypes, Metal/torch version), caching and invalidation rules, Apple Metal double-sync behaviour, lifetime caveats around qd.reset() / qd.init(), and StructField behaviour. Update the intro and summary table to advertise zero-copy as a first-class mechanism alongside copy-based interop and direct kernel pass-through.
Issue: #