[Perf] Zerocopy interop (migrated from Genesis repo) #450
hughperkins wants to merge 17 commits into main from
Conversation
Add a `copy` parameter (None/True/False) to all `to_torch()` and `to_numpy()` methods on Fields and Ndarrays. When `copy` is None (the default), the methods prefer zero-copy DLPack export with a cached torch.Tensor view, falling back to the existing kernel-based copy path when zero-copy is unavailable.

- New module `_interop.py` centralises DLPack → torch conversion with per-object caching and Metal sync handling.
- `to_pytorch_type` / `to_numpy_type` now use `lru_cache` to avoid repeated dict lookups on dtype conversion.
- `Ndarray.to_torch()` added (previously ndarrays had no to_torch).
- `Ndarray._reset()` clears the DLPack cache.
- `StructField.to_torch`/`to_numpy` pass `copy` through to children.
Covers ScalarField, MatrixField (vector + matrix), all three Ndarray types, StructField pass-through, caching behaviour, copy=True/False semantics, memory aliasing, and error paths (device mismatch, dtype conversion with copy=False).
- Reformat inline imports to multi-line style (ruff I001)
- Add `pylint disable=C0415` for lazy torch imports in `_interop.py`
- Use `torch.from_dlpack()` instead of `torch.utils.dlpack.from_dlpack()` to fix pyright reportAttributeAccessIssue
6e86a40 to a7cd921
…test

- Add dtype parameter to `can_zerocopy()`; only i32/i64/f32/f64/u1 are supported by the C++ DLPack implementation -- f16, u8, u16 etc. now correctly fall back to the kernel copy path
- Pass dtype through all `can_zerocopy()` call sites (field.py, matrix.py)
- Add `can_zerocopy()` gate to `Ndarray.to_torch()` with kernel copy fallback for unsupported dtypes/backends (was previously unconditional DLPack)
- Force copy=True for StructField children since struct SNodes use interleaved memory that DLPack cannot represent correctly
- Relax test_fused_kernels assertion: `to_torch()` may compile 0 kernels (zero-copy) or 1 kernel (copy path)
a7cd921 to c0b3bdc
Several docstrings were incorrectly wrapped at 80 chars. Restored original formatting and consolidated copy parameter docs to single lines.
Restore original backtick styles, pylint/TODO comments, and docstring wrapping that were unnecessarily changed alongside the zero-copy work.
```python
Provides cached, zero-copy conversion of Fields and Ndarrays to PyTorch tensors using the DLPack protocol.
The returned tensors directly alias the Quadrants device memory -- modifications to either side are visible
in the other (after ``qd.sync()`` on Apple Metal).
```
"(after `qd.sync()` on Apple Metal)" is inaccurate: `torch.mps.synchronize()` is also required in the copy case.
```python
@functools.lru_cache(maxsize=1)
```
LRU cache is slow. It should be profiled, but I would not be surprised if systematic recomputation is faster. I would recommend defining a global constant instead of a function.
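The suggestion above can be sketched as follows. The whitelist contents mirror the dtype list from the commit message (i32/i64/f32/f64/u1); the names `_DLPACK_SUPPORTED` and `can_zerocopy_dtype` are hypothetical, not the PR's actual identifiers:

```python
# Module-level constant computed once at import time, replacing an
# lru_cache-wrapped function. A set membership test avoids the lru_cache
# wrapper's overhead (argument hashing plus a dict lookup per call).
_DLPACK_SUPPORTED = frozenset({"i32", "i64", "f32", "f64", "u1"})


def can_zerocopy_dtype(dtype_name: str) -> bool:
    # Plain membership check; nothing to cache because the set never changes.
    return dtype_name in _DLPACK_SUPPORTED
```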
```python
    Args:
        is_field: ``True`` for SNode-backed Fields, ``False`` for Ndarrays.
        dtype: The Quadrants dtype (e.g. ``qd.f32``). Types not in the C++ DLPack whitelist (f16, u8, …) return False.
```
```python
    parts = torch.__version__.replace("+", ".").split(".")[:3]
    return tuple(map(int, parts)) > (2, 9, 1)
```
If you are already relying on the `packaging` package somewhere else, using it here would be more robust. If not, just keep this.
```python
def dlpack_to_torch(obj):
    """Return a cached zero-copy ``torch.Tensor`` view of a Field or Ndarray via DLPack.

    On the first call the DLPack capsule is created and wrapped into a ``torch.Tensor`` which is then stored on the
    source object as ``_qd_dlpack_tc``. Subsequent calls return the cached tensor directly (O(1)).

    On Apple Metal an explicit ``sync()`` is performed after the initial export to ensure data visibility.
    """
    try:
        return obj._qd_dlpack_tc
    except AttributeError:
        import torch  # pylint: disable=C0415

        tc = torch.from_dlpack(obj.to_dlpack())
        obj._qd_dlpack_tc = tc
        if impl.current_cfg().arch == _ARCH_METAL:
            impl.get_runtime().sync()
        return tc


def invalidate_zerocopy_cache(obj) -> None:
    """Remove the cached DLPack torch tensor, if any."""
    try:
        del obj._qd_dlpack_tc
    except AttributeError:
        pass
```
This caching mechanism is missing the transpose variant. This is important for efficiency.

This caching mechanism is missing the numpy view. This is important for efficiency.
NumPy >= 1.22 (Genesis requires >= 1.26) supports `np.from_dlpack()` natively. Using it would skip importing torch and the torch-tensor intermediary for numpy-only workloads. Rather than caching a `.numpy()` view derived from a torch tensor, an `np.from_dlpack()` view could be cached independently. It also avoids the torch import when the user never uses torch at all.
```python
from quadrants.lang._interop import (  # pylint: disable=C0415
    can_zerocopy,
    dlpack_to_torch,
)
```
I don't see any good reason to keep this import local. No obvious circular deps or direct dependency on torch. Maybe I'm missing something.
```python
    Uses zero-copy when possible (all backends except Vulkan, and when the dtype is supported by DLPack).

    Args:
        copy: ``None``/``False`` return a zero-copy view, ``True`` returns an independent copy.
```
That is not what None does.
```python
if copy is False:
    raise ValueError("Zero-copy not available for this backend/dtype combination")

import torch  # pylint: disable=C0415
```
Personally I prefer to move local imports to the top of their local scope.
```python
try:
    tc = dlpack_to_torch(self)
    if tc.device.type == "cpu":
        return tc.numpy()
```
This is very inefficient. Not acceptable.
```python
    if tc.device.type == "cpu":
        return tc.numpy()
except (ImportError, RuntimeError):
    pass
```
I'm not in favour of using try/except if the condition can be checked. What about storing a flag directly at the instance level to keep track of whether zero-copy is supported by this tensor?
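A minimal sketch of that flag-based approach, assuming the capability check is deterministic for a given instance; all names here are hypothetical stand-ins for the PR's actual classes:

```python
class NdarrayLike:
    """Cache zero-copy support as an instance flag instead of relying on
    try/except around the DLPack export path."""

    def __init__(self, dtype_name):
        self.dtype_name = dtype_name
        self._zerocopy_ok = None  # None means "not probed yet"

    def _probe_zerocopy(self):
        # Stand-in for the real backend + dtype whitelist check.
        return self.dtype_name in {"i32", "i64", "f32", "f64", "u1"}

    def supports_zerocopy(self):
        if self._zerocopy_ok is None:  # probe exactly once, then reuse the flag
            self._zerocopy_ok = self._probe_zerocopy()
        return self._zerocopy_ok
```

Control flow then branches on the flag rather than catching exceptions, which also makes the unsupported path explicit instead of silently falling through.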
```python
if copy is False:
    raise ValueError("Zero-copy to numpy requires a CPU backend")
elif copy is False:
    raise ValueError("Zero-copy not available for this backend/type combination")
```
It would be nice to be more specific.
```python
as_vector = self.m == 1 and not keep_dims
shape_ext = (self.n,) if as_vector else (self.n, self.m)
expected = self.shape + shape_ext
```
```python
    dlpack_to_torch,
)

if copy is not True and can_zerocopy(is_field=True, dtype=self.dtype, shape=self.shape):
```
This condition is wrong. You should prefer zero-copy even if a copy is requested; this is orthogonal. Zero-copy is always faster, even if you need to copy in the end.

You should use zerocopy systematically, and call `torch.Tensor.numpy` on the CPU backend.
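The control flow these comments ask for might look like the following sketch. `to_torch_sketch`, `FakeTensor`, and the injected callables are stand-ins for illustration, not the PR's actual API:

```python
# FakeTensor stands in for torch.Tensor; only clone() is needed here.
class FakeTensor:
    def clone(self):
        return FakeTensor()


def to_torch_sketch(obj, copy=None, *, can_zerocopy, dlpack_to_torch, kernel_copy):
    if can_zerocopy(obj):
        tc = dlpack_to_torch(obj)  # zero-copy view, always taken first
        # Clone afterwards only when the caller asked for an independent copy:
        # the copy request is orthogonal to how the data is fetched.
        return tc.clone() if copy is True else tc
    if copy is False:
        raise ValueError("Zero-copy not available for this backend/dtype")
    return kernel_copy(obj)  # kernel-based fallback copy path
```

Compared with the diff's `if copy is not True and can_zerocopy(...)` guard, `copy=True` here still benefits from the fast DLPack export and only pays for the final `clone()`.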
```python
if copy is not True and can_zerocopy(is_field=False, dtype=self.dtype):
    tc = dlpack_to_torch(self)
    return tc.clone() if copy is True else tc
```
`if copy is True` is a dead branch. But it should not be. You should just not guard with `copy is not True`.
```python
    dlpack_to_torch,
)

if copy is not True and can_zerocopy(is_field=False, dtype=self.dtype):
```
`can_zerocopy(is_field=False, dtype=self.dtype)` should be stored as a constant attribute of the instance for efficiency.
```python
    return tc


def invalidate_zerocopy_cache(obj) -> None:
```
What about cache invalidation for ScalarField and MatrixField, which have no corresponding `_reset` hook?
```python
        Dict[str, Union[numpy.ndarray, Dict]]: The result NumPy array.
    """
    return {k: v.to_numpy() for k, v in self._items}

# Struct children have interleaved SNode memory; DLPack cannot represent this, so always use kernel copy.
```
Struct members are separate SNode leaves, each with contiguous storage; individual members could be zero-copied via DLPack, from what I understand.
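If struct members really are contiguous SNode leaves, per-member zero-copy could be sketched like this. All names are hypothetical, and whether each leaf is individually DLPack-exportable is an assumption, not something the PR confirms:

```python
def struct_to_torch(items, copy=None):
    """Convert each struct member independently, passing `copy` through to the
    member's own zero-copy/fallback logic instead of forcing copy=True for the
    whole struct."""
    return {name: member.to_torch(copy=copy) for name, member in items}


# Demo member standing in for a scalar leaf with its own to_torch().
class FakeMember:
    def to_torch(self, copy=None):
        return ("tensor", copy)
```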
```python
tc = dlpack_to_torch(self)
return tc.clone() if copy is True else tc
```
You are missing a call to `torch.mps.synchronize()` before `clone` on the Metal backend.
```python
raise ValueError(
    f"copy=False is incompatible with device transfer (data on {tc.device}, requested {device})"
)
return tc.to(device)
```
Missing a call to `torch.mps.synchronize()` on the Metal backend.
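The required ordering can be sketched backend-agnostically. `safe_transfer` is a hypothetical helper, and the stubbed `synchronize` callable stands in for `torch.mps.synchronize()`:

```python
def safe_transfer(tensor, device, *, is_metal, synchronize):
    # On Metal, flush pending kernel writes before the host-visible read
    # performed by .to(); otherwise the transfer can observe stale data in
    # the aliased buffer. The same ordering applies before clone().
    if is_metal:
        synchronize()  # e.g. torch.mps.synchronize()
    return tensor.to(device)
```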
- Use `torch.utils.dlpack.from_dlpack` to satisfy pyright's reportPrivateImportUsage
- Convert quadrants DataTypeCxx to numpy dtype before `astype()` to fix reportArgumentType

Made-with: Cursor
When torch is not installed, ScalarField.to_numpy() and MatrixField.to_numpy() were raising ModuleNotFoundError because the new zero-copy DLPack shortcut tries to import torch unconditionally. Wrap the dlpack_to_torch() call in a try/except ImportError (mirroring the existing handling in Ndarray._ndarray_to_numpy) so we silently fall back to the original kernel-copy path. This fixes test_random.py on the Vulkan CI runner, which has no torch installed. Made-with: Cursor
The new zero-copy to_torch method was added to MatrixNdarray, Ndarray, ScalarNdarray, and VectorNdarray, but the user_api expected lists in test_api.py weren't updated, causing test_api[*-Ndarray] failures. Made-with: Cursor
On Apple Metal, to_torch() now returns an mps:0 tensor via DLPack zero-copy. Three tests assumed the result was always on CPU and failed when mixed with CPU tensors:

- test_zerocopy_interop._to_cpu: only moved to CPU on v520 amdgpu; now also moves whenever the tensor is not already on CPU (fixes test_matrix_field_to_torch_matches_copy).
- test_torch_io.test_io_simple: compare to_torch() result on CPU.
- test_torch_ad.test_torch_ad: ensure forward output and backward gradient match the input device (PyTorch autograd contract).

Made-with: Cursor
The cached zero-copy MPS tensor view aliases Quadrants device memory, but on Apple Metal the MPS backend does not see pending Quadrants kernel writes until qd.sync() is called. Previously sync() ran only on the first export, so subsequent kernel writes (e.g. test_torch_ad's repeated forward/backward inside torch.autograd.Function) returned stale data when read back via .to(). Fixes test_torch_ad[arch=metal-2] which was reading zero gradients on iterations after the first. Made-with: Cursor
On GPU backends a DLPack-wrapped torch tensor cannot yield a zero-copy numpy view anyway, and caching the tensor on the field outlives ``qd.reset()`` -- the underlying device memory is freed by the reset, so the cached tensor's deleter later dereferences a dangling pointer and segfaults (observed in test_random_seed_per_program on AMD GPU). Made-with: Cursor
… requested Zero-copy numpy arrays returned via DLPack alias the underlying C++ runtime memory, which is freed by qd.reset(). This causes a use-after-free when callers naturally expect numpy arrays to outlive runtime state. The previous "default to zero-copy when possible" semantic for to_numpy() broke test_random_seed_per_program on AMD GPU (the test re-inits to the default x64 backend mid-test, then calls to_numpy() then qd.reset(), then np.allclose() segfaults reading freed memory). Restore the safe default: to_numpy() with copy=None or copy=True always returns an independent kernel-copied numpy array. Zero-copy is opt-in via copy=False, with the caller responsible for buffer lifetime. to_torch() is unchanged -- torch tensors are routinely shared and the caching/alias behaviour is well understood there. Made-with: Cursor
The new zero-copy to_torch() path returns a tensor backed by the field's device memory. On the AMD GPU CI runner, follow-up host-side operations (indexing, ==, .all()) trigger HIP kernel launches that fail with hipErrorInvalidDeviceFunction. Mirror the fix already applied to test_io_simple by calling .cpu() on the to_torch() result before performing host-side comparisons. Made-with: Cursor
@duburcqa conversation with Opus about this AI. Thoughts?

User: this code was migrated from Genesis, by an AI. Can you compare this code with what is currently in Genesis, what is currently in quadrants main, and figure out why the code isn't simply automatically correct, from being ported from Genesis?

Assistant: Why it isn't automatically correct

Three structural reasons, in order of importance:

1. The original lives in the consumer, not the library.
2. The AI re-decided design choices instead of transcribing them. Genesis's design has clear invariants that are easy to miss when refactoring: "always try zerocopy first, then clone if requested" (not "guard zerocopy behind copy preference"); "sync MPS twice — once before reading, once after cloning"; "cache transpose and numpy variants too". The AI re-derived the algorithm from intent and got the boundaries wrong in
3. The Quadrants API is a different shape than the Genesis wrapper. Genesis exposes one entry point, so it gets one cache, one set of conditions, one sync sequence. Quadrants has four to six entry points (

A faithful port would have:
The current PR is roughly an interpretation of Genesis rather than a port, and most of duburcqa's review is asking for the original Genesis behaviour back.

User: it sounds like you are recommending we do the actions under 'a faithful port', is that a fair impression?

Assistant: Yes, that's a fair impression — but with one caveat I should make explicit. I was describing what a faithful port would have looked like, and that aligns closely with what duburcqa is asking for in the review. So if you accept the reviewer's framing, then yes, "a faithful port" is roughly the recommended path.

The caveat: there's one place where Genesis itself may not be the right target. The use-after-free risk on

Two other things worth flagging before you commit to that direction:
So my honest recommendation is: yes, do the faithful-port refactor as the structural backbone (one central helper, full cache shape, both syncs, correct
Add a "Zero-copy interop via DLPack" section to the interop user guide, covering the copy=None/True/False semantics, support matrix (backends, dtypes, Metal/torch version), caching and invalidation rules, Apple Metal double-sync behaviour, lifetime caveats around qd.reset() / qd.init(), and StructField behaviour. Update the intro and summary table to advertise zero-copy as a first-class mechanism alongside copy-based interop and direct kernel pass-through.
Issue: #