
[Perf] Zerocopy interop (migrated from Genesis repo)#450

Draft
hughperkins wants to merge 17 commits into main from hp/zerocopy-interop

Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Add a `copy` parameter (None/True/False) to all `to_torch()` and
`to_numpy()` methods on Fields and Ndarrays.  When `copy` is None
(the default), the methods prefer zero-copy DLPack export with a
cached torch.Tensor view, falling back to the existing kernel-based
copy path when zero-copy is unavailable.

- New module `_interop.py` centralises DLPack → torch conversion
  with per-object caching and Metal sync handling.
- `to_pytorch_type` / `to_numpy_type` now use `lru_cache` to avoid
  repeated dict lookups on dtype conversion.
- `Ndarray.to_torch()` added (previously ndarrays had no to_torch).
- Ndarray._reset() clears the DLPack cache.
- StructField.to_torch/to_numpy pass `copy` through to children.
Covers ScalarField, MatrixField (vector + matrix), all three Ndarray
types, StructField pass-through, caching behaviour, copy=True/False
semantics, memory aliasing, and error paths (device mismatch, dtype
conversion with copy=False).
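
The copy=None/True/False dispatch described above can be sketched in pure Python (illustrative only, no Quadrants dependency; path names are stand-ins for the real conversion routines):

```python
def resolve_conversion_path(copy, zerocopy_available):
    """Mirror the documented copy semantics.

    copy=None  -> prefer zero-copy, fall back to the kernel copy path
    copy=True  -> always an independent copy
    copy=False -> zero-copy or error
    """
    if copy is False:
        if not zerocopy_available:
            raise ValueError("zero-copy not available for this backend/dtype")
        return "zero-copy view"
    if copy is True:
        return "independent copy"
    # copy is None: prefer zero-copy, fall back to kernel-based copy
    return "zero-copy view" if zerocopy_available else "kernel copy"
```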
- Reformat inline imports to multi-line style (ruff I001)
- Add pylint disable=C0415 for lazy torch imports in _interop.py
- Use torch.from_dlpack() instead of torch.utils.dlpack.from_dlpack()
  to fix pyright reportAttributeAccessIssue
@hughperkins force-pushed the hp/zerocopy-interop branch from 6e86a40 to a7cd921 on April 5, 2026 17:11
…test

- Add dtype parameter to can_zerocopy(); only i32/i64/f32/f64/u1 are
  supported by the C++ DLPack implementation -- f16, u8, u16 etc. now
  correctly fall back to the kernel copy path
- Pass dtype through all can_zerocopy() call sites (field.py, matrix.py)
- Add can_zerocopy() gate to Ndarray.to_torch() with kernel copy fallback
  for unsupported dtypes/backends (was previously unconditional DLPack)
- Force copy=True for StructField children since struct SNodes use
  interleaved memory that DLPack cannot represent correctly
- Relax test_fused_kernels assertion: to_torch() may compile 0 kernels
  (zero-copy) or 1 kernel (copy path)
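
The dtype gate from the first bullet, sketched; the whitelist below is copied from the commit message, while the authoritative check lives in the C++ DLPack implementation:

```python
# Dtypes the commit says the C++ DLPack path supports; anything else
# (f16, u8, u16, ...) falls back to the kernel copy path.
_DLPACK_DTYPES = frozenset({"i32", "i64", "f32", "f64", "u1"})

def can_zerocopy_dtype(dtype_name: str) -> bool:
    return dtype_name in _DLPACK_DTYPES
```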
@hughperkins force-pushed the hp/zerocopy-interop branch from a7cd921 to c0b3bdc on April 5, 2026 17:15
Several docstrings were incorrectly wrapped at 80 chars. Restored
original formatting and consolidated copy parameter docs to single lines.
Restore original backtick styles, pylint/TODO comments, and docstring
wrapping that were unnecessarily changed alongside the zero-copy work.

Provides cached, zero-copy conversion of Fields and Ndarrays to PyTorch tensors using the DLPack protocol.
The returned tensors directly alias the Quadrants device memory -- modifications to either side are visible
in the other (after ``qd.sync()`` on Apple Metal).
Contributor

"(after ``qd.sync()`` on Apple Metal)" is inaccurate: torch.mps.synchronize() is also required in case of copy.

)


@functools.lru_cache(maxsize=1)
Contributor

LRU cache is slow. It should be profiled, but I would not be surprised if systematic recomputation is faster. I would recommend defining a global constant instead of a function.
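
The reviewer's alternative, sketched with a hypothetical `_probe()` standing in for whatever the cached function actually computes:

```python
import functools

def _probe():
    # Stand-in for the real computation behind the cached function.
    return ("cpu", "cuda")

# Pattern in the PR: every call pays the lru_cache wrapper and dict lookup.
@functools.lru_cache(maxsize=1)
def supported_backends_cached():
    return _probe()

# Reviewer's suggestion: computed exactly once at import time,
# then served by a plain module-attribute lookup.
SUPPORTED_BACKENDS = _probe()
```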


Args:
    is_field: ``True`` for SNode-backed Fields, ``False`` for Ndarrays.
    dtype: The Quadrants dtype (e.g. ``qd.f32``). Types not in the C++ DLPack whitelist (f16, u8, …) return False.
Contributor

Avoid non-ascii char '…'.

Comment on lines +34 to +35
parts = torch.__version__.replace("+", ".").split(".")[:3]
return tuple(map(int, parts)) > (2, 9, 1)
Contributor

If you are already relying on the `packaging` package somewhere else, using it here would be more robust. If not, just keep this.
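
If `packaging` is (or becomes) a dependency, the version check could look like this; `is_new_enough` is a hypothetical name, and the `2.9.1` threshold is taken from the quoted diff above:

```python
from packaging.version import Version

def is_new_enough(version_string: str, minimum: str = "2.9.1") -> bool:
    # Version() handles local segments such as "2.10.0+cu121" directly,
    # unlike the manual replace("+", ".").split(".") tuple comparison.
    return Version(version_string) > Version(minimum)
```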

Comment on lines +66 to +91
def dlpack_to_torch(obj):
    """Return a cached zero-copy ``torch.Tensor`` view of a Field or Ndarray via DLPack.

    On the first call the DLPack capsule is created and wrapped into a ``torch.Tensor`` which is then stored on the
    source object as ``_qd_dlpack_tc``. Subsequent calls return the cached tensor directly (O(1)).

    On Apple Metal an explicit ``sync()`` is performed after the initial export to ensure data visibility.
    """
    try:
        return obj._qd_dlpack_tc
    except AttributeError:
        import torch  # pylint: disable=C0415

        tc = torch.from_dlpack(obj.to_dlpack())
        obj._qd_dlpack_tc = tc
        if impl.current_cfg().arch == _ARCH_METAL:
            impl.get_runtime().sync()
        return tc


def invalidate_zerocopy_cache(obj) -> None:
    """Remove the cached DLPack torch tensor, if any."""
    try:
        del obj._qd_dlpack_tc
    except AttributeError:
        pass
Contributor

This caching mechanism is missing transpose. This is important for efficiency.

Contributor
@duburcqa duburcqa Apr 6, 2026

This caching mechanism is missing numpy view. This is important for efficiency.
NumPy>=1.22 (genesis requires >=1.26) supports np.from_dlpack() natively. Using it would skip importing torch and the torch tensor intermediary for numpy-only workloads. Rather than caching a .numpy() derived from a torch tensor, a np.from_dlpack() view could be cached independently. It also avoids the torch import when the user never uses torch at all.
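
A sketch of the numpy-side cache the reviewer describes, using `np.from_dlpack` directly; the `_qd_dlpack_np` attribute name is hypothetical, mirroring the PR's `_qd_dlpack_tc`:

```python
import numpy as np

def dlpack_to_numpy(obj):
    """Cached zero-copy NumPy view via DLPack -- no torch import involved.

    `obj` must implement the DLPack protocol (__dlpack__/__dlpack_device__),
    as Quadrants objects would need to for this to work.
    """
    try:
        return obj._qd_dlpack_np
    except AttributeError:
        view = np.from_dlpack(obj)
        obj._qd_dlpack_np = view
        return view
```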

Comment on lines +78 to +81
from quadrants.lang._interop import (  # pylint: disable=C0415
    can_zerocopy,
    dlpack_to_torch,
)
Contributor

I don't see any good reason to keep this import local. No obvious circular deps or direct dependency to torch. Maybe I'm missing something.

Uses zero-copy when possible (all backends except Vulkan, and when the dtype is supported by DLPack).

Args:
    copy: ``None``/``False`` return a zero-copy view, ``True`` returns an independent copy.
Contributor

That is not what None does.

if copy is False:
    raise ValueError("Zero-copy not available for this backend/dtype combination")

import torch  # pylint: disable=C0415
Contributor

Personally I prefer to move local imports to the top of their local scope.

Comment thread python/quadrants/lang/_ndarray.py Outdated
try:
    tc = dlpack_to_torch(self)
    if tc.device.type == "cpu":
        return tc.numpy()
Contributor

This is very inefficient. Not acceptable.

Comment thread python/quadrants/lang/_ndarray.py Outdated
    if tc.device.type == "cpu":
        return tc.numpy()
except (ImportError, RuntimeError):
    pass
Contributor

I'm not in favour of using try-except if the condition can be checked. What about storing a flag directly at instance level to keep track of whether zero-copy is supported by this tensor?
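
What the reviewer suggests, sketched: decide zero-copy support once at construction and branch on a plain instance flag instead of try/except around the export (all names below are hypothetical; `_can_zerocopy` stands in for the PR's `can_zerocopy(is_field=..., dtype=...)`):

```python
def _can_zerocopy(dtype_name):
    # Stand-in for the real backend/dtype support check.
    return dtype_name in {"i32", "i64", "f32", "f64", "u1"}

class NdarraySketch:
    def __init__(self, dtype_name):
        self.dtype_name = dtype_name
        # Evaluated exactly once per instance, instead of probing the
        # DLPack export inside try/except on every to_numpy() call.
        self._zerocopy_ok = _can_zerocopy(dtype_name)

    def to_numpy_path(self):
        return "zero-copy" if self._zerocopy_ok else "kernel copy"
```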

Comment thread python/quadrants/lang/field.py Outdated
if copy is False:
    raise ValueError("Zero-copy to numpy requires a CPU backend")
elif copy is False:
    raise ValueError("Zero-copy not available for this backend/type combination")
Contributor

It would be nice to be more specific.

Comment on lines +1376 to +1378
as_vector = self.m == 1 and not keep_dims
shape_ext = (self.n,) if as_vector else (self.n, self.m)
expected = self.shape + shape_ext
Contributor

Duplicated logics.

Comment thread python/quadrants/lang/matrix.py Outdated
Comment on lines +1328 to +1330
as_vector = self.m == 1 and not keep_dims
shape_ext = (self.n,) if as_vector else (self.n, self.m)
expected = self.shape + shape_ext
Contributor

Duplicated logics.

Comment thread python/quadrants/lang/matrix.py Outdated
    dlpack_to_torch,
)

if copy is not True and can_zerocopy(is_field=True, dtype=self.dtype, shape=self.shape):
Contributor
@duburcqa duburcqa Apr 6, 2026

This condition is wrong. You should prefer zero-copy even if a copy is requested. This is orthogonal. Zero-copy is always faster even if you need to copy in the end.

Contributor

You should use zero-copy systematically, and call torch.Tensor.numpy() on the CPU backend.
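
A sketch of the control flow these two comments ask for: no `copy is not True` guard -- take the zero-copy path whenever it is available, and clone at the end iff a copy was requested (pure-Python stand-ins for the real helpers):

```python
def convert(backing, zerocopy_available, copy=None):
    """Reviewer-suggested rule: zero-copy export is orthogonal to `copy`.

    Always export via DLPack when the backend/dtype allows it, then clone
    the resulting view iff an independent copy was requested. Zero-copy is
    faster even when the caller ultimately wants a copy.
    """
    if zerocopy_available:
        view = ("view", backing)          # stands in for dlpack_to_torch(...)
        return ("clone", backing) if copy is True else view
    return ("kernel-copy", backing)       # existing kernel-based fallback
```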


if copy is not True and can_zerocopy(is_field=False, dtype=self.dtype):
    tc = dlpack_to_torch(self)
    return tc.clone() if copy is True else tc
Contributor

`copy is True` is a dead branch here, but it should not be. You should simply not guard with `copy is not True`.

    dlpack_to_torch,
)

if copy is not True and can_zerocopy(is_field=False, dtype=self.dtype):
Contributor

can_zerocopy(is_field=False, dtype=self.dtype) should be stored as a constant attribute of the instance for efficiency.

        return tc


def invalidate_zerocopy_cache(obj) -> None:
Contributor

What about cache invalidation for ScalarField and MatrixField, which have no corresponding _reset hook?

Dict[str, Union[numpy.ndarray, Dict]]: The result NumPy array.
"""
return {k: v.to_numpy() for k, v in self._items}
# Struct children have interleaved SNode memory; DLPack cannot represent this, so always use kernel copy.
Contributor

Really?

Contributor

struct members are separate SNode leaves with contiguous storage each, individual members could be zero-copied via DLPack from what I understand.

Comment on lines +84 to +85
tc = dlpack_to_torch(self)
return tc.clone() if copy is True else tc
Contributor

You are missing a call to torch.mps.synchronize() before clone on the Metal backend.

raise ValueError(
    f"copy=False is incompatible with device transfer (data on {tc.device}, requested {device})"
)
return tc.to(device)
Contributor

Missing call to torch.mps.synchronize() on metal backend.

- Use torch.utils.dlpack.from_dlpack to satisfy pyright's reportPrivateImportUsage
- Convert quadrants DataTypeCxx to numpy dtype before astype() to fix reportArgumentType

Made-with: Cursor
When torch is not installed, ScalarField.to_numpy() and MatrixField.to_numpy()
were raising ModuleNotFoundError because the new zero-copy DLPack shortcut tries
to import torch unconditionally. Wrap the dlpack_to_torch() call in a try/except
ImportError (mirroring the existing handling in Ndarray._ndarray_to_numpy) so we
silently fall back to the original kernel-copy path. This fixes test_random.py
on the Vulkan CI runner, which has no torch installed.

Made-with: Cursor
The new zero-copy to_torch method was added to MatrixNdarray, Ndarray,
ScalarNdarray, and VectorNdarray, but the user_api expected lists in
test_api.py weren't updated, causing test_api[*-Ndarray] failures.

Made-with: Cursor
On Apple Metal, to_torch() now returns an mps:0 tensor via DLPack zero-copy.
Three tests assumed the result was always on CPU and failed when mixed with
CPU tensors:
- test_zerocopy_interop._to_cpu: only moved to CPU on v520 amdgpu; now also
  moves whenever the tensor is not already on CPU (fixes
  test_matrix_field_to_torch_matches_copy).
- test_torch_io.test_io_simple: compare to_torch() result on CPU.
- test_torch_ad.test_torch_ad: ensure forward output and backward gradient
  match the input device (PyTorch autograd contract).

Made-with: Cursor
The cached zero-copy MPS tensor view aliases Quadrants device memory, but on
Apple Metal the MPS backend does not see pending Quadrants kernel writes until
qd.sync() is called.  Previously sync() ran only on the first export, so
subsequent kernel writes (e.g. test_torch_ad's repeated forward/backward
inside torch.autograd.Function) returned stale data when read back via .to().

Fixes test_torch_ad[arch=metal-2] which was reading zero gradients on
iterations after the first.

Made-with: Cursor
On GPU backends a DLPack-wrapped torch tensor cannot yield a zero-copy
numpy view anyway, and caching the tensor on the field outlives
``qd.reset()`` -- the underlying device memory is freed by the reset,
so the cached tensor's deleter later dereferences a dangling pointer
and segfaults (observed in test_random_seed_per_program on AMD GPU).

Made-with: Cursor
… requested

Zero-copy numpy arrays returned via DLPack alias the underlying C++
runtime memory, which is freed by qd.reset().  This causes a use-after-free
when callers naturally expect numpy arrays to outlive runtime state.

The previous "default to zero-copy when possible" semantic for to_numpy()
broke test_random_seed_per_program on AMD GPU (the test re-inits to the
default x64 backend mid-test, then calls to_numpy() then qd.reset(), then
np.allclose() segfaults reading freed memory).

Restore the safe default: to_numpy() with copy=None or copy=True always
returns an independent kernel-copied numpy array.  Zero-copy is opt-in
via copy=False, with the caller responsible for buffer lifetime.
to_torch() is unchanged -- torch tensors are routinely shared and the
caching/alias behaviour is well understood there.

Made-with: Cursor
The new zero-copy to_torch() path returns a tensor backed by the field's
device memory. On the AMD GPU CI runner, follow-up host-side operations
(indexing, ==, .all()) trigger HIP kernel launches that fail with
hipErrorInvalidDeviceFunction. Mirror the fix already applied to
test_io_simple by calling .cpu() on the to_torch() result before
performing host-side comparisons.

Made-with: Cursor
@hughperkins
Collaborator Author

@duburcqa conversation with Opus about this AI. Thoughts?

User: this code was migrated from genesis, by an AI. Can you compare this code with what is currently in genesis, what is currently in quadrants main, and figure out why the code isn't simply automatically correct, from being ported from Genesis?

Assistant:

Why it isn't automatically correct

Three structural reasons, in order of importance:

1. The original lives in the consumer, not the library. qd_to_python is in Genesis. Genesis owns the entire call path: it controls use_zerocopy, gs.backend, and it owns when qd.reset() is called relative to cache lifetime. When you push the same logic down into Quadrants' own classes, you inherit a different threat model: arbitrary user code can call qd.reset() between the cache fill and the next read, so cached views become use-after-free. The PR's commit history shows the AI hitting exactly this — segfaults on AMD GPU and Metal — and patching around it by disabling the numpy cache and changing the default of to_numpy mid-PR. The reviewer's comment #18 (no _reset hook on ScalarField/MatrixField) is the still-unfixed version of the same problem.

2. The AI re-decided design choices instead of transcribing them. Genesis's design has clear invariants that are easy to miss when refactoring: "always try zerocopy first, then clone if requested" (not "guard zerocopy behind copy preference"); "sync MPS twice — once before reading, once after cloning"; "cache transpose and numpy variants too". The AI re-derived the algorithm from intent and got the boundaries wrong in if copy is not True and can_zerocopy(...), dropped the second MPS sync, and shrunk the cache to a single attribute. These map directly to comments #15/#16/#21, #1/#22/#23, and #5/#6.

3. The Quadrants API is a different shape than the Genesis wrapper. Genesis exposes one entry point, so it gets one cache, one set of conditions, one sync sequence. Quadrants has four to six entry points (Ndarray.to_torch, Ndarray._ndarray_to_numpy, Ndarray._ndarray_matrix_to_numpy, ScalarField.to_torch/to_numpy, MatrixField.to_torch/to_numpy, StructField.to_torch/to_numpy). Each got a copy-pasted version of the same boilerplate, drifting slightly between sites — comments #13 and #14 are just "you have the same logic twice in matrix.py", and the situation is the same across files. The Genesis design can't be ported one-method-at-a-time without first factoring a single helper that all the methods call.

A faithful port would have:

  • Put the whole Genesis algorithm into _interop.py as one helper (qd_to_python-equivalent), kept the four cache attributes (_tc/_T_tc/_np/_T_np), replaced gs.use_zerocopy/gs.backend with can_zerocopy(...) evaluated inside that helper, and made every to_torch/to_numpy a one-liner that calls it.
  • Added _reset hooks to every Field/Ndarray subclass before exposing the cache, since Quadrants exposes qd.reset() to users that Genesis never had to model.
  • Carried over both qd.sync() and torch.mps.synchronize() verbatim.
  • Kept the Genesis decision rule literally: "if zerocopy supported, take that path; clone iff copy=True."

The current PR is roughly an interpretation of Genesis rather than a port, and most of duburcqa's review is asking for the original Genesis behaviour back.

User: it sounds like you are recommending we do the actions under 'a faithful port', is that a fair impression?

Assistant:

Yes, that's a fair impression — but with one caveat I should make explicit.

I was describing what a faithful port would have looked like, and that aligns closely with what duburcqa is asking for in the review. So if you accept the reviewer's framing, then yes, "a faithful port" is roughly the recommended path.

The caveat: there's one place where Genesis itself may not be the right target. The use-after-free risk on qd.reset() is real and not present in Genesis (because Genesis owns the reset call). Inside Quadrants, you can't just copy Genesis verbatim — you also need invalidation hooks on every cache holder. So it's "faithful port + reset hooks", not pure faithful port.

Two other things worth flagging before you commit to that direction:

  1. Genesis's design isn't universally praised in the review either. The reviewer's comment #11 ("don't use try/except, use a flag") is actually a critique of a pattern Genesis uses too. So a fully faithful port would still draw that comment. Minor, but worth knowing.

  2. The to_numpy lifetime question is a genuine design call, not just a port mistake. Your fix commit on 2026-04-20 ("Make to_numpy() always copy unless copy=False is explicitly requested") was reacting to a real segfault. Genesis defaults to zero-copy because Genesis users know about the model. Quadrants has a wider audience and numpy arrays are conventionally expected to outlive their source. Reverting to "default zero-copy for to_numpy" to match Genesis would re-introduce that hazard unless you also fix invalidation. It's worth deciding deliberately rather than just matching Genesis.

So my honest recommendation is: yes, do the faithful-port refactor as the structural backbone (one central helper, full cache shape, both syncs, correct copy=True ⇒ zerocopy-then-clone), AND add reset hooks on all field types, AND have an explicit conversation with duburcqa about the to_numpy default before changing it back.
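
The cache shape from the "faithful port" recommendation (four variants: plain/transposed × torch/numpy, plus a reset hook) can be sketched generically; all names are hypothetical, modelled on the `_tc`/`_T_tc`/`_np`/`_T_np` attributes described above:

```python
# One cache attribute per variant, mirroring Genesis's four-slot design.
_CACHE_ATTRS = {
    (False, False): "_qd_tc",    # torch tensor view
    (True,  False): "_qd_T_tc",  # transposed torch tensor view
    (False, True):  "_qd_np",    # numpy view
    (True,  True):  "_qd_T_np",  # transposed numpy view
}

def cached_view(obj, make, *, transpose=False, as_numpy=False):
    """Create the requested variant once via `make`, then serve it from `obj`."""
    attr = _CACHE_ATTRS[(transpose, as_numpy)]
    try:
        return getattr(obj, attr)
    except AttributeError:
        view = make(transpose=transpose, as_numpy=as_numpy)
        setattr(obj, attr, view)
        return view

def invalidate_all(obj):
    """Reset hook: drop every cached variant (to be called from each
    Field/Ndarray _reset so qd.reset() cannot leave dangling views)."""
    for attr in _CACHE_ATTRS.values():
        if hasattr(obj, attr):
            delattr(obj, attr)
```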

Add a "Zero-copy interop via DLPack" section to the interop user guide,
covering the copy=None/True/False semantics, support matrix (backends,
dtypes, Metal/torch version), caching and invalidation rules, Apple Metal
double-sync behaviour, lifetime caveats around qd.reset() / qd.init(),
and StructField behaviour.

Update the intro and summary table to advertise zero-copy as a first-class
mechanism alongside copy-based interop and direct kernel pass-through.
