diff --git a/proposed/0024-tensor.md b/proposed/0024-tensor.md
new file mode 100644
index 0000000..1ef9ece
--- /dev/null
+++ b/proposed/0024-tensor.md
@@ -0,0 +1,348 @@

- Start Date: 2026-03-04
- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000)

## Summary

We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes the design of a fixed-shape tensor with contiguous backing memory.

## Motivation

#### Tensors in the wild

Tensors are multi-dimensional (n-dimensional) arrays that generalize vectors (1D) and matrices (2D) to arbitrary dimensions. They are common in ML/AI and scientific computing applications. To name just a few examples:

- Image or video data stored as `height x width x channels`
- Multi-dimensional sensor or time-series data
- Embedding vectors from language models and recommendation systems

#### Tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the `FixedSizeList` `DType`, and neither is satisfactory.

The simplest approach is to flatten the tensor into a single `FixedSizeList` whose size is the product of all dimensions (this is what Apache Arrow does). However, this discards shape information entirely: a `2x3` matrix and a `3x2` matrix would both become `FixedSizeList<6>`. Shape metadata must be stored separately, and any dimension-aware operation (slicing along an axis, transposing, etc.) reduces to manual index arithmetic with no type-level guarantees.

The alternative is to nest `FixedSizeList` types, e.g., `FixedSizeList<FixedSizeList<p, n>, m>` for a matrix. This preserves some structure, but becomes unwieldy for higher-dimensional tensors. Axis-specific slicing or indexing on individual tensors (tensor scalars, not tensor arrays) would require custom expressions aware of the specific nesting depth, rather than operating on a single, uniform tensor type.
Additionally, reshaping requires restructuring the entire nested type, and operations like transposes would be difficult to implement correctly.

Beyond these structural issues, neither approach stores shape and stride metadata explicitly, which makes interoperability awkward with external tensor libraries (NumPy, PyTorch, etc.) that expect contiguous memory accompanied by this metadata.

Thus, we propose a dedicated extension type that encapsulates tensor semantics (shape, strides, dimension-aware operations) on top of contiguous, row-major (C-style) backing memory.

## Design

Since the design of extension types has not been fully solved yet (see [RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), the complete design of tensors cannot be fully described here. However, we know enough to present the general idea.

### Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the extension array looks like when it is canonicalized. For tensors, we want this storage type to be a `FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type) and `s` is the product of all dimensions of the tensor.

For example, a tensor of `i32` with dimensions `[2, 3, 4]` would have the storage type `FixedSizeList<i32, 24>`, since `2 x 3 x 4 = 24`.

This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion of why we choose not to represent tensors as nested FSLs (for example `FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2>`), see the [alternatives](#alternatives) section.

### Element Type

We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size elements whose offsets are computable from strides.

Variable-size types (like strings) would break this model entirely.
`Bool` is excluded as well because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This roughly matches PyTorch, which also restricts tensors to a small set of dense numeric types.

Theoretically, we could allow more element types in the future, but doing so should remain a very low priority.

### Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor array), and individual elements within a tensor may be null. However, we do not support nulling out entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).

The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still allowing sparse values within an otherwise dense tensor.

Note that this design is specifically for a dense tensor. A sparse tensor would likely need a different representation (or a different storage type) in order to compress better (likely `List` or `ListView`, since those can compress runs of nulls very well).

### Metadata

Theoretically, we only need the dimensions of the tensor to have a useful Tensor type. However, we likely also want two other pieces of information, the dimension names and the permutation order, mimicking the [Arrow Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor) type (a canonical extension type).

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):

```rust
/// Metadata for a [`Tensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
    /// The shape of the tensor.
    ///
    /// The shape is always defined over row-major storage. May be empty (a 0D scalar tensor) or
    /// contain dimensions of size 0 (a degenerate tensor).
    shape: Vec<usize>,

    /// Optional names for each dimension.
    /// Each name corresponds to a dimension in the `shape`.
    ///
    /// If names exist, there must be an equal number of names to dimensions.
    dim_names: Option<Vec<String>>,

    /// The permutation of the tensor's dimensions, mapping each logical dimension to its
    /// corresponding physical dimension: `permutation[logical] = physical`.
    ///
    /// If this is `None`, then the logical and physical layouts are equal, and the permutation is
    /// the in-order `[0, 1, ..., N-1]`.
    permutation: Option<Vec<usize>>,
}
```

#### Stride

The stride of a tensor defines the number of elements to skip in memory to move one step along each dimension. Rather than storing strides explicitly as metadata, we can efficiently derive them from the shape and permutation. This is possible because the backing memory is always contiguous.

For a row-major tensor with shape `d = [d_0, d_1, ..., d_{n-1}]` and no permutation, the strides are:

```
stride[n-1] = 1                               (innermost dimension always has stride 1)
stride[i]   = d[i+1] * stride[i+1]
            = d[i+1] * d[i+2] * ... * d[n-1]
```

For example, a tensor with shape `[2, 3, 4]` and no permutation has strides `[12, 4, 1]`: moving one step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1. The element at index `[i, j, k]` is located at memory offset `12*i + 4*j + k`.

When a permutation is present, the logical strides are simply the row-major strides permuted accordingly. Continuing the `[2, 3, 4]` example with row-major strides `[12, 4, 1]`, applying the permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`. This reorders which dimensions are contiguous in memory without copying any data.

### Conversions

#### Arrow

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
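The stride derivation above can be sketched in a few lines of Rust. This is a minimal illustration; `row_major_strides` and `logical_strides` are hypothetical free functions, not part of the Vortex API:

```rust
/// Row-major strides for `shape`: the innermost dimension has stride 1, and each
/// outer stride is the product of all dimension sizes to its right.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    strides
}

/// Logical strides under `permutation[logical] = physical`: permute the row-major strides.
fn logical_strides(shape: &[usize], permutation: &[usize]) -> Vec<usize> {
    let physical = row_major_strides(shape);
    permutation.iter().map(|&p| physical[p]).collect()
}

fn main() {
    // A [2, 3, 4] tensor has row-major strides [12, 4, 1] ...
    assert_eq!(row_major_strides(&[2, 3, 4]), vec![12, 4, 1]);
    // ... and applying the permutation [2, 0, 1] yields logical strides [1, 12, 4].
    assert_eq!(logical_strides(&[2, 3, 4], &[2, 0, 1]), vec![1, 12, 4]);
}
```

Note that deriving strides this way (rather than storing them) is exactly what rules out non-contiguous layouts: every representable tensor is a permutation of a dense row-major buffer.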
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between the two representations.

#### NumPy and PyTorch

Libraries like NumPy and PyTorch store strides as an independent, first-class field on their tensor objects. This allows them to represent non-contiguous views of memory.

For example, slicing every other row of a matrix produces a view with a doubled row stride, sharing memory with the original without copying. However, this means that non-contiguous tensors can appear anywhere, and kernels must handle arbitrary stride patterns. In practice, many PyTorch operations require calling `.contiguous()` before proceeding.

Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch, since both libraries can construct a view from a pointer, shape, and strides. Going the other direction, we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.

Our proposed design for Vortex Tensors handles non-contiguous operations differently from the Python libraries. Rather than mutating strides to create non-contiguous views, operations like slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy `Expression`s over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where compute is deferred and composed rather than eagerly applied.

The exact mechanism for defining expressions over extension types is still being designed (see [RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), but the intent is that tensor-specific operations like axis slicing, indexing, and reshaping would be custom expressions registered for the tensor extension type.

### Edge Cases: 0D and Size-0 Dimensions

We will support two edge cases that arise naturally from the tensor model.
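Both edge cases fall out of how the element count is computed. In Rust this is just `Iterator::product`, whose product over an empty iterator is the multiplicative identity (a standalone illustration, not Vortex code):

```rust
fn main() {
    // Element count is the product of the shape's dimensions. `product()` over an
    // empty iterator returns the multiplicative identity, 1.
    let count = |shape: &[usize]| -> usize { shape.iter().product() };

    assert_eq!(count(&[2, 3, 4]), 24); // ordinary dense tensor
    assert_eq!(count(&[]), 1);         // 0D scalar tensor: exactly one element
    assert_eq!(count(&[3, 0, 4]), 0);  // degenerate tensor with a size-0 dimension
}
```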
Recall that the number of elements in a tensor is the product of its shape dimensions, and that the [empty product](https://en.wikipedia.org/wiki/Empty_product) is 1 (the multiplicative identity).

#### 0-dimensional tensors

0D tensors have an empty shape `[]` and contain exactly one element (since the product of no dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is `FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).

#### Size-0 dimensions

Shapes may contain dimensions of size 0 (e.g., `[3, 0, 4]`), which produce tensors with zero elements (since the product includes a 0 factor). The storage type is a degenerate `FixedSizeList<p, 0>`, which Vortex already handles well.

#### Compatibility

Both NumPy and PyTorch support these cases. NumPy fully supports 0D arrays with shape `()`, and dimensions of size 0 are valid (e.g., `np.zeros((3, 0, 4))`). PyTorch has supported 0D tensors since v0.4.0 and also allows size-0 dimensions.

Arrow's Fixed Shape Tensor spec, however, requires at least one dimension (`ndim >= 1`), so 0D tensors would need special handling during Arrow conversion (we would likely just panic).

### Compression

Since the storage type is `FixedSizeList` over numeric types, Vortex's existing encodings (like ALP, FastLanes, etc.) will apply transparently to the flattened primitive buffer.

However, there may be tensor-specific compression opportunities we could take advantage of. We leave this as an open question.

### Scalar Representation

Once we add the `ScalarValue::Array` variant (see tracking issue [vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors as `ArrayRef` scalars as well as lazily computed slices.

The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension scalar can be unpacked into.
We will want a `NativeTensor<'a>` type that references the backing memory of the Tensor, and we can add useful operations to that type.

## Compatibility

Since this is a new type built on an existing canonical type (`FixedSizeList`), there should be no compatibility concerns.

## Drawbacks

- **Fixed shape only**: This design only supports tensors where every element in the array has the same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different type entirely.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even more surface area than we already have.

## Alternatives

### Nested `FixedSizeList`

Rather than a flat `FixedSizeList` with metadata, we could represent tensors as nested `FixedSizeList` types (e.g., `FixedSizeList<FixedSizeList<FixedSizeList<p, 4>, 3>, 2>` for a `[2, 3, 4]` tensor). This has several disadvantages:

- Each nesting level introduces its own validity bitmap, even though sub-dimensional nullability is not meaningful for tensors. This wastes space and complicates null-handling logic.
- This does not match Arrow's canonical Fixed Shape Tensor type, making zero-copy conversion impossible.
- Expressions would need to be aware of the nesting depth, and operations like transpose or reshape would require restructuring the type itself rather than updating metadata.

### Do nothing

Users could continue to use `FixedSizeList` directly with out-of-band shape metadata. This works for simple storage, but as discussed in the [motivation](#motivation), it provides no type-level support for tensor operations and makes interoperability with tensor libraries awkward.

## Prior Art

_Note: This section was Claude-researched._

### Columnar formats

- **Apache Arrow** defines a [Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor) canonical extension type.
  Our design closely follows Arrow's approach: a flat `FixedSizeList` storage type with shape, dimension names, and permutation metadata. Arrow also defines a [Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor) extension type for ragged tensors, which could inform future work.
- **Lance** delegates entirely to Arrow's type system, including extension types. Arrow extension metadata (and therefore tensor metadata) is preserved end-to-end through Lance's storage layer, which validates the approach of building tensor semantics as an extension on top of `FixedSizeList` storage.
- **Parquet** has no native `FixedSizeList` logical type. Arrow's `FixedSizeList` is stored as a regular `LIST` in Parquet, which adds conversion overhead via repetition levels. There is active discussion about introducing `FixedSizeList` as a Parquet logical type, partly motivated by tensor and embedding workloads.

### Database systems

- **DuckDB** has a native `ARRAY` type (a fixed-size list) but no dedicated tensor type. Community discussions have proposed adding one, noting that nested `ARRAY` types can simulate multi-dimensional arrays but lack tensor-specific operations.
- **DataFusion** uses Arrow's type system directly and has no dedicated tensor type. There is open discussion about a logical type layer that could support extension types as first-class citizens.

### Tensor libraries

- **NumPy** and **PyTorch** both represent tensors as contiguous (or non-contiguous) memory with shape and stride metadata. Our design is a subset of this model: we always require contiguous memory and derive strides from shape and permutation, as discussed in the [conversions](#conversions) section.
- **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that provides a unified interface for representing and manipulating NumPy array indices (slices, integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions in Vortex similar to the operations ndindex provides: for example, computing the result shape of a slice or translating a logical index into a physical offset.

### Academic work

- **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program. Each dimension can independently be specified as dense or sparse, and dimensions can be reordered. The Vortex approach of storing tensors as flat contiguous memory with a permutation is one specific point in TACO's format space (all dimensions dense, with a specific dimension ordering).

## Unresolved Questions

- Are two tensors with different permutations but the same logical values considered equal? This affects deduplication and comparisons. The type metadata might differ while the tensor values are logically equal, so it seems strange to say they are not equal.
- Are there tensor-specific compression schemes we can take advantage of?

## Future Possibilities

#### Variable-shape tensors

Arrow defines a [Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor) extension type for arrays where each tensor can have a different shape. This would enable workloads like batched sequences of different lengths.

#### Sparse tensors

A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently represent tensors with many null or zero elements, as noted in the [validity](#validity) section.
#### Tensor-specific encodings

Beyond general-purpose compression, encodings tailored to tensor data (e.g., exploiting spatial locality across dimensions) could improve compression ratios for specific workloads.

#### ndindex-style compute expressions

As the extension type expression system matures, we can implement a rich set of tensor indexing and slicing operations inspired by [ndindex](https://quansight-labs.github.io/ndindex/index.html), including slice canonicalization, shape inference, and chunk-level re-indexing.
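As a flavor of what such expressions might compute, here is a hedged sketch of per-dimension slice shape inference in Rust. The `SliceSpec` type and `result_len` helper are hypothetical illustrations of the kind of canonicalized arithmetic ndindex performs, not a proposed Vortex API:

```rust
/// A canonical, non-negative slice along one dimension: `start..stop` with `step >= 1`.
/// (Hypothetical illustration; not a proposed Vortex API.)
struct SliceSpec {
    start: usize,
    stop: usize,
    step: usize,
}

impl SliceSpec {
    /// Number of elements selected from a dimension of size `dim`, clamping the
    /// slice bounds to the dimension (matching Python/NumPy slicing semantics).
    fn result_len(&self, dim: usize) -> usize {
        let stop = self.stop.min(dim);
        let start = self.start.min(stop);
        (stop - start).div_ceil(self.step)
    }
}

fn main() {
    // Slicing every other row of a [6, 4] matrix selects rows 0, 2, 4 → result shape [3, 4].
    let rows = SliceSpec { start: 0, stop: 6, step: 2 };
    assert_eq!(rows.result_len(6), 3);

    // Out-of-bounds stops are clamped, as in NumPy: `0..100` over a dimension of 4 selects 4.
    let all = SliceSpec { start: 0, stop: 100, step: 1 };
    assert_eq!(all.result_len(4), 4);
}
```

Evaluating such an expression against a Vortex tensor would then materialize a new contiguous tensor of the inferred shape, consistent with the lazy-expression design in the [conversions](#conversions) section.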