Fixed-Shape Tensor RFC revisions #25
@@ -3,8 +3,8 @@

 ## Summary

-We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes
-the design of a fixed-shape tensor with contiguous backing memory.
+We would like to add a `FixedShapeTensor` type to Vortex as an extension over `FixedSizeList`. This
+RFC proposes the design of a fixed-shape tensor with contiguous backing memory.

 ## Motivation

@@ -18,7 +18,7 @@ name just a few examples:

 - Multi-dimensional sensor or time-series data
 - Embedding vectors from language models and recommendation systems

-#### Tensors in Vortex
+#### Fixed-shape tensors in Vortex

 In the current version of Vortex, there are two ways to represent fixed-shape tensors using the
 `FixedSizeList` `DType`, and neither seems satisfactory.

@@ -54,44 +54,44 @@ fully described here. However, we do know enough that we can present the general

 ### Storage Type

-Extension types in Vortex require defining a canonical storage type that represents what the
-extension array looks like when it is canonicalized. For tensors, we will want this storage type to
-be a `FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type), and
-where `s` is the product of all dimensions of the tensor.
+Extension types in Vortex require defining a canonical storage type that represents what the
+extension array looks like when it is canonicalized. For fixed-shape tensors, we will want this
+storage type to be a `FixedSizeList<p, s>`, where `p` is a primitive type (like `u8`, `f64`, etc.),
+and where `s` is the product of all dimensions of the tensor.

 For example, if we want to represent a tensor of `i32` with dimensions `[2, 3, 4]`, the storage type
 for this tensor would be `FixedSizeList<i32, 24>` since `2 x 3 x 4 = 24`.

-This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion on why
-we choose not to represent tensors as nested FSLs (for example
+This is equivalent to the design of Arrow's canonical Fixed Shape Tensor extension type. For
+discussion on why we choose not to represent tensors as nested FSLs (for example
 `FixedSizeList<FixedSizeList<FixedSizeList<i32, 2>, 3>, 4>`), see the [alternatives](#alternatives)
 section.

 ### Element Type

-We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense
-numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
+We restrict tensor element types to `Primitive`. Tensors are fundamentally about dense numeric
+computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
 elements whose offsets are computable from strides.

-Variable-size types (like strings) would break this model entirely. `Bool` is excluded as well
-because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This
-matches PyTorch, which also restricts tensors to numeric types.
+Variable-size types (like strings) would break this model entirely. `Bool` is excluded because
+Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. `Decimal` is
+excluded because there are no fast implementations of tensor operations (e.g., matmul) for
+fixed-point types. This matches PyTorch, which also restricts tensors to floating-point and integer
+primitive types.

-Theoretically, we could allow more element types in the future, but it should remain a very low
-priority.
+We could allow more element types in the future if a compelling use case arises, but it should
+remain a very low priority.

 ### Validity

-We define two layers of nullability for tensors: the tensor itself may be null (within a tensor
-array), and individual elements within a tensor may be null. However, we do not support nulling out
-entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).
+Nullability exists only at the tensor level: within a tensor array, an individual tensor may be
+null, but elements within a tensor may not be. This is because tensor operations like matmul cannot
+be efficiently implemented over nullable elements, and most tensor libraries (e.g., PyTorch) do not
+support per-element nulls either.

-The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the
-backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still
-allowing sparse values within an otherwise dense tensor.
+Since the storage type is `FixedSizeList`, the validity of the tensor array is inherited from the
+`FixedSizeList`'s own validity bitmap (one bit per tensor, not per element).

-Note that this design is specifically for a dense tensor. A sparse tensor would likely need to have
-a different representation (or different storage type) in order to compress better (likely `List` or
-`ListView` since it can compress runs of nulls very well).
+This is a restriction we can relax in the future if a compelling use case arises.

 ### Metadata

@@ -100,12 +100,13 @@ likely also want two other pieces of information, the dimension names and the pe

 which mimics the [Arrow Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
 type (which is a Canonical Extension type).

-Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):
+Here is what the metadata of the `FixedShapeTensor` extension type in Vortex will look like (in
+Rust):

 ```rust
-/// Metadata for a [`Tensor`] extension type.
+/// Metadata for a [`FixedShapeTensor`] extension type.
 #[derive(Debug, Clone, PartialEq, Eq, Hash)]
-pub struct TensorMetadata {
+pub struct FixedShapeTensorMetadata {
     /// The shape of the tensor.
     ///
     /// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or

@@ -126,7 +127,7 @@ pub struct TensorMetadata {

 }
 ```

-#### Stride
+### Stride

 The stride of a tensor defines the number of elements to skip in memory to move one step along each
 dimension. Rather than storing strides explicitly as metadata, we can efficiently derive them from

@@ -145,19 +146,53 @@ For example, a tensor with shape `[2, 3, 4]` and no permutation has strides `[12

 step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1.
 The element at index `[i, j, k]` is located at memory offset `12*i + 4*j + k`.

-When a permutation is present, the logical strides are simply the row-major strides permuted
-accordingly. Continuing the `[2, 3, 4]` example with row-major strides `[12, 4, 1]`, applying the
-permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`. This reorders which dimensions are
-contiguous in memory without copying any data.
+### Physical vs. logical shape

+When a permutation is present, stride derivation depends on whether `shape` is stored as physical
+or logical (see [unresolved questions](#unresolved-questions)). If `shape` is **physical**
+(matching Arrow's convention), the process is straightforward: compute row-major strides over the
+stored shape, then permute them to get logical strides
+(`logical_stride[i] = physical_stride[perm[i]]`).

+Continuing the example with physical shape `[2, 3, 4]` and permutation `[2, 0, 1]`, the physical
+strides are `[12, 4, 1]` and the logical strides are
+`[physical_stride[2], physical_stride[0], physical_stride[1]]` = `[1, 12, 4]`.

+If `shape` is **logical**, we must first invert the permutation to recover the physical shape
+(`physical_shape[perm[l]] = shape[l]`), compute row-major strides over that, then map them back to
+logical order.

+For the same example with logical shape `[4, 2, 3]` and permutation `[2, 0, 1]`: the physical
+shape is `[2, 3, 4]`, physical strides are `[12, 4, 1]`, and logical strides are `[1, 12, 4]`.

+We want to emphasize that this is the same result, but with an extra inversion step. In either
+case, logical strides are always a permutation of the physical strides.

+The choice of whether `shape` stores physical or logical dimensions also affects interoperability
+with [Arrow](#arrow) and [NumPy/PyTorch](#numpy-and-pytorch) (see those sections for details), as
+well as stride derivation complexity.

+Physical shape favors Arrow compatibility and simpler stride math. Logical shape favors
+NumPy/PyTorch compatibility and is arguably more intuitive for our users since Vortex has a logical
+type system.
|

> **Review comment:** FWIW, I think torch/numpy integration matters more for tensors than arrow
> compatibility. There's no linear algebra library that natively works on arrow arrays.
>
> **Author's reply:** I agree, and the conversion will be cheap regardless.

+The cost of conversion in either direction is a cheap O(ndim) permutation at the boundary, so the
+difference is more about convention than performance.
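
The two derivation paths described above can be sketched in a few lines of Python. This is an
illustrative sketch only; the helper names are hypothetical and not part of any Vortex API:

```python
from math import prod

def row_major_strides(shape):
    # Stride of dimension i is the product of all later dimensions;
    # the empty product is 1, so the innermost dimension has stride 1.
    return [prod(shape[i + 1:]) for i in range(len(shape))]

def logical_strides_from_physical(physical_shape, perm):
    # Physical shape stored: logical_stride[i] = physical_stride[perm[i]].
    physical_strides = row_major_strides(physical_shape)
    return [physical_strides[p] for p in perm]

def logical_strides_from_logical(logical_shape, perm):
    # Logical shape stored: first invert the permutation to recover the
    # physical shape (physical_shape[perm[l]] = shape[l]), then derive
    # strides exactly as in the physical-shape case.
    physical_shape = [0] * len(logical_shape)
    for l, p in enumerate(perm):
        physical_shape[p] = logical_shape[l]
    return logical_strides_from_physical(physical_shape, perm)
```

Both paths reproduce the worked example: physical shape `[2, 3, 4]` (equivalently, logical shape
`[4, 2, 3]`) with permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`.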

 ### Conversions

 #### Arrow

-Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical
-extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
-The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between
-the two representations.
+Our storage type and metadata are designed to closely match Arrow's Fixed Shape Tensor canonical
+extension type. The `FixedSizeList` backing buffer, dimension names, and permutation pass through
+unchanged, making the data conversion itself zero-copy (for tensors with at least one dimension).

+Arrow stores `shape` as **physical** (the dimensions of the row-major layout). Whether the `shape`
+field passes through directly depends on the outcome of the
+[physical vs. logical shape](#physical-vs-logical-shape) open question. If Vortex adopts the same
+convention, shape maps directly. If Vortex stores logical shape instead, conversion requires a
+cheap O(ndim) scatter: `arrow_shape[perm[i]] = vortex_shape[i]`.

 #### NumPy and PyTorch

@@ -169,18 +204,28 @@ memory with the original without copying. However, this means that non-contiguou

 anywhere, and kernels must handle arbitrary stride patterns. PyTorch supposedly requires many
 operations to call `.contiguous()` before proceeding.

-Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch since
-both libraries can construct a view from a pointer, shape, and strides. Going the other direction,
-we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.
+NumPy and PyTorch store `shape` as **logical** (the dimensions the user indexes with). If Vortex
+also stores logical shape, the shape field passes through unchanged. If Vortex stores physical
+shape, a cheap O(ndim) permutation is needed at the boundary (see
+[physical vs. logical shape](#physical-vs-logical-shape)).

+Since Vortex fixed-shape tensors always have dense backing memory, we can always zero-copy _to_
+NumPy and PyTorch by passing the buffer pointer, logical shape, and logical strides. A permuted
+Vortex tensor will appear as a non-C-contiguous view in these libraries, which they handle natively.

+Going the other direction, we can zero-copy _from_ any NumPy/PyTorch tensor whose memory is dense
+(no gaps), even if it is not C-contiguous. A Fortran-order or otherwise permuted tensor can be
+represented by deriving the appropriate permutation from its strides. Only tensors with actual
+memory gaps (e.g., strided slices like `arr[::2]`) require a copy.
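
The dense-memory check can be sketched as follows. `dense_permutation` is a hypothetical helper
(not a Vortex or NumPy API) that takes a shape and strides measured in elements, as one would
obtain by dividing NumPy's byte strides by the item size:

```python
def dense_permutation(shape, strides):
    # Order dimensions by decreasing stride: this is the physical
    # (row-major) dimension order of the backing memory.
    order = sorted(range(len(shape)), key=lambda d: strides[d], reverse=True)
    # Walk from the innermost physical dimension outward, checking that each
    # stride equals the total size of everything inside it (i.e. no gaps).
    expected = 1
    for d in reversed(order):
        if strides[d] != expected:
            return None  # memory has gaps (e.g. arr[::2]); a copy is needed
        expected *= shape[d]
    # Invert the ordering: perm[i] is the physical position of logical dim i.
    perm = [0] * len(shape)
    for k, d in enumerate(order):
        perm[d] = k
    return perm
```

A C-contiguous `[2, 3, 4]` tensor (strides `[12, 4, 1]`) yields the identity permutation, a
Fortran-order one (strides `[1, 2, 6]`) yields `[2, 1, 0]`, and a gapped slice such as shape
`[2, 4]` with strides `[8, 1]` is rejected. Size-1 and broadcast (zero-stride) dimensions would
need extra care in a real implementation.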

-Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the
+Our proposed design for Vortex `FixedShapeTensor` will handle operations differently than the
 Python libraries. Rather than mutating strides to create non-contiguous views, operations like
-slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
-`Expression`s over the tensor.
+slicing, indexing, and reordering dimensions would be expressed as lazy `Expression`s over the
+tensor.

 These expressions describe the operation without materializing it, and when evaluated, they produce
-a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where
-compute is deferred and composed rather than eagerly applied.
+a new tensor with dense backing memory. This fits naturally into Vortex's existing lazy compute
+system, where compute is deferred and composed rather than eagerly applied.

 The exact mechanism for defining expressions over extension types is still being designed (see
 [RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), but the intent is that tensor-specific
|
|

@@ -197,7 +242,7 @@ elements in a tensor is the product of its shape dimensions, and that the

 0D tensors have an empty shape `[]` and contain exactly one element (since the product of no
 dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is
-`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).
+`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray`).
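
As a quick illustration of the size arithmetic: the number of backing elements is the product of
the shape's dimensions, and the product of an empty shape is 1.

```python
from math import prod

# Shape [2, 3, 4] is backed by FixedSizeList<p, 24>; a 0D tensor with
# shape [] is backed by FixedSizeList<p, 1>, since the empty product is 1.
assert prod([2, 3, 4]) == 24
assert prod([]) == 1
```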

 #### Size-0 dimensions

@@ -225,12 +270,12 @@ leave this as an open question.

 ### Scalar Representation

 Once we add the `ScalarValue::Array` variant (see tracking issue
-[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors
-as `ArrayRef` scalars as well as lazily computed slices.
+[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around
+fixed-shape tensors as `ArrayRef` scalars as well as lazily computed slices.

 The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension
-scalar can be unpacked into. We will want a `NativeTensor<'a>` type that references the backing
-memory of the Tensor, and we can add useful operations to that type.
+scalar can be unpacked into. We will want a `NativeFixedShapeTensor<'a>` type that references the
+backing memory of the tensor, and we can add useful operations to that type.

 ## Compatibility

@@ -242,8 +287,8 @@ compatibility concerns.

 - **Fixed shape only**: This design only supports tensors where every element in the array has the
   same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different
   type entirely.
-- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even
-  more surface area than we already have.
+- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means
+  even more surface area than we already have.

 ## Alternatives
|
|

@@ -301,6 +346,12 @@ _Note: This section was Claude-researched._

   shape and stride metadata. Our design is a subset of this model — we always require contiguous
   memory and derive strides from shape and permutation, as discussed in the
   [conversions](#conversions) section.
+- **[xarray](https://docs.xarray.dev/en/stable/)** extends NumPy with named dimensions and
+  coordinate labels. Its
+  [data model](https://docs.xarray.dev/en/stable/user-guide/terminology.html) attaches names to
+  each dimension and associates "coordinate" arrays along those dimensions (e.g., latitude and
+  longitude values for the rows and columns of a temperature matrix). Our `dim_names` metadata is
+  a subset of xarray's model; coordinate arrays could be a future extension.
 - **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that
   provides a unified interface for representing and manipulating NumPy array indices (slices,
   integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape
|
|

@@ -312,11 +363,14 @@

 - **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program.
   Each dimension can independently be specified as dense or sparse, and dimensions can be reordered.
-  The Vortex approach of storing tensors as flat contiguous memory with a permutation is one specific
-  point in TACO's format space (all dimensions dense, with a specific dimension ordering).
+  The Vortex approach of storing tensors as flat contiguous memory with a permutation is one
+  specific point in TACO's format space (all dimensions dense, with a specific dimension ordering).

 ## Unresolved Questions

+- Should `shape` store physical dimensions (matching Arrow) or logical dimensions (matching
+  NumPy/PyTorch)? See the [physical vs. logical shape](#physical-vs-logical-shape) discussion in
+  the stride section. The current RFC assumes physical shape, but this is not finalized.
 - Are two tensors with different permutations but the same logical values considered equal? This
   affects deduplication and comparisons. The type metadata might be different but the entire tensor
   value might be equal, so it seems strange to say that they are not actually equal?
|
|

@@ -333,8 +387,26 @@ like batched sequences of different lengths.

 #### Sparse tensors

-A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently
-represent tensors with many null or zero elements, as noted in the [validity](#validity) section.
+A sparse tensor type could use `List` or `ListView` as its storage type to efficiently represent
+tensors with many zero or absent elements.

+#### A unified `Tensor` type

+This RFC proposes `FixedShapeTensor` as a single, concrete extension type. However, tensors
+naturally vary along two axes: shape (fixed vs. variable) and density (dense vs. sparse). Both a
+variable-shape tensor (fixed dimensionality, variable shape per element) and a sparse tensor would
+need a different storage type, since each needs to efficiently skip over zero or null regions (and
+for both this would likely be `List` or `ListView`).

+Each combination would be its own extension type (`FixedShapeTensor`, `VariableShapeTensor`,
+`SparseFixedShapeTensor`, etc.), but this proliferates types and fragments any shared tensor
+logic. With the matching system on extension types, we could instead define a single unified
+`Tensor` type that covers all combinations, dispatching to the appropriate storage type and
+metadata based on the specific variant. This would be more complex to implement but would give
+users a single type to work with and a single place to define tensor operations.

+For now, `FixedShapeTensor` is the only variant we need. The others can be added incrementally
+as use cases arise.

 #### Tensor-specific encodings

> **Review comment:** Commenting here, but maybe it should go on the previous PR? IDK how Arrow
> does it, but I don't think that's necessarily true. Most vectorized compute just runs through
> null values that are zeroed out. IDK how you'd matmul the validity itself, but I think that's a
> reasonable thing.
>
> **Reply:** I think interpretation of NULLs is context dependent. If NULL means "there was no data
> observed at this position" and you're doing a weighted sum of the features, treating NULLs as
> zero is probably the right choice. The result is indeed the count of what you observed. You
> can't infer anything about things you did not observe.
>
> On the other hand, if NULL means "there is some data here but for technical reasons it was
> unrecoverable" and you're doing a linear regression, you probably want to replace NULL by a mean
> value over some dimension(s). I don't have a good linear regression example, but suppose you
> flip one hundred coins and record heads as 1 and tails as 0. Suppose further that you lose 10
> coins before observing them. If you compute the sum of this vector with NULLs as zeros, you'll
> conclude the coins are tails-biased! If you compute the sum of this vector with NULLs as the
> sample mean, you'll have an unbiased estimate of the coin's heads/tails probability.
>
> IMO, `matmul`, `sum`, etc. should only be defined on tensors with non-nullable elements. I
> suppose null elements are fine if they're representable in torch (I think they are not?). NumPy
> is able to represent them when you use the catch-all object dtype, but if you request primitive
> types it converts them to NaNs.