# Vortex Tensor (#24)
- Start Date: 2026-03-04
- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000)

## Summary

We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes
the design of a fixed-shape tensor with contiguous backing memory.

## Motivation

#### Tensors in the wild

Tensors are multi-dimensional (n-dimensional) arrays that generalize vectors (1D) and matrices (2D)
to arbitrary dimensions. They are quite common in ML/AI and scientific computing applications. To
name just a few examples:

- Image or video data stored as `height x width x channels`
- Multi-dimensional sensor or time-series data
- Embedding vectors from language models and recommendation systems

#### Tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the
`FixedSizeList` `DType`, and neither seems satisfactory.
The simplest approach is to flatten the tensor into a single `FixedSizeList<n>` whose size is the
product of all dimensions (this is what Apache Arrow does). However, this discards shape information
entirely: a `2x3` matrix and a `3x2` matrix would both become `FixedSizeList<6>`. Shape metadata
must be stored separately, and any dimension-aware operation (slicing along an axis, transposing,
etc.) reduces to manual index arithmetic with no type-level guarantees.

The alternative is to nest `FixedSizeList` types, e.g., `FixedSizeList<FixedSizeList<n>, m>` for a
matrix. This preserves some structure, but becomes unwieldy for higher-dimensional tensors.
Axis-specific slicing or indexing on individual tensors (tensor scalars, not tensor arrays) would
require custom expressions aware of the specific nesting depth, rather than operating on a single,
uniform tensor type.

Additionally, reshaping requires restructuring the entire nested type, and operations like
transpose would be difficult to implement correctly.

Beyond these structural issues, neither approach stores shape and stride metadata explicitly, which
makes interoperability awkward with external tensor libraries (NumPy, PyTorch, etc.) that expect
contiguous memory accompanied by this metadata.

Thus, we propose a dedicated extension type that encapsulates tensor semantics (shape, strides,
dimension-aware operations) on top of contiguous, row-major (C-style) backing memory.
## Design

Since the design of extension types has not been fully settled yet (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), the complete design of tensors cannot be
fully specified here. However, we know enough to present the general idea.

### Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the
extension array looks like when it is canonicalized. For tensors, we want this storage type to be a
`FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type), and `s`
is the product of all dimensions of the tensor.

For example, to represent a tensor of `i32` with dimensions `[2, 3, 4]`, the storage type would be
`FixedSizeList<i32, 24>`, since `2 x 3 x 4 = 24`.

This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion of why
we choose not to represent tensors as nested FSLs (for example,
`FixedSizeList<FixedSizeList<FixedSizeList<i32, 2>, 3>, 4>`), see the [alternatives](#alternatives)
section.
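As a minimal sketch of the size derivation described above (the helper name `storage_len` is illustrative, not the actual Vortex API), the storage width `s` is just the product of the shape, which also covers the edge cases discussed later in this RFC:

```rust
/// Number of elements in the flat `FixedSizeList` storage for a given tensor
/// shape. Hypothetical helper; the name is illustrative, not a Vortex API.
fn storage_len(shape: &[usize]) -> usize {
    // The empty product is 1, so a 0D tensor stores exactly one element;
    // any size-0 dimension collapses the product to 0.
    shape.iter().product()
}

fn main() {
    assert_eq!(storage_len(&[2, 3, 4]), 24); // FixedSizeList<i32, 24>
    assert_eq!(storage_len(&[]), 1);         // 0D scalar tensor
    assert_eq!(storage_len(&[3, 0, 4]), 0);  // degenerate tensor
}
```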
### Element Type

We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense
numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
elements whose offsets are computable from strides.

> **Review comments:**
> - Why Decimal? That seems bizarre to me. Are there any fast implementations of matmul for arrays
>   of Decimals?
> - Does PyTorch support decimal?
> - Honestly, support for fast matmul of fixed-point types was also pretty garbage last time I
>   looked. Does anyone need fixed-point matrices?
> - Yeah, if we're going to restrict it, let's just say Primitive for now.

Variable-size types (like strings) would break this model entirely. `Bool` is excluded as well
because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This
matches PyTorch, which also restricts tensors to numeric types.

Theoretically, we could allow more element types in the future, but it should remain a very low
priority.
### Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor
array), and individual elements within a tensor may be null. However, we do not support nulling out
entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).

> **Review comments:**
> - FWIW: I feel pretty strongly that we shouldn't support nullable elements of a tensor.
> - It's always something we can relax later, so I'm in favor of restricting this now.

The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the
backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still
allowing sparse values within an otherwise dense tensor.

> **Review comment:** I'm not sure what this sentence is saying? It sounds like tensors store
> additional validity on top of FSL. But actually we're just saying a tensor uses FSL as its
> storage type?

Note that this design is specifically for a dense tensor. A sparse tensor would likely need a
different representation (or a different storage type) in order to compress better (likely `List`
or `ListView`, since those can compress runs of nulls very well).
### Metadata

Theoretically, we only need the dimensions of the tensor to have a useful Tensor type. However, we
likely also want two other pieces of information, the dimension names and the permutation order,
mirroring the [Arrow Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
type (which is a canonical extension type).

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):

```rust
/// Metadata for a [`Tensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
    /// The shape of the tensor.
    ///
    /// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or
    /// contain dimensions of size 0 (degenerate tensor).
    shape: Vec<usize>,

    /// Optional names for each dimension. Each name corresponds to a dimension in the `shape`.
    ///
    /// If names exist, there must be an equal number of names to dimensions.
    dim_names: Option<Vec<String>>,

    /// The permutation of the tensor's dimensions, mapping each logical dimension to its
    /// corresponding physical dimension: `permutation[logical] = physical`.
    ///
    /// If this is `None`, then the logical and physical layout are equal, and the permutation is
    /// in-order `[0, 1, ..., N-1]`.
    permutation: Option<Vec<usize>>,
}
```

> **Review comment (author, on `dim_names`):** We want to do it this way since this is what Arrow
> has, and I personally do not want to deal with some dimensions being named and others not named.
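The doc comments above imply two invariants: names, if present, match the number of dimensions, and the permutation, if present, is a true permutation of `0..N`. A minimal sketch of such a check (the `validate` helper is hypothetical, not the actual Vortex API):

```rust
/// Illustrative invariant check for `TensorMetadata`-style fields; the
/// function name and signature are hypothetical, not the Vortex API.
fn validate(shape: &[usize], dim_names: Option<&[String]>, permutation: Option<&[usize]>) -> bool {
    // If names exist, there must be exactly one per dimension.
    if let Some(names) = dim_names {
        if names.len() != shape.len() {
            return false;
        }
    }
    // If present, the permutation maps logical -> physical dimensions, so it
    // must contain each index in 0..shape.len() exactly once.
    if let Some(perm) = permutation {
        if perm.len() != shape.len() {
            return false;
        }
        let mut seen = vec![false; perm.len()];
        for &p in perm {
            if p >= seen.len() || seen[p] {
                return false;
            }
            seen[p] = true;
        }
    }
    true
}

fn main() {
    assert!(validate(&[2, 3, 4], None, Some(&[2, 0, 1])));
    assert!(!validate(&[2, 3, 4], None, Some(&[0, 0, 1]))); // repeated index
    assert!(!validate(&[2, 3, 4], None, Some(&[0, 1])));    // wrong length
}
```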
#### Stride

The stride of a tensor defines the number of elements to skip in memory to move one step along each
dimension. Rather than storing strides explicitly as metadata, we can derive them efficiently from
the shape and permutation. This is possible because the backing memory is always contiguous.

For a row-major tensor with shape `d = [d_0, d_1, ..., d_{n-1}]` and no permutation, the strides
are:

```
stride[n-1] = 1                      (innermost dimension always has stride 1)
stride[i]   = d[i+1] * stride[i+1]
            = d[i+1] * d[i+2] * ... * d[n-1]
```

For example, a tensor with shape `[2, 3, 4]` and no permutation has strides `[12, 4, 1]`: moving one
step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1.
The element at index `[i, j, k]` is located at memory offset `12*i + 4*j + k`.

When a permutation is present, the logical strides are simply the row-major strides permuted
accordingly. Continuing the `[2, 3, 4]` example with row-major strides `[12, 4, 1]`, applying the
permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`. This reorders which dimensions are
contiguous in memory without copying any data.
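The derivation above can be sketched in a few lines of Rust (the `strides` helper is illustrative, not the actual Vortex API):

```rust
/// Row-major strides for a contiguous tensor, optionally permuted.
/// Illustrative sketch of the derivation in this section; not the Vortex API.
fn strides(shape: &[usize], permutation: Option<&[usize]>) -> Vec<usize> {
    // stride[n-1] = 1; stride[i] = shape[i+1] * stride[i+1]
    let mut s = vec![1usize; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        s[i] = shape[i + 1] * s[i + 1];
    }
    // `permutation[logical] = physical`: logical strides are the row-major
    // strides permuted accordingly.
    match permutation {
        Some(perm) => perm.iter().map(|&p| s[p]).collect(),
        None => s,
    }
}

fn main() {
    assert_eq!(strides(&[2, 3, 4], None), vec![12, 4, 1]);
    assert_eq!(strides(&[2, 3, 4], Some(&[2, 0, 1])), vec![1, 12, 4]);
    // Element [i, j, k] of the unpermuted tensor lives at offset 12*i + 4*j + k.
    let (i, j, k) = (1, 2, 3);
    assert_eq!(12 * i + 4 * j + k, 23);
}
```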
### Conversions

#### Arrow

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical
extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between
the two representations.

#### NumPy and PyTorch

Libraries like NumPy and PyTorch store strides as an independent, first-class field on their tensor
objects. This allows them to represent non-contiguous views of memory.

For example, slicing every other row of a matrix produces a view with a doubled row stride, sharing
memory with the original without copying. However, this means that non-contiguous tensors can appear
anywhere, and kernels must handle arbitrary stride patterns. In practice, many PyTorch operations
require calling `.contiguous()` before proceeding.

Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch, since
both libraries can construct a view from a pointer, shape, and strides. Going the other direction,
we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.
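A sketch of the import-side contiguity test this implies (the helper name and the elements-not-bytes stride convention are assumptions for illustration, not the Vortex API):

```rust
/// Whether an external (NumPy/PyTorch-style) tensor with explicit strides is
/// row-major contiguous, and therefore zero-copy importable under this design.
/// Illustrative sketch; strides here are in elements, not bytes.
fn is_row_major_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    if shape.contains(&0) {
        return true; // zero-element tensors are trivially contiguous
    }
    let mut expected = 1usize;
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        // Size-1 dimensions place no constraint on their stride.
        if dim > 1 && stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}

fn main() {
    assert!(is_row_major_contiguous(&[2, 3, 4], &[12, 4, 1]));
    // Every-other-row view: doubled row stride, so a copy is needed on import.
    assert!(!is_row_major_contiguous(&[2, 4], &[8, 1]));
}
```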
Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the
Python libraries. Rather than mutating strides to create non-contiguous views, operations like
slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
`Expression`s over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce
a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where
compute is deferred and composed rather than eagerly applied.

The exact mechanism for defining expressions over extension types is still being designed (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), but the intent is that tensor-specific
operations like axis slicing, indexing, and reshaping would be custom expressions registered for the
tensor extension type.

### Edge Cases: 0D and Size-0 Dimensions

We will support two edge cases that arise naturally from the tensor model. Recall that the number of
elements in a tensor is the product of its shape dimensions, and that the
[empty product](https://en.wikipedia.org/wiki/Empty_product) is 1 (the multiplicative identity).
#### 0-dimensional tensors

0D tensors have an empty shape `[]` and contain exactly one element (since the product of no
dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is
`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).

#### Size-0 dimensions

Shapes may contain dimensions of size 0 (e.g., `[3, 0, 4]`), which produce tensors with zero
elements (since the product includes a 0 factor). The storage type is a degenerate
`FixedSizeList<p, 0>`, which Vortex already handles well.

#### Compatibility

Both NumPy and PyTorch support these cases. NumPy fully supports 0D arrays with shape `()`, and
dimensions of size 0 are valid (e.g., `np.zeros((3, 0, 4))`). PyTorch supports 0D tensors since
v0.4.0 and also allows size-0 dimensions.

Arrow's Fixed Shape Tensor spec, however, requires at least one dimension (`ndim >= 1`), so 0D
tensors would need special handling during Arrow conversion (we would likely just panic).

### Compression

Since the storage type is `FixedSizeList` over numeric types, Vortex's existing encodings (like ALP,
FastLanes, etc.) will be applied to the flattened primitive buffer transparently.

However, there may be tensor-specific compression opportunities we could take advantage of. We will
leave this as an open question.

### Scalar Representation

Once we add the `ScalarValue::Array` variant (see tracking issue
[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors
as `ArrayRef` scalars as well as lazily computed slices.

The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension
scalar can be unpacked into. We will want a `NativeTensor<'a>` type that references the backing
memory of the Tensor, and we can add useful operations to that type.

## Compatibility

Since this is a new type built on an existing canonical type (`FixedSizeList`), there should be no
compatibility concerns.
## Drawbacks

- **Fixed shape only**: This design only supports tensors where every element in the array has the
  same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different
  type entirely.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even
  more surface area than we already have.

## Alternatives

### Nested `FixedSizeList`

Rather than a flat `FixedSizeList` with metadata, we could represent tensors as nested
`FixedSizeList` types (e.g., `FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2>` for a
`[2, 3, 4]` tensor). This has several disadvantages:

- Each nesting level introduces its own validity bitmap, even though sub-dimensional nullability is
  not meaningful for tensors. This wastes space and complicates null-handling logic.
- This does not match Arrow's canonical Fixed Shape Tensor type, making zero-copy conversion
  impossible.
- Expressions would need to be aware of the nesting depth, and operations like transpose or reshape
  would require restructuring the type itself rather than updating metadata.

### Do nothing

Users could continue to use `FixedSizeList` directly with out-of-band shape metadata. This works
for simple storage, but as discussed in the [motivation](#motivation), it provides no type-level
support for tensor operations and makes interoperability with tensor libraries awkward.
## Prior Art

_Note: This section was Claude-researched._

### Columnar formats

- **Apache Arrow** defines a
  [Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
  canonical extension type. Our design closely follows Arrow's approach: a flat `FixedSizeList`
  storage type with shape, dimension names, and permutation metadata. Arrow also defines a
  [Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
  extension type for ragged tensors, which could inform future work.
- **Lance** delegates entirely to Arrow's type system, including extension types. Arrow extension
  metadata (and therefore tensor metadata) is preserved end-to-end through Lance's storage layer,
  which validates the approach of building tensor semantics as an extension on top of
  `FixedSizeList` storage.
- **Parquet** has no native `FixedSizeList` logical type. Arrow's `FixedSizeList` is stored as a
  regular `LIST` in Parquet, which adds conversion overhead via repetition levels. There is active
  discussion about introducing `FixedSizeList` as a Parquet logical type, partly motivated by
  tensor and embedding workloads.

### Database systems

- **DuckDB** has a native `ARRAY` type (fixed-size list) but no dedicated tensor type. Community
  discussions have proposed adding one, noting that nested `ARRAY` types can simulate
  multi-dimensional arrays but lack tensor-specific operations.
- **DataFusion** uses Arrow's type system directly and has no dedicated tensor type. There is open
  discussion about a logical type layer that could support extension types as first-class citizens.

### Tensor libraries

- **NumPy** and **PyTorch** both represent tensors as contiguous (or non-contiguous) memory with
  shape and stride metadata. Our design is a subset of this model — we always require contiguous
  memory and derive strides from shape and permutation, as discussed in the
  [conversions](#conversions) section.
- **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that
  provides a unified interface for representing and manipulating NumPy array indices (slices,
  integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape
  inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions
  in Vortex that are similar to the operations ndindex provides — for example, computing the result
  shape of a slice or translating a logical index into a physical offset.

> **Review comment:** Also worth noting xarray. That was where I first encountered the idea of
> named dimensions. It also has a notion of "coordinates", which are "marginal" arrays. For
> example, you might have a matrix of temperature values on the surface of the earth. The rows and
> columns of that matrix could have coordinate values that indicate the latitudes and longitudes
> associated with the rows and columns.

### Academic work

- **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program.
  Each dimension can independently be specified as dense or sparse, and dimensions can be reordered.
  The Vortex approach of storing tensors as flat contiguous memory with a permutation is one
  specific point in TACO's format space (all dimensions dense, with a specific dimension ordering).

> **Review comments:**
> - TACO is really great work! I guess I think of it more as a system for generating fast matmul
>   kernels given the physical layout of two arrays.
> - Yeah, could be interesting to implement a tensor array that uses these sparse layouts, though.
## Unresolved Questions

- Are two tensors with different permutations but the same logical values considered equal? This
  affects deduplication and comparisons. The type metadata might differ while the tensor values are
  logically equal, so it seems strange to say they are not equal.
- Are there potential tensor-specific compression schemes we can take advantage of?
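To make the first question concrete, a sketch of what "logical" equality would mean for two 2D tensors whose backing buffers use different physical layouts (all names here are illustrative; this is not a proposed API, just the comparison the question is asking about):

```rust
/// Compare two 2D tensors element-wise by logical index, ignoring physical
/// layout. Illustrative only; strides are in elements.
fn logically_equal_2d(
    a: &[i32], a_strides: (usize, usize),
    b: &[i32], b_strides: (usize, usize),
    shape: (usize, usize),
) -> bool {
    for i in 0..shape.0 {
        for j in 0..shape.1 {
            if a[i * a_strides.0 + j * a_strides.1] != b[i * b_strides.0 + j * b_strides.1] {
                return false;
            }
        }
    }
    true
}

fn main() {
    // The same 2x3 matrix stored row-major and column-major: different
    // metadata and buffers, but (arguably) the same logical value.
    let row_major = [1, 2, 3, 4, 5, 6]; // strides (3, 1)
    let col_major = [1, 4, 2, 5, 3, 6]; // strides (1, 2)
    assert!(logically_equal_2d(&row_major, (3, 1), &col_major, (1, 2), (2, 3)));
}
```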
## Future Possibilities

#### Variable-shape tensors

Arrow defines a
[Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
extension type for arrays where each tensor can have a different shape. This would enable workloads
like batched sequences of different lengths.

#### Sparse tensors

A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently
represent tensors with many null or zero elements, as noted in the [validity](#validity) section.

#### Tensor-specific encodings

Beyond general-purpose compression, encodings tailored to tensor data (e.g., exploiting spatial
locality across dimensions) could improve compression ratios for specific workloads.

#### ndindex-style compute expressions

As the extension type expression system matures, we can implement a rich set of tensor indexing and
slicing operations inspired by [ndindex](https://quansight-labs.github.io/ndindex/index.html),
including slice canonicalization, shape inference, and chunk-level re-indexing.

> **Review comments:**
> - Perhaps worth explicitly calling this `FixedShapeTensor`, since it's not unreasonable to also
>   want variable-shape tensors (but of fixed dimension). For example, in genetics, we often want
>   to take the ~100M rows of genetic variants and collapse them into ~30K genes and, for each
>   gene, construct a matrix of genotypes and run a regression. Those matrices always have the same
>   dimensionality (2) but their shape varies (the sample axis is always the same, N_SAMPLES, but
>   the genetic variant axis depends on the size of that gene, which varies from a few hundred base
>   pairs (SRY) to 30,000 base pairs (TITIN)). In the future, I can imagine we'll have both
>   `FixedSizeTensor<f32, (a, b, c)>` and `Tensor<f32, 3>` (names TBD).
> - Agreed, I think we basically want to replicate both of Arrow's fixed and variable size tensors.