proposed/0024-tensor.md

- Start Date: 2026-03-04
- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000)

## Summary

We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes
the design of a fixed-shape tensor with contiguous backing memory.

> **Review comment:** Perhaps worth explicitly calling this FixedShapeTensor since it's not
> unreasonable to also want variable-shape tensors (but of fixed dimension). For example, in
> genetics, we often want to take the ~100M rows of genetic variants, collapse them into ~30K genes,
> and, for each gene, construct a matrix of genotypes and run a regression. Those matrices always
> have the same dimensionality (2) but their shape varies: the sample axis is always the same,
> N_SAMPLES, but the genetic variant axis depends on the size of that gene, which varies from a few
> hundred base pairs (SRY) to 30,000 base pairs (TITIN).
>
> In the future, I can imagine we'll have both `FixedSizeTensor<f32, (a, b, c)>` and
> `Tensor<f32, 3>` (names tbd).

> **Review comment (Contributor):** Agreed, I think we basically want to replicate both of Arrow's
> fixed and variable size tensors.

## Motivation

#### Tensors in the wild

Tensors are multi-dimensional (n-dimensional) arrays that generalize vectors (1D) and matrices (2D)
to arbitrary dimensions. They are quite common in ML/AI and scientific computing applications. To
name just a few examples:

- Image or video data stored as `height x width x channels`
- Multi-dimensional sensor or time-series data
- Embedding vectors from language models and recommendation systems

#### Tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the
`FixedSizeList` `DType`, and neither seems satisfactory.

> **Review comment:** Am I allowed to implement a SparseTensorArray whose dtype is Tensor but whose
> layout is not a FixedSizeList of the right size?

The simplest approach is to flatten the tensor into a single `FixedSizeList<n>` whose size is the
product of all dimensions (this is what Apache Arrow does). However, this discards shape information
entirely: a `2x3` matrix and a `3x2` matrix would both become `FixedSizeList<6>`. Shape metadata
must be stored separately, and any dimension-aware operation (slicing along an axis, transposing,
etc.) reduces to manual index arithmetic with no type-level guarantees.

The alternative is to nest `FixedSizeList` types, e.g., `FixedSizeList<FixedSizeList<n>, m>` for a
matrix. This preserves some structure, but becomes unwieldy for higher-dimensional tensors.
Axis-specific slicing or indexing on individual tensors (tensor scalars, not tensor arrays) would
require custom expressions aware of the specific nesting depth, rather than operating on a single,
uniform tensor type.

Additionally, reshaping requires restructuring the entire nested type, and operations like
transposes would be difficult to implement correctly.

Beyond these structural issues, neither approach stores shape and stride metadata explicitly, which
makes interoperability awkward with external tensor libraries (NumPy, PyTorch, etc.) that expect
contiguous memory accompanied by this metadata.

Thus, we propose a dedicated extension type that encapsulates tensor semantics (shape, strides,
dimension-aware operations) on top of contiguous, row-major (C-style) backing memory.

## Design

Since the design of extension types has not been fully solved yet (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), the complete design of tensors cannot be
fully described here. However, we know enough to present the general idea.

### Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the
extension array looks like when it is canonicalized. For tensors, we will want this storage type to
be a `FixedSizeList<p, s>`, where `p` is a numeric type (like `u8`, `f64`, or a decimal type), and
where `s` is the product of all dimensions of the tensor.

For example, if we want to represent a tensor of `i32` with dimensions `[2, 3, 4]`, the storage type
for this tensor would be `FixedSizeList<i32, 24>` since `2 x 3 x 4 = 24`.
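The storage-size computation above can be sketched with a small helper (illustrative only, not part
of the Vortex API):

```rust
/// Sketch: the flat storage size is the product of all dimensions.
/// The empty product is 1, so a 0D tensor stores exactly one element,
/// and any size-0 dimension yields a zero-element storage list.
fn storage_size(shape: &[usize]) -> usize {
    shape.iter().product()
}
```

For the `[2, 3, 4]` example, `storage_size(&[2, 3, 4])` is `24`; this also covers the edge cases
discussed later (`&[]` gives `1`, and `&[3, 0, 4]` gives `0`).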

This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion on why
we choose not to represent tensors as nested FSLs (for example
`FixedSizeList<FixedSizeList<FixedSizeList<i32, 2>, 3>, 4>`), see the [alternatives](#alternatives)
section.

### Element Type

We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense
numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size
elements whose offsets are computable from strides.

> **Review comment:** Why Decimal? That seems bizarre to me. Are there any fast implementations of
> matmul for arrays of Decimals?

> **Review comment:** Does PyTorch support decimal?

> **Review comment:** Honestly, support for fast matmul of fixed-point types was also pretty garbage
> last time I looked. Does anyone need fixed-point matrices?

> **Review comment (Contributor):** Yeah, if we're going to restrict it, let's just say Primitive
> for now.

Variable-size types (like strings) would break this model entirely. `Bool` is excluded as well
because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This
matches PyTorch, which also restricts tensors to numeric types.

Theoretically, we could allow more element types in the future, but it should remain a very low
priority.

### Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor
array), and individual elements within a tensor may be null. However, we do not support nulling out
entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).

> **Review comment:** Why allow the elements to be null? IMO, the main reason to use a Tensor type
> is so that you can define operations like matmul, and I worry that we can't efficiently implement
> matmul on a nullable type like `f32?`.

> **Review comment:** FWIW: I feel pretty strongly that we shouldn't support nullable elements of a
> tensor.

> **Review comment (Contributor):** It's always something we can relax later, so I'm in favor of
> restricting this now.

The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the
backing data (just like `FixedSizeList`). This keeps stride-based access straightforward while still
allowing sparse values within an otherwise dense tensor.

> **Review comment (Contributor):** I'm not sure what this sentence is saying? It sounds like
> tensors store additional validity on top of FSL. But actually we're just saying a tensor uses FSL
> as its storage type?

Note that this design is specifically for a dense tensor. A sparse tensor would likely need to have
a different representation (or different storage type) in order to compress better (likely `List` or
`ListView` since it can compress runs of nulls very well).

### Metadata

Theoretically, we only need the dimensions of the tensor to have a useful Tensor type. However, we
likely also want two other pieces of information: the dimension names and the permutation order.
This mimics the [Arrow Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
canonical extension type.

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):

```rust
/// Metadata for a [`Tensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
    /// The shape of the tensor.
    ///
    /// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or
    /// contain dimensions of size 0 (degenerate tensor).
    shape: Vec<usize>,

    /// Optional names for each dimension. Each name corresponds to a dimension in the `shape`.
    ///
    /// If names exist, there must be an equal number of names to dimensions.
    dim_names: Option<Vec<String>>,

    /// The permutation of the tensor's dimensions, mapping each logical dimension to its
    /// corresponding physical dimension: `permutation[logical] = physical`.
    ///
    /// If this is `None`, then the logical and physical layout are equal, and the permutation is
    /// in-order `[0, 1, ..., N-1]`.
    permutation: Option<Vec<usize>>,
}
```

> **Review comment (Contributor):** `Vec<Option<String>>`? Not sure...

> **Review comment (Author):** We want to do it this way since this is what Arrow has, and I also
> personally do not want to deal with some dimensions being named and others not named.
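The invariants documented on these fields could be checked along the following lines (a sketch over
bare slices; `validate_metadata` is a hypothetical helper, not a Vortex API):

```rust
/// Sketch: check the invariants described in the metadata field docs.
/// This free function and its signature are illustrative only.
fn validate_metadata(
    shape: &[usize],
    dim_names: Option<&[String]>,
    permutation: Option<&[usize]>,
) -> Result<(), String> {
    // If names exist, there must be exactly one per dimension.
    if let Some(names) = dim_names {
        if names.len() != shape.len() {
            return Err(format!(
                "expected {} dimension names, got {}",
                shape.len(),
                names.len()
            ));
        }
    }
    // A permutation must be a bijection over 0..shape.len().
    if let Some(perm) = permutation {
        if perm.len() != shape.len() {
            return Err("permutation length must match shape length".to_string());
        }
        let mut seen = vec![false; perm.len()];
        for &p in perm {
            if p >= perm.len() || seen[p] {
                return Err("permutation must be a bijection over 0..N".to_string());
            }
            seen[p] = true;
        }
    }
    Ok(())
}
```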

#### Stride

The stride of a tensor defines the number of elements to skip in memory to move one step along each
dimension. Rather than storing strides explicitly as metadata, we can efficiently derive them from
the shape and permutation. This is possible because the backing memory is always contiguous.

For a row-major tensor with shape `d = [d_0, d_1, ..., d_{n-1}]` and no permutation, the strides
are:

```
stride[n-1] = 1                        (the innermost dimension always has stride 1)
stride[i]   = d[i+1] * stride[i+1]
            = d[i+1] * d[i+2] * ... * d[n-1]
```

For example, a tensor with shape `[2, 3, 4]` and no permutation has strides `[12, 4, 1]`: moving one
step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1.
The element at index `[i, j, k]` is located at memory offset `12*i + 4*j + k`.

When a permutation is present, the logical strides are simply the row-major strides permuted
accordingly. Continuing the `[2, 3, 4]` example with row-major strides `[12, 4, 1]`, applying the
permutation `[2, 0, 1]` yields logical strides `[1, 12, 4]`. This reorders which dimensions are
contiguous in memory without copying any data.
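The derivation above can be sketched as follows, using `permutation[logical] = physical` as defined
in the metadata (hypothetical helpers, not the Vortex API):

```rust
/// Sketch: row-major strides for a contiguous tensor with the given shape.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    // stride[i] = d[i+1] * stride[i+1], walking from the innermost dimension out.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    strides
}

/// Sketch: logical strides are the row-major strides permuted by
/// `permutation[logical] = physical`.
fn logical_strides(shape: &[usize], permutation: Option<&[usize]>) -> Vec<usize> {
    let physical = row_major_strides(shape);
    match permutation {
        None => physical,
        Some(perm) => perm.iter().map(|&p| physical[p]).collect(),
    }
}

/// Sketch: the memory offset of an index given the matching strides.
fn element_offset(strides: &[usize], index: &[usize]) -> usize {
    strides.iter().zip(index).map(|(s, i)| s * i).sum()
}
```

For shape `[2, 3, 4]` this reproduces the row-major strides `[12, 4, 1]` and, under the permutation
`[2, 0, 1]`, the logical strides `[1, 12, 4]` from the text; a 0D tensor (empty shape) yields empty
strides and offset 0.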

### Conversions

#### Arrow

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical
extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension).
The `FixedSizeList` backing memory, shape, dimension names, and permutation all map directly between
the two representations.

#### NumPy and PyTorch

Libraries like NumPy and PyTorch store strides as an independent, first-class field on their tensor
objects. This allows them to represent non-contiguous views of memory.

For example, slicing every other row of a matrix produces a view with a doubled row stride, sharing
memory with the original without copying. However, this means that non-contiguous tensors can appear
anywhere, and kernels must handle arbitrary stride patterns. In PyTorch, many operations require
non-contiguous inputs to be made contiguous (via `.contiguous()`, which copies) before proceeding.

Since Vortex tensors are always contiguous, we can always zero-copy _to_ NumPy and PyTorch since
both libraries can construct a view from a pointer, shape, and strides. Going the other direction,
we can only zero-copy _from_ NumPy/PyTorch tensors that are already contiguous.

Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the
Python libraries. Rather than mutating strides to create non-contiguous views, operations like
slicing, indexing, and `.contiguous()` (after a permutation) would be expressed as lazy
`Expression`s over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce
a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where
compute is deferred and composed rather than eagerly applied.

The exact mechanism for defining expressions over extension types is still being designed (see
[RFC #0005](https://github.com/vortex-data/rfcs/pull/5)), but the intent is that tensor-specific
operations like axis slicing, indexing, and reshaping would be custom expressions registered for the
tensor extension type.

### Edge Cases: 0D and Size-0 Dimensions

We will support two edge cases that arise naturally from the tensor model. Recall that the number of
elements in a tensor is the product of its shape dimensions, and that the
[empty product](https://en.wikipedia.org/wiki/Empty_product) is 1 (the multiplicative identity).

#### 0-dimensional tensors

0D tensors have an empty shape `[]` and contain exactly one element (since the product of no
dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is
`FixedSizeList<p, 1>` (which is identical to a flat `PrimitiveArray` or `DecimalArray`).

#### Size-0 dimensions

Shapes may contain dimensions of size 0 (e.g., `[3, 0, 4]`), which produce tensors with zero
elements (since the product includes a 0 factor). The storage type is a degenerate
`FixedSizeList<p, 0>`, which Vortex already handles well.

#### Compatibility

Both NumPy and PyTorch support these cases. NumPy fully supports 0D arrays with shape `()`, and
dimensions of size 0 are valid (e.g., `np.zeros((3, 0, 4))`). PyTorch supports 0D tensors since
v0.4.0 and also allows size-0 dimensions.

Arrow's Fixed Shape Tensor spec, however, requires at least one dimension (`ndim >= 1`), so 0D
tensors would need special handling during Arrow conversion (we would likely just panic).

### Compression

Since the storage type is `FixedSizeList` over numeric types, Vortex's existing encodings (like ALP,
FastLanes, etc.) will be applied to the flattened primitive buffer transparently.

However, there may be tensor-specific compression opportunities we could take advantage of. We will
leave this as an open question.

### Scalar Representation

Once we add the `ScalarValue::Array` variant (see tracking issue
[vortex#6771](https://github.com/vortex-data/vortex/issues/6771)), we can easily pass around tensors
as `ArrayRef` scalars as well as lazily computed slices.

The `ExtVTable` also requires specifying an associated `NativeValue<'a>` Rust type that an extension
scalar can be unpacked into. We will want a `NativeTensor<'a>` type that references the backing
memory of the Tensor, and we can add useful operations to that type.

## Compatibility

Since this is a new type built on an existing canonical type (`FixedSizeList`), there should be no
compatibility concerns.

## Drawbacks

- **Fixed shape only**: This design only supports tensors where every element in the array has the
same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different
type entirely.
- **Yet another crate**: We will likely implement this in a `vortex-tensor` crate, which means even
more surface area than we already have.

## Alternatives

### Nested `FixedSizeList`

Rather than a flat `FixedSizeList` with metadata, we could represent tensors as nested
`FixedSizeList` types (e.g., `FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2>` for a
`[2, 3, 4]` tensor). This has several disadvantages:

- Each nesting level introduces its own validity bitmap, even though sub-dimensional nullability is
not meaningful for tensors. This wastes space and complicates null-handling logic.
- This does not match Arrow's canonical Fixed Shape Tensor type, making zero-copy conversion
impossible.
- Expressions would need to be aware of the nesting depth, and operations like transpose or reshape
would require restructuring the type itself rather than updating metadata.

### Do nothing

Users could continue to use `FixedSizeList` directly with out-of-band shape metadata. This works
for simple storage, but as discussed in the [motivation](#motivation), it provides no type-level
support for tensor operations and makes interoperability with tensor libraries awkward.

## Prior Art

_Note: This section was Claude-researched._

### Columnar formats

- **Apache Arrow** defines a
[Fixed Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)
canonical extension type. Our design closely follows Arrow's approach: a flat `FixedSizeList`
storage type with shape, dimension names, and permutation metadata. Arrow also defines a
[Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
extension type for ragged tensors, which could inform future work.
- **Lance** delegates entirely to Arrow's type system, including extension types. Arrow extension
metadata (and therefore tensor metadata) is preserved end-to-end through Lance's storage layer,
which validates the approach of building tensor semantics as an extension on top of `FixedSizeList`
storage.
- **Parquet** has no native `FixedSizeList` logical type. Arrow's `FixedSizeList` is stored as a
regular `LIST` in Parquet, which adds conversion overhead via repetition levels. There is active
discussion about introducing `FixedSizeList` as a Parquet logical type, partly motivated by
tensor and embedding workloads.

### Database systems

- **DuckDB** has a native `ARRAY` type (fixed-size list) but no dedicated tensor type. Community
discussions have proposed adding one, noting that nested `ARRAY` types can simulate
multi-dimensional arrays but lack tensor-specific operations.
- **DataFusion** uses Arrow's type system directly and has no dedicated tensor type. There is open
discussion about a logical type layer that could support extension types as first-class citizens.

### Tensor libraries

- **NumPy** and **PyTorch** both represent tensors as contiguous (or non-contiguous) memory with
shape and stride metadata. Our design is a subset of this model — we always require contiguous
memory and derive strides from shape and permutation, as discussed in the
[conversions](#conversions) section.
- **[ndindex](https://quansight-labs.github.io/ndindex/index.html)** is a Python library that
provides a unified interface for representing and manipulating NumPy array indices (slices,
integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape
inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions
in Vortex that are similar to the operations ndindex provides — for example, computing the result
shape of a slice or translating a logical index into a physical offset.
> **Review comment:** Also worth noting xarray. That was where I first encountered the idea of named
> dimensions. It also has a notion of "coordinates", which are "marginal" arrays. For example, you
> might have a matrix of temperature values on the surface of the earth. The rows and columns of
> that matrix could have coordinate values that indicate the latitudes and longitudes associated
> with the rows and columns.


### Academic work

- **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program.
Each dimension can independently be specified as dense or sparse, and dimensions can be reordered.
The Vortex approach of storing tensors as flat contiguous memory with a permutation is one specific
point in TACO's format space (all dimensions dense, with a specific dimension ordering).

> **Review comment:** TACO is really great work! I guess I think of it more as a system for
> generating fast matmul kernels given the physical layout of two arrays. See
> http://tensor-compiler.org/publications.html.

> **Review comment (Contributor):** Yeah, could be interesting to implement a tensor array that
> uses these sparse layouts though.

## Unresolved Questions

- Are two tensors with different permutations but the same logical values considered equal? This
affects deduplication and comparisons. The type metadata differs, but the logical tensor values are
identical, so treating such tensors as unequal seems strange.
- Are there potential tensor-specific compression schemes we can take advantage of?

## Future Possibilities

#### Variable-shape tensors

Arrow defines a
[Variable Shape Tensor](https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor)
extension type for arrays where each tensor can have a different shape. This would enable workloads
like batched sequences of different lengths.

#### Sparse tensors

A similar Sparse Tensor type could use `List` or `ListView` as its storage type to efficiently
represent tensors with many null or zero elements, as noted in the [validity](#validity) section.

#### Tensor-specific encodings

Beyond general-purpose compression, encodings tailored to tensor data (e.g., exploiting spatial
locality across dimensions) could improve compression ratios for specific workloads.

#### ndindex-style compute expressions

As the extension type expression system matures, we can implement a rich set of tensor indexing and
slicing operations inspired by [ndindex](https://quansight-labs.github.io/ndindex/index.html),
including slice canonicalization, shape inference, and chunk-level re-indexing.