Variant RFC #15

Merged
gatesn merged 7 commits into develop from adamg/variant on Mar 4, 2026

AdamGS (Collaborator) commented Feb 25, 2026

RFC for a logical Variant type.


In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful:

1. A set of new expressions, which extract children of variant arrays with a combination of path (similarly to `GetExpr`) and a dtype.

Contributor

Maybe also worth mentioning expressions that convert to/from other variant-like data, e.g. JSON as a DType::Utf8 can be parsed into a DType::Variant.

I wonder if our JSON extension type has storage type DType::Utf8, or storage type DType::Variant?

Collaborator Author

In my mind the JSON type is “string verified as JSON”, like a PG column.
So far my impression is that there’s no consistent naming, and any choice we make will end up conflicting with something.

gatesn (Contributor), Feb 26, 2026

So we would just also implement the variant expressions over a JSON extension type array?


Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema. In addition to this "catch all" column, most systems include the concept of "shredding": extracting a key with a specific type out of this column and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible.

I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type.

Contributor

How should we do execute_arrow for these? Using the Parquet Variant, or a union?

AdamGS (Collaborator, Author), Feb 26, 2026

Added some thoughts on this point. It might require some pretty big changes to, or an extension of, our Arrow exporting logic.

AdamGS changed the title from "WIP: Variant RFC" to "Variant RFC" on Feb 26, 2026
gatesn (Contributor) commented Feb 26, 2026

I think this all makes sense, but I think it should explicitly call out what changes you want to make to DType enum / Scalar enum / Canonical enum / etc.

connortsui20 (Contributor) left a comment

For something as complicated as the design space of variant, I think it would be worth putting together a few diagrams (could literally just be some text trees) that show the different kinds of variant and shredding designs as well as some concrete examples.

I'm happy to add this myself as well!

Vortex currently requires a strict schema, but real-world data is often only semi-structured and deeply hierarchical. Logs, traces and user-generated data often take the form of many sparse fields.

This proposal introduces a new dtype - `Variant`, which can capture data with a row-level schema while storing it in a columnar form that compresses well and remains available for efficient analysis.

Contributor

I know this is being pedantic but I think it would be good to have a motivation section here. Do we care about supporting all possible variant types, or just JSON, for example? And it might be good to say that we want this because other formats support this and people like that

When you say all types what do you mean?


Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema.

Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
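
To make the shredding idea concrete, here is a minimal sketch. It is not Vortex code: `Value`, `Row`, `Shredded` and `shred_str` are hypothetical names invented for illustration, nesting is omitted, and the shredding policy (one string-typed key chosen by the caller) is an assumption. Rows where the key holds a different type keep that value in the residual, so nothing is lost.

```rust
use std::collections::BTreeMap;

/// Toy semi-structured value; a stand-in for a real variant value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

/// One row: a flat map from key to value (nesting omitted for brevity).
type Row = BTreeMap<String, Value>;

/// Result of shredding one string-typed key out of a batch of rows.
#[derive(Debug, PartialEq)]
struct Shredded {
    /// Dense, typed column: `Some` only where the key held a string.
    typed: Vec<Option<String>>,
    /// Everything else, including values of the key that had another type.
    residual: Vec<Row>,
}

/// Hypothetical helper: pull `key` out of every row as a string column.
fn shred_str(rows: &[Row], key: &str) -> Shredded {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        let mut rest = row.clone();
        match rest.remove(key) {
            Some(Value::Str(s)) => typed.push(Some(s)),
            Some(other) => {
                // Type mismatch: leave the value in the residual untouched.
                rest.insert(key.to_string(), other);
                typed.push(None);
            }
            None => typed.push(None),
        }
        residual.push(rest);
    }
    Shredded { typed, residual }
}
```

Reading the shredded key then only touches `typed`, while any other key still has to go through `residual`, which is the trade-off the paragraph above describes.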

Contributor

I think "accessed" is the wrong word here? You could motivate this by giving an example of json data where a majority has the same type (string) but sometimes there happens to be a different type (int)

There is also the case where you have an int and a float? Or even an i8 and a u64, which is of course the same for JSON but not for a Vortex scalar?

Collaborator Author

JSON->Variant conversion is a complex topic that is out of scope here IMO. JSON tiles deals with it, other systems behave differently, and I don't want to make an early decision that locks us out.


```rust
enum Variant {
    Value(Scalar),
    // ... (the remaining variants are not shown in this diff excerpt)
}
```

This is a vortex scalar?

Collaborator Author

I think this is more of an "actual type"


We'll start with a rough description of the variant type, as many different systems define it in different ways (see the [Prior Art](#prior-art) section at the bottom of the page).

The variant type can be commonly described as the following Rust type:

Are you proposing that the variant type is opaque or visible?

Collaborator Author

the array is opaque


I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD) which will support flexible paths and allow extracting children from variants. I use the `path` argument in this document loosely, but a subset of JSONPath might be appropriate here, see the [prior art](#prior-art) section to see how other systems handle it.

Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type.
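
As a rough illustration of what such an expression might do, the sketch below walks a path through toy variant values and keeps only results of the requested type. Everything here is an assumption made for the example: `TargetType`, `PathStep`, `resolve`, the shape of the `Variant` enum, and the choice to turn missing paths and type mismatches into nulls. The RFC leaves the path syntax and the name open, and a real implementation would return a typed array rather than a `Vec`.

```rust
use std::collections::BTreeMap;

/// Hypothetical target types for extraction (not Vortex's `DType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TargetType {
    Int64,
    Utf8,
}

/// Toy recursive variant value, assumed for this sketch only.
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Null,
    Int(i64),
    Str(String),
    List(Vec<Variant>),
    Object(BTreeMap<String, Variant>),
}

/// One step of a path: an object key or a list index.
#[derive(Debug, Clone, PartialEq)]
enum PathStep {
    Key(String),
    Index(usize),
}

/// Walk `path` through one value; `None` if any step does not apply.
fn resolve<'a>(value: &'a Variant, path: &[PathStep]) -> Option<&'a Variant> {
    let mut current = value;
    for step in path {
        current = match (current, step) {
            (Variant::Object(fields), PathStep::Key(k)) => fields.get(k)?,
            (Variant::List(items), PathStep::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Extract `path` from every row, keeping only values of the requested type.
/// Missing paths and type mismatches both become `None` (null) here;
/// a real implementation might cast or raise an error instead.
fn get_variant_element(
    rows: &[Variant],
    path: &[PathStep],
    dtype: TargetType,
) -> Vec<Option<Variant>> {
    rows.iter()
        .map(|row| {
            let found = resolve(row, path)?;
            let type_matches = matches!(
                (found, dtype),
                (Variant::Int(_), TargetType::Int64) | (Variant::Str(_), TargetType::Utf8)
            );
            type_matches.then(|| found.clone())
        })
        .collect()
}
```

The output always has one entry per input row, which is what lets the result line up with the other columns of the same table.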

What are the ctors of a variant?

Collaborator Author

not sure I understand the question

How do I create a Variant array?

Collaborator Author

I'm honestly not sure what the right API is here. One option @gatesn and I discussed is an extension type (json_utf8) that holds strings and, on write, gets parsed and figures out the shredding. Another is to have a builder.

Contributor

You can't create a Variant array. Or rather, you can but it just has a single child with DType::Variant.

You must construct a concrete implementation of a variant array, e.g. ParquetVariant.
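
One way to picture the split between the logical array and its concrete encodings is the sketch below. It is an assumption-laden toy, not the Vortex API: `VariantEncoding`, `Unshredded`, `ShreddedOne` and `VariantArray` are hypothetical names, values are restricted to `i64`, and paths are reduced to a single top-level key. The point is only that the wrapper holds exactly one child and forwards variant-specific operations to it.

```rust
use std::collections::BTreeMap;

/// Behavior every concrete variant encoding would have to provide.
/// Hypothetical trait, reduced to i64 extraction by top-level key.
trait VariantEncoding {
    fn len(&self) -> usize;
    fn get_i64(&self, key: &str) -> Vec<Option<i64>>;
}

/// Encoding A: every row kept as an unshredded key/value map.
struct Unshredded {
    rows: Vec<BTreeMap<String, i64>>,
}

/// Encoding B: one key shredded into a dense column, the rest kept per row.
struct ShreddedOne {
    key: String,
    typed: Vec<Option<i64>>,
    residual: Vec<BTreeMap<String, i64>>,
}

impl VariantEncoding for Unshredded {
    fn len(&self) -> usize {
        self.rows.len()
    }
    fn get_i64(&self, key: &str) -> Vec<Option<i64>> {
        self.rows.iter().map(|r| r.get(key).copied()).collect()
    }
}

impl VariantEncoding for ShreddedOne {
    fn len(&self) -> usize {
        self.typed.len()
    }
    fn get_i64(&self, key: &str) -> Vec<Option<i64>> {
        if key == self.key {
            // Fast path: the shredded column already holds the answer.
            self.typed.clone()
        } else {
            self.residual.iter().map(|r| r.get(key).copied()).collect()
        }
    }
}

/// The logical array: a thin wrapper over exactly one encoded child.
struct VariantArray {
    child: Box<dyn VariantEncoding>,
}

impl VariantArray {
    fn len(&self) -> usize {
        self.child.len()
    }
    fn get_i64(&self, key: &str) -> Vec<Option<i64>> {
        // Dispatch to whichever concrete encoding backs this array.
        self.child.get_i64(key)
    }
}
```

Callers only see `VariantArray`, so two arrays holding the same logical data in different encodings should answer every extraction identically.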

## Unresolved Questions

- Do we want a JSON extension type that automatically compresses as variant?
- How do variant expressions operate over different variant encodings?

can we expand on this?

AdamGS (Collaborator, Author), Mar 3, 2026

I guess we don't push arithmetic operations down? Missed those code paths.

I guess these might be reduce rules? Allowing variant encodings to implement their own specific logic for each of the variant-specific functions. Am I making sense? I don't think I fully understand how all the pieces connect.

Contributor

So the iterative execution could cover this. But if you're extracting a column from the binary variant (i.e. non-shredded), then ideally you would fuse the extraction with the projection expression. Maybe this is too tricky to figure out at the moment and actually we could get most of the benefits from some sort of pipelined execution (run over 2k elements at a time for instance).
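
The pipelined alternative mentioned above could look roughly like the following sketch: extract a batch of values from the unshredded rows, apply the projection to that batch, and move on, instead of materializing the whole extracted column first. The row encoding (first 8 bytes as a little-endian `i64`), the `extract` helper and the `BATCH` constant are all made up for illustration; only the "about 2k elements at a time" figure comes from the comment.

```rust
/// Rows per batch; 2048 mirrors the "2k elements" mentioned in the comment.
const BATCH: usize = 2048;

/// Hypothetical stand-in for decoding one field from a binary variant row.
/// Toy encoding: the first 8 bytes are a little-endian i64; shorter rows are null.
fn extract(row: &[u8]) -> Option<i64> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(row.get(..8)?);
    Some(i64::from_le_bytes(bytes))
}

/// Extract and project batch by batch, reusing one small scratch buffer
/// so the intermediate extracted values never exceed `BATCH` entries.
fn extract_then_project(
    rows: &[Vec<u8>],
    project: impl Fn(i64) -> i64,
) -> Vec<Option<i64>> {
    let mut out = Vec::with_capacity(rows.len());
    let mut scratch: Vec<Option<i64>> = Vec::with_capacity(BATCH);
    for batch in rows.chunks(BATCH) {
        scratch.clear();
        // Stage 1: extraction over the current batch only.
        scratch.extend(batch.iter().map(|r| extract(r)));
        // Stage 2: projection over the freshly extracted batch.
        out.extend(scratch.iter().map(|v| v.map(&project)));
    }
    out
}
```

This does not fuse the two stages into one loop body, but it keeps the intermediate small enough to stay cache-resident, which is the benefit the comment is after.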

AdamGS added 6 commits March 4, 2026 15:17
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
gatesn (Contributor) left a comment

Let's ship it and start implementing. We may run into issues but these can be reflected in the developer guide.

gatesn merged commit a08d67c into develop on Mar 4, 2026
3 checks passed
gatesn deleted the adamg/variant branch on March 4, 2026 15:49
6 participants