fix: normalize IEEE 754 -0.0 to +0.0 for consistent float comparison#6236
LuciferYang wants to merge 1 commit into lance-format:main from
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, inspect the "PR Title Check" action.
Review: Clean, well-structured fix for the IEEE 754 -0.0 vs +0.0 inconsistency. The multi-layer approach (ArrowScalar row encoding, OrderableScalarValue Ord, query expression literals, coercion) is the right strategy — normalizing at each boundary rather than trying a single choke point.

Minor observations (not blocking):
- Unnecessary allocations for +0.0
- DRY opportunity

Overall: good fix, good tests, good PR description. LGTM.
Codecov Report: ❌ Patch coverage is
There are test failures, and I need to investigate them. |
The test failure is due to inconsistent behavior between Rust Arrow/DataFusion and PyArrow. Rust Arrow 57 (used by DataFusion 52) compares floats with `total_cmp`, while PyArrow (C++ Arrow) follows IEEE 754 comparison semantics:
| Operation | IEEE 754 | total_cmp (Rust Arrow 57 / DataFusion 52) | PyArrow (C++ Arrow) |
|---|---|---|---|
| -0.0 < 0.0 | false | true | false |
| -0.0 == 0.0 | true | false | true |
| -0.0 <= 0.0 | true | true | true |
Conflict
This PR normalizes -0.0 to +0.0 in multiple locations to implement IEEE 754 semantics:

- `ArrowScalar` row encoding (Eq/Ord/Hash consistency)
- `OrderableScalarValue::Ord` (btree index comparisons)
- `SargableQuery::to_expr` / `BloomFilterQuery::to_expr` (query expression literals)
- `safe_coerce_scalar` (float coercion)
- Bitmap index training (HashMap key normalization)
This makes Lance's index results inconsistent with DataFusion/Arrow scan results:
- Lance bitmap index (`value < 0.0`): excludes `-0.0` (IEEE 754)
- DataFusion scan (`value < 0.0`): includes `-0.0` (`total_cmp`)
The integration test `test_query_float_special_values` catches this inconsistency because it validates Lance scanner output against DataFusion SQL output.
Recommended Approach: Option B + C
After analysis, neither Option B (index/query normalization) nor Option C (storage normalization) alone is sufficient. The recommended approach combines both:
Option B: Index/query layer normalization (current commit)
Normalize -0.0 to +0.0 in index building, query expression conversion, and scalar comparisons. This is the current implementation on branch fix-5868.
Handles: Historical data that already has -0.0 stored on disk. When these values flow through indexes and queries, they are normalized at the comparison/expression boundary.
Option C: Storage/ingestion layer normalization (additional work needed)
Normalize -0.0 to +0.0 when writing data to Lance files at the writer/encoding layer.
Handles: New data. Eliminates -0.0 at the source so all downstream code (indexes, scans, DataFusion filters) sees only +0.0. This removes the total_cmp vs IEEE 754 divergence entirely for new writes, since +0.0 < 0.0 is false under both semantics.
Why both are needed
| Scenario | Option B alone | Option C alone | B + C |
|---|---|---|---|
| New data, index query | Correct | Correct (no -0.0 to compare) | Correct |
| New data, DataFusion scan filter | Diverges (index: IEEE 754, scan: total_cmp) | Correct (no -0.0 to compare) | Correct |
| Historical data, index query | Correct | Wrong (-0.0 still on disk) | Correct |
| Historical data, DataFusion scan filter | Diverges | Wrong | Diverges (unavoidable without rewrite) |
- Option B alone fails for new data: Lance index results (IEEE 754) will differ from DataFusion scan results (`total_cmp`) when the data contains `-0.0`.
- Option C alone fails for historical data: existing Lance files still contain `-0.0` on disk, and indexes built from historical data will encounter `-0.0` values.
- B + C together are fully correct for new data. For historical data, index queries are correct (Option B normalizes at comparison time), but a raw DataFusion scan filter on un-rewritten data may still include `-0.0` in `< 0.0` results due to `total_cmp`. This edge case is unavoidable without rewriting the data files, and is acceptable since:
  - It only affects legacy data written before the fix
  - Users can resolve it by rewriting/compacting their datasets
  - The index path (which is the primary query path) gives correct results
Integration test update
The integration test `test_query_float_special_values` compares Lance scanner output against DataFusion SQL on the original in-memory batch. With Option C, the stored data no longer contains `-0.0`, so the test baseline must be constructed from data read back from the Lance dataset (post-normalization) rather than the pre-write in-memory batch. This is a test-only change.
Other Options (not recommended)
Option A: Match DataFusion/Arrow (total_cmp semantics)
Revert -0.0 normalization from OrderableScalarValue::Ord, bitmap training, and SargableQuery::to_expr. Keep only the ArrowScalar Eq/Hash normalization (which has its own contract independent of DataFusion).
Pros:
- Lance index results match full-scan results (both use `total_cmp`)
- Consistent with the Rust Arrow/DataFusion ecosystem
- Integration tests pass as-is
Cons:
- `-0.0` treated differently from `+0.0` everywhere (violates mathematical expectation)
- Diverges from PyArrow/C++ Arrow behavior
- Issue #5868 (incorrect handling of -0.0 in comparison) remains unfixed
Dear project maintainers, what are your thoughts on this?
Summary
Fixes #5868
IEEE 754 specifies that `-0.0 == 0.0`, but `total_cmp` and bit-level representations distinguish them. This caused incorrect results in scalar filters (e.g., `x < 0.0` incorrectly included `-0.0`) and inconsistent `Eq`/`Ord`/`Hash` behavior for float scalars.

This PR normalizes -0.0 to +0.0 at each layer where float values are compared or hashed:
- `ArrowScalar` (arrow-scalar) — normalize Float16/Float32/Float64 arrays before row encoding so that `Eq`, `Ord`, and `Hash` treat -0.0 and +0.0 identically
- `OrderableScalarValue` (lance-index/btree) — normalize before `total_cmp` in the manual `Ord` impl for Float16/Float32/Float64
- `SargableQuery::to_expr` / `BloomFilterQuery::to_expr` (lance-index/scalar) — normalize float `ScalarValue` literals when converting queries to DataFusion expressions
- `safe_coerce_scalar` (lance-datafusion/expr) — normalize during float-to-float coercion

Test plan
- `test_neg_zero_equals_pos_zero_f16/f32/f64` — `ArrowScalar` equality and ordering
- `test_neg_zero_hash_equals_pos_zero` — `ArrowScalar` hash consistency
- `test_neg_zero_cmp_f16/f32/f64` — `OrderableScalarValue` ordering
- `test_neg_zero_less_than_positive_f64` — normalized zero still orders correctly against other values
- `sorted_array_produces_sorted_scalars` continues to pass
- `cargo clippy` and `cargo fmt` clean