fix: normalize IEEE 754 -0.0 to +0.0 for consistent float comparison#6236

Open
LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:fix-5868
Conversation

@LuciferYang
Contributor

Summary

Fixes #5868

IEEE 754 specifies that -0.0 == 0.0, but total_cmp and bit-level representations distinguish them. This caused incorrect results in scalar filters (e.g., x < 0.0 incorrectly included -0.0) and inconsistent Eq/Ord/Hash behavior for float scalars.

This PR normalizes -0.0 to +0.0 at each layer where float values are compared or hashed:

  • ArrowScalar (arrow-scalar) — normalize Float16/Float32/Float64 arrays before row encoding so that Eq, Ord, and Hash treat -0.0 and +0.0 identically
  • OrderableScalarValue (lance-index/btree) — normalize before total_cmp in the manual Ord impl for Float16/Float32/Float64
  • SargableQuery::to_expr / BloomFilterQuery::to_expr (lance-index/scalar) — normalize float ScalarValue literals when converting queries to DataFusion expressions
  • safe_coerce_scalar (lance-datafusion/expr) — normalize during float-to-float coercion
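The normalization applied at each of these layers boils down to one rewrite: map `-0.0` to `+0.0` before any bit-level comparison or hashing. A minimal sketch (the function name `normalize_f64` is illustrative, not the PR's exact API):

```rust
// Collapse IEEE 754 negative zero to positive zero so that bit-level
// Eq/Ord/Hash treat the two zeros identically.
fn normalize_f64(v: f64) -> f64 {
    // `v == 0.0` matches both +0.0 and -0.0 under IEEE 754 semantics,
    // so this single branch rewrites -0.0 and leaves everything else alone.
    if v == 0.0 { 0.0 } else { v }
}

fn main() {
    // -0.0 and +0.0 have different bit patterns until normalized.
    assert_ne!((-0.0_f64).to_bits(), 0.0_f64.to_bits());
    assert_eq!(normalize_f64(-0.0).to_bits(), 0.0_f64.to_bits());
    // Non-zero values, including NaN, pass through unchanged.
    assert_eq!(normalize_f64(1.5), 1.5);
    assert!(normalize_f64(f64::NAN).is_nan());
}
```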

Test plan

  • test_neg_zero_equals_pos_zero_f16 / f32 / f64 — ArrowScalar equality and ordering
  • test_neg_zero_hash_equals_pos_zero — ArrowScalar hash consistency
  • test_neg_zero_cmp_f16 / f32 / f64 — OrderableScalarValue ordering
  • test_neg_zero_less_than_positive_f64 — normalized zero still orders correctly against other values
  • Existing property test sorted_array_produces_sorted_scalars continues to pass
  • cargo clippy and cargo fmt clean

@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@github-actions github-actions bot added the bug Something isn't working label Mar 20, 2026
@github-actions
Contributor

Review

Clean, well-structured fix for the IEEE 754 -0.0 vs +0.0 inconsistency. The multi-layer approach (ArrowScalar row encoding, OrderableScalarValue Ord, query expression literals, coercion) is the right strategy — normalizing at each boundary rather than trying a single choke-point.

Minor observations (not blocking)

Unnecessary allocations for +0.0: The v == 0.0 check matches both -0.0 and +0.0, so normalize_float_neg_zero allocates a new length-1 array even when the value is already +0.0. A tighter check like v.to_bits() == (-0.0_f32).to_bits() (or v.is_sign_negative() && v == 0.0) would skip the allocation for the common +0.0 case. Same applies to normalize_float_scalar and the Ord impl. The cost is negligible for scalars, but it would express the intent more precisely.

DRY opportunity: normalize_float_scalar in lance-index/src/scalar.rs and the inline normalization in btree.rs Ord impl do the same conceptual operation on different representations. Not a problem today, but if more float-handling sites appear, a shared normalization trait or utility could reduce surface area.
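A hedged sketch of what such a shared utility could look like, combining the tighter sign check with a trait over the float widths; the trait name and macro are assumptions, not code from the PR (std has no f16, so only f32/f64 are shown):

```rust
// Hypothetical shared normalization trait: one definition of the
// "-0.0 becomes +0.0" rule, reused by every float-handling site.
trait NormalizeZero {
    fn normalize_zero(self) -> Self;
}

macro_rules! impl_normalize_zero {
    ($($t:ty),*) => {$(
        impl NormalizeZero for $t {
            fn normalize_zero(self) -> Self {
                // Only -0.0 is rewritten; +0.0 (and every other value)
                // is returned untouched, avoiding work in the common case.
                if self == 0.0 && self.is_sign_negative() { 0.0 } else { self }
            }
        }
    )*};
}

impl_normalize_zero!(f32, f64);

fn main() {
    assert!(!(-0.0_f64).normalize_zero().is_sign_negative());
    assert_eq!((0.0_f32).normalize_zero().to_bits(), 0.0_f32.to_bits());
    assert_eq!((-2.5_f64).normalize_zero(), -2.5);
}
```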

Overall: good fix, good tests, good PR description. LGTM.

@LuciferYang changed the title fix: normalize IEEE 754 -0.0 to +0.0 for consistent float comparison Mar 20, 2026
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 87.69231% with 16 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-datafusion/src/expr.rs | 40.00% | 5 Missing and 1 partial ⚠️ |
| rust/lance-index/src/scalar.rs | 75.00% | 6 Missing ⚠️ |
| rust/arrow-scalar/src/lib.rs | 96.55% | 1 Missing and 1 partial ⚠️ |
| rust/lance-index/src/scalar/btree.rs | 94.73% | 2 Missing ⚠️ |


@LuciferYang
Contributor Author

There are test failures, and I need to investigate them.

@LuciferYang
Contributor Author

The test failure is due to inconsistent -0.0 comparison behavior between Rust Arrow/DataFusion and PyArrow.

Rust Arrow 57 (arrow-ord)

Arrow's comparison kernels (lt, eq, gt, etc.) use total_cmp, which implements the totalOrder predicate defined in IEEE 754-2008. From the Arrow docs:

For floating values like f32 and f64, this comparison produces an ordering in accordance to the totalOrder predicate as defined in the IEEE 754 (2008 revision) floating point standard. Note that totalOrder treats positive and negative zeros as different. If it is necessary to treat them as equal, please normalize zeros before calling this kernel.

Under total_cmp: -0.0 < 0.0 is true, and -0.0 == 0.0 is false.
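The divergence is reproducible in plain Rust with std alone, since f64::total_cmp is the same totalOrder predicate Arrow's kernels use:

```rust
use std::cmp::Ordering;

fn main() {
    // IEEE 754 comparison operators treat the two zeros as equal.
    assert!(-0.0_f64 == 0.0_f64);
    assert!(!(-0.0_f64 < 0.0_f64));

    // total_cmp (the IEEE 754-2008 totalOrder predicate) instead orders
    // -0.0 strictly before +0.0, which is what arrow-ord relies on.
    assert_eq!((-0.0_f64).total_cmp(&0.0), Ordering::Less);
    assert_ne!((-0.0_f64).total_cmp(&0.0), Ordering::Equal);
}
```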

DataFusion 52.3.0

DataFusion delegates comparison to Arrow's kernels via apply_cmp -> arrow::compute::kernels::cmp::lt. Verified with DataFusion Python:

DataFusion SQL: value < 0.0
  ids: [1, 3, 6, 7]        ← includes -0.0 (id=1)

DataFusion SQL: value = 0.0
  ids: [0]                  ← excludes -0.0

PyArrow 23 (C++ Arrow)

PyArrow follows IEEE 754 comparison semantics:

less(-0.0, 0.0)       = False
equal(-0.0, 0.0)      = True

Summary Table

| Operation | IEEE 754 | total_cmp (Rust Arrow 57 / DataFusion 52) | PyArrow (C++ Arrow) |
| --- | --- | --- | --- |
| -0.0 < 0.0 | false | true | false |
| -0.0 == 0.0 | true | false | true |
| -0.0 <= 0.0 | true | true | true |

Conflict

This PR normalizes -0.0 to +0.0 in multiple locations to implement IEEE 754 semantics:

  • ArrowScalar row encoding (Eq/Ord/Hash consistency)
  • OrderableScalarValue::Ord (btree index comparisons)
  • SargableQuery::to_expr / BloomFilterQuery::to_expr (query expression literals)
  • safe_coerce_scalar (float coercion)
  • Bitmap index training (HashMap key normalization)

This makes Lance's index results inconsistent with DataFusion/Arrow scan results:

  • Lance bitmap index (value < 0.0): excludes -0.0 (IEEE 754)
  • DataFusion scan (value < 0.0): includes -0.0 (total_cmp)

The integration test test_query_float_special_values catches this inconsistency because it validates Lance scanner output against DataFusion SQL output.

Recommended Approach: Option B + C

After analysis, neither Option B (index/query normalization) nor Option C (storage normalization) alone is sufficient. The recommended approach combines both:

Option B: Index/query layer normalization (current commit)

Normalize -0.0 to +0.0 in index building, query expression conversion, and scalar comparisons. This is the current implementation on branch fix-5868.

Handles: Historical data that already has -0.0 stored on disk. When these values flow through indexes and queries, they are normalized at the comparison/expression boundary.

Option C: Storage/ingestion layer normalization (additional work needed)

Normalize -0.0 to +0.0 when writing data to Lance files at the writer/encoding layer.

Handles: New data. Eliminates -0.0 at the source so all downstream code (indexes, scans, DataFusion filters) sees only +0.0. This removes the total_cmp vs IEEE 754 divergence entirely for new writes, since +0.0 < 0.0 is false under both semantics.
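A minimal sketch of the Option C idea at the ingestion boundary, using a plain f64 slice for illustration; the real change would live in Lance's writer/encoding layer and cover f16/f32/f64 Arrow arrays, so the function below is an assumption, not the proposed implementation:

```rust
// Rewrite -0.0 to +0.0 in a column buffer before it is encoded to disk,
// so downstream readers, indexes, and filters never observe -0.0.
fn normalize_column(values: &mut [f64]) {
    for v in values.iter_mut() {
        if *v == 0.0 && v.is_sign_negative() {
            *v = 0.0;
        }
    }
}

fn main() {
    let mut col = vec![-0.0, 1.0, 0.0, -2.5];
    normalize_column(&mut col);
    // After normalization no element carries a negative-zero bit pattern,
    // so `x < 0.0` agrees under both IEEE 754 and total_cmp semantics.
    assert!(col.iter().all(|v| !(*v == 0.0 && v.is_sign_negative())));
    assert_eq!(col, vec![0.0, 1.0, 0.0, -2.5]);
}
```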

Why both are needed

| Scenario | Option B alone | Option C alone | B + C |
| --- | --- | --- | --- |
| New data, index query | Correct | Correct (no -0.0 to compare) | Correct |
| New data, DataFusion scan filter | Diverges (index: IEEE 754, scan: total_cmp) | Correct (no -0.0 to compare) | Correct |
| Historical data, index query | Correct | Wrong (-0.0 still on disk) | Correct |
| Historical data, DataFusion scan filter | Diverges | Wrong | Diverges (unavoidable without rewrite) |
  • Option B alone fails for new data: Lance index results (IEEE 754) will differ from DataFusion scan results (total_cmp) when the data contains -0.0.
  • Option C alone fails for historical data: existing Lance files still contain -0.0 on disk. Indexes built from historical data will encounter -0.0 values.
  • B + C together are fully correct for new data. For historical data, index queries are correct (Option B normalizes at comparison time), but a raw DataFusion scan filter on un-rewritten data may still include -0.0 in < 0.0 results due to total_cmp. This edge case is unavoidable without rewriting the data files, and is acceptable since:
    • It only affects legacy data written before the fix
    • Users can resolve it by rewriting/compacting their datasets
    • The index path (which is the primary query path) gives correct results

Integration test update

The integration test test_query_float_special_values compares Lance scanner output against DataFusion SQL on the original in-memory batch. With Option C, the stored data no longer contains -0.0, so the test baseline must be constructed from data read back from the Lance dataset (post-normalization) rather than the pre-write in-memory batch. This is a test-only change.

Other Options (not recommended)

Option A: Match DataFusion/Arrow (total_cmp semantics)

Revert -0.0 normalization from OrderableScalarValue::Ord, bitmap training, and SargableQuery::to_expr. Keep only the ArrowScalar Eq/Hash normalization (which has its own contract independent of DataFusion).

Pros:

  • Lance index results match full-scan results (both use total_cmp)
  • Consistent with the Rust Arrow/DataFusion ecosystem
  • Integration tests pass as-is

Cons:

Dear project maintainers, what are your thoughts on this?



Development

Successfully merging this pull request may close these issues.

incorrect handling of -0.0 in comparison
