fix: normalize IEEE 754 -0.0 to +0.0 for consistent float comparison#6236
LuciferYang wants to merge 1 commit into lance-format:main from
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, inspect the "PR Title Check" action.
Review: Clean, well-structured fix for the IEEE 754 -0.0 vs +0.0 inconsistency. The multi-layer approach (ArrowScalar row encoding, OrderableScalarValue Ord, query expression literals, coercion) is the right strategy — normalizing at each boundary rather than trying a single choke point.

Minor observations (not blocking):
- Unnecessary allocations for +0.0
- DRY opportunity

Overall: good fix, good tests, good PR description. LGTM.
Codecov Report: ❌ Patch coverage is
There are test failures, and I need to investigate them. |
The test failure is due to inconsistent behavior between Rust Arrow/DataFusion and PyArrow. Rust Arrow 57 (used by DataFusion 52) compares floats with `total_cmp`, while PyArrow (C++ Arrow) follows IEEE 754 comparison semantics:
| Operation | IEEE 754 | total_cmp (Rust Arrow 57 / DataFusion 52) | PyArrow (C++ Arrow) |
|---|---|---|---|
| -0.0 < 0.0 | false | true | false |
| -0.0 == 0.0 | true | false | true |
| -0.0 <= 0.0 | true | true | true |
Conflict
This PR normalizes -0.0 to +0.0 in multiple locations to implement IEEE 754 semantics:

- `ArrowScalar` row encoding (Eq/Ord/Hash consistency)
- `OrderableScalarValue::Ord` (btree index comparisons)
- `SargableQuery::to_expr` / `BloomFilterQuery::to_expr` (query expression literals)
- `safe_coerce_scalar` (float coercion)
- Bitmap index training (HashMap key normalization)
This makes Lance's index results inconsistent with DataFusion/Arrow scan results:
- Lance bitmap index (`value < 0.0`): excludes `-0.0` (IEEE 754)
- DataFusion scan (`value < 0.0`): includes `-0.0` (`total_cmp`)
The integration test `test_query_float_special_values` catches this inconsistency because it validates Lance scanner output against DataFusion SQL output.
Recommended Approach: Option B + C
After analysis, neither Option B (index/query normalization) nor Option C (storage normalization) alone is sufficient. The recommended approach combines both:
Option B: Index/query layer normalization (current commit)
Normalize -0.0 to +0.0 in index building, query expression conversion, and scalar comparisons. This is the current implementation on branch fix-5868.
Handles: Historical data that already has -0.0 stored on disk. When these values flow through indexes and queries, they are normalized at the comparison/expression boundary.
Option C: Storage/ingestion layer normalization (additional work needed)
Normalize -0.0 to +0.0 when writing data to Lance files at the writer/encoding layer.
Handles: New data. Eliminates -0.0 at the source so all downstream code (indexes, scans, DataFusion filters) sees only +0.0. This removes the total_cmp vs IEEE 754 divergence entirely for new writes, since +0.0 < 0.0 is false under both semantics.
Why both are needed
| Scenario | Option B alone | Option C alone | B + C |
|---|---|---|---|
| New data, index query | Correct | Correct (no -0.0 to compare) | Correct |
| New data, DataFusion scan filter | Diverges (index: IEEE 754, scan: total_cmp) | Correct (no -0.0 to compare) | Correct |
| Historical data, index query | Correct | Wrong (-0.0 still on disk) | Correct |
| Historical data, DataFusion scan filter | Diverges | Wrong | Diverges (unavoidable without rewrite) |
- Option B alone fails for new data: Lance index results (IEEE 754) will differ from DataFusion scan results (`total_cmp`) when the data contains `-0.0`.
- Option C alone fails for historical data: existing Lance files still contain `-0.0` on disk, and indexes built from historical data will encounter `-0.0` values.
- B + C together are fully correct for new data. For historical data, index queries are correct (Option B normalizes at comparison time), but a raw DataFusion scan filter on un-rewritten data may still include `-0.0` in `< 0.0` results due to `total_cmp`. This edge case is unavoidable without rewriting the data files, and is acceptable since:
  - It only affects legacy data written before the fix
  - Users can resolve it by rewriting/compacting their datasets
  - The index path (which is the primary query path) gives correct results
Integration test update
The integration test `test_query_float_special_values` compares Lance scanner output against DataFusion SQL on the original in-memory batch. With Option C, the stored data no longer contains `-0.0`, so the test baseline must be constructed from data read back from the Lance dataset (post-normalization) rather than the pre-write in-memory batch. This is a test-only change.
Other Options (not recommended)
Option A: Match DataFusion/Arrow (total_cmp semantics)
Revert -0.0 normalization from OrderableScalarValue::Ord, bitmap training, and SargableQuery::to_expr. Keep only the ArrowScalar Eq/Hash normalization (which has its own contract independent of DataFusion).
Pros:
- Lance index results match full-scan results (both use `total_cmp`)
- Consistent with the Rust Arrow/DataFusion ecosystem
- Integration tests pass as-is
Cons:
- `-0.0` treated differently from `+0.0` everywhere (violates mathematical expectation)
- Diverges from PyArrow/C++ Arrow behavior
- Issue #5868 (incorrect handling of -0.0 in comparison) remains unfixed
Dear project maintainers, what are your thoughts on this?
Summary
Fixes #5868
IEEE 754 specifies that `-0.0 == 0.0`, but `total_cmp` and bit-level representations distinguish them. This caused incorrect results in scalar filters (e.g., `x < 0.0` incorrectly included `-0.0`) and inconsistent `Eq`/`Ord`/`Hash` behavior for float scalars.

This PR normalizes -0.0 to +0.0 at each layer where float values are compared or hashed:
- `ArrowScalar` (arrow-scalar) — normalize Float16/Float32/Float64 arrays before row encoding so that `Eq`, `Ord`, and `Hash` treat -0.0 and +0.0 identically
- `OrderableScalarValue` (lance-index/btree) — normalize before `total_cmp` in the manual `Ord` impl for Float16/Float32/Float64
- `SargableQuery::to_expr` / `BloomFilterQuery::to_expr` (lance-index/scalar) — normalize float `ScalarValue` literals when converting queries to DataFusion expressions
- `safe_coerce_scalar` (lance-datafusion/expr) — normalize during float-to-float coercion

Test plan
- `test_neg_zero_equals_pos_zero_f16/f32/f64` — `ArrowScalar` equality and ordering
- `test_neg_zero_hash_equals_pos_zero` — `ArrowScalar` hash consistency
- `test_neg_zero_cmp_f16/f32/f64` — `OrderableScalarValue` ordering
- `test_neg_zero_less_than_positive_f64` — normalized zero still orders correctly against other values
- `sorted_array_produces_sorted_scalars` continues to pass
- `cargo clippy` and `cargo fmt` clean