Conversation

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

This PR extends the PyArrow array cast tests to cover both safe=True and safe=False behaviors in each test case.

Changes:

  1. Modified _run_cast_tests method: Now tests both safe=True and safe=False modes for each test case
  2. Extended test case format: Supports a 3-tuple format (src_arr, expected_safe_true, expected_safe_false), where expected_safe_false=None means the same behavior as safe=True (see the sketch after this list)
  3. Updated 200+ test cases: All overflow/truncation cases now explicitly specify the safe=False behavior (wrapping, truncation, or saturation)
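
For illustration, here is a minimal sketch of the extended format and the dual-mode runner. The plain-assert style and the convention that expected_safe_true=None encodes "safe=True raises ArrowInvalid" are assumptions for this sketch, not necessarily the PR's exact code:

```python
import pyarrow as pa

# Hedged sketch of the 3-tuple format: (src_arr, expected_safe_true, expected_safe_false).
# Assumptions: expected_safe_true=None means "safe=True raises ArrowInvalid",
# and expected_safe_false=None means "same behavior as safe=True".
def run_cast_cases(cases, target_type):
    for src_arr, expected_safe_true, expected_safe_false in cases:
        if expected_safe_true is None:
            # safe=True rejects lossy casts (overflow/truncation)
            try:
                src_arr.cast(target_type, safe=True)
                raise AssertionError("expected safe=True cast to raise")
            except pa.ArrowInvalid:
                pass
        else:
            assert src_arr.cast(target_type, safe=True).equals(expected_safe_true)

        # safe=False does not raise here; it wraps/truncates/saturates instead
        expected = expected_safe_false if expected_safe_false is not None else expected_safe_true
        assert src_arr.cast(target_type, safe=False).equals(expected)

run_cast_cases(
    [
        # -1 does not fit in uint8: safe=True raises, safe=False wraps to 255
        (pa.array([-1], type=pa.int32()), None, pa.array([255], type=pa.uint8())),
        # 1 fits either way, so safe=False behaves the same as safe=True
        (pa.array([1], type=pa.int32()), pa.array([1], type=pa.uint8()), None),
    ],
    pa.uint8(),
)
```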

Categories of safe=False behavior covered:

  • Integer overflow: Negative-to-unsigned values wrap (e.g., -1 → 255 for uint8)
  • Integer narrowing: Value wraps on overflow (e.g., 128 as int16 → -128 as int8)
  • Float-to-integer: Truncation and saturation for out-of-range values
  • Timestamp/Duration precision loss: Truncation when casting to coarser units
  • Decimal overflow: Wrapping behavior for values exceeding integer limits
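
These categories follow PyArrow's own unsafe-cast semantics. A few illustrative one-liners (expected results shown in comments; exact behavior can vary by PyArrow version):

```python
import pyarrow as pa

# Integer narrowing: 128 does not fit in int8, so the value wraps
print(pa.array([128], type=pa.int16()).cast(pa.int8(), safe=False).to_pylist())     # [-128]

# Float-to-integer: the fractional part is truncated instead of raising
print(pa.array([3.7], type=pa.float64()).cast(pa.int32(), safe=False).to_pylist())  # [3]

# Timestamp precision loss: microseconds are dropped when casting to seconds
ts_us = pa.array([1_000_001], type=pa.timestamp("us"))
print(ts_us.cast(pa.timestamp("s"), safe=False).to_pylist())  # [datetime.datetime(1970, 1, 1, 0, 0, 1)]
```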

Why are the changes needed?

Part of SPARK-54936. The existing tests only covered safe=True behavior. This PR adds comprehensive coverage for safe=False to ensure PySpark correctly handles PyArrow's overflow/truncation semantics, which is important for data processing pipelines that need to handle edge cases gracefully.

Does this PR introduce any user-facing change?

No

How was this patch tested?

All 39 tests in PyArrowNumericalCastTests pass with the updated test framework.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions

JIRA Issue Information

=== Sub-task SPARK-54943 ===
Summary: Add tests for pa.Array.cast with overflow
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@zhengruifeng
Contributor

@Yicong-Huang @fangchenli I am wondering whether we should use golden files instead.
If we use golden files, we don't need to touch the test files much; we would just need to regenerate or compare a new file under the new conditions (safe=False in this PR).

@Yicong-Huang
Contributor Author

Yicong-Huang commented Jan 28, 2026

@Yicong-Huang @fangchenli I am wondering whether we should use golden files instead. If we use golden files, we don't need to touch the test files much; we would just need to regenerate or compare a new file under the new conditions (safe=False in this PR).

Hmm, I am really not a fan of golden files, especially for this kind of type-related test. To compare in a text-based format, all Python objects need to be serialized in some way (e.g., str(object) or repr(object)), which could cause edge cases to be missed. For example, a decimal128 and a decimal256 may have different digits in them but could be serialized to the same string (I am making this up as an example). We would need to make sure we don't introduce potential problems like this. This particular test for pa.Array.cast may be sensitive to an extra serialization step.

Maybe we can try the golden file approach in future/other tests?
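
As a hedged illustration of the serialization concern above (the decimal case was explicitly made up, but the general point is that a plain str()-based golden file can hide type information):

```python
from decimal import Decimal
import pyarrow as pa

a128 = pa.array([Decimal("1.23")], type=pa.decimal128(5, 2))
a256 = pa.array([Decimal("1.23")], type=pa.decimal256(5, 2))

print(a128.type, a256.type)    # decimal128(5, 2) decimal256(5, 2) -- different Arrow types
print(str(a128) == str(a256))  # True -- identical text, so a str()-based golden file
                               # would not notice if one type silently became the other
```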

@zhengruifeng
Contributor

@Yicong-Huang we can control the string representation and include the information of interest in it.

e.g., in the UDF type coercion test, we store both the value and the type as f"{value}_{type(_)}"
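
A minimal sketch of that idea (the helper name and exact format here are assumptions, not the actual UDF type coercion test code):

```python
import pyarrow as pa

# Hypothetical helper: serialize each element together with the array's Arrow type,
# so a golden file records both the value and the type.
def golden_repr(arr):
    return [f"{scalar.as_py()}_{arr.type}" for scalar in arr]

wrapped = pa.array([128], type=pa.int16()).cast(pa.int8(), safe=False)
print(golden_repr(wrapped))  # ['-128_int8']
```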

@zhengruifeng
Contributor

thanks, merged to master
