-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-55224][PYTHON] Use Spark DataType as ground truth in Pandas-Arrow serialization #53992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[SPARK-55224][PYTHON] Use Spark DataType as ground truth in Pandas-Arrow serialization #53992
Conversation
JIRA Issue Information=== Improvement SPARK-55224 === This comment was automatically generated by GitHub Actions |
…tor/spark-type-ground-truth
…tor/spark-type-ground-truth # Conflicts: # python/pyspark/sql/pandas/serializers.py
| is_struct_type = ( | ||
| spark_type is not None | ||
| and isinstance(spark_type, StructType) | ||
| and not isinstance(spark_type, VariantType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed, StructType can not be a VariantType
we have
and pa.types.is_struct(arrow_type)
and not is_variant(arrow_type)
because VariantType in arrow is represented by a pa.struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it! thanks for the explanation! will change this.
| spark_type : DataType, optional | ||
| If None, spark type converted from arrow_type will be used | ||
| arrow_cast: bool, optional | ||
| The Spark type to use. If None, pyarrow's inferred type will be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, the spark type here is the return type?
I think it should never be None?
| df: "pd.DataFrame", | ||
| arrow_struct_type: "pa.StructType", | ||
| spark_type: Optional[StructType] = None, | ||
| spark_type: StructType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we rename return type spark_type to return_type to be consistent with worker.py?
We have input_type in this file for input schema, spark_type is kind of confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
python/pyspark/worker.py
Outdated
| def to_output_format(result): | ||
| # Convert Series of dicts/Rows to DataFrame for struct types | ||
| if isinstance(return_type, StructType): | ||
| return pd.DataFrame(list(result)) | ||
| return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def to_output_format(result): | |
| # Convert Series of dicts/Rows to DataFrame for struct types | |
| if isinstance(return_type, StructType): | |
| return pd.DataFrame(list(result)) | |
| return result | |
| if isinstance(return_type, StructType): | |
| # Convert Series of dicts/Rows to DataFrame for struct types | |
| def to_output_format(result): | |
| return pd.DataFrame(list(result)) | |
| else: | |
| def to_output_format(result): | |
| return result |
What changes were proposed in this pull request?
Let
_create_batchand_create_arrayin PySpark's Pandas serializers to use Spark'sDataTypeas the single source of truth, deriving Arrow types internally when needed.Before: Callers in
worker.pypre-computedarrow_return_type = to_arrow_type(return_type, ...)and passed botharrow_typeandspark_typethrough the serialization pipeline.After: Callers pass only
spark_type(Spark DataType). The serializers derivearrow_typeinternally viato_arrow_type().Key changes:
worker.pyupdated to yieldreturn_typeinstead ofarrow_return_typeArrowStreamArrowUDFSerializer) unchanged - they still passarrow_typedirectlyWhy are the changes needed?
spark_typeis the canonical type representation defined by usersarrow_type_create_batchand_create_arraynow follow the same patternDoes this PR introduce any user-facing change?
No. This is an internal refactoring with no user-facing API changes.
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.