[python] Support random sample for append table #7014
discivigour wants to merge 10 commits into apache:master
2b7ac1c to 1dbd073
JingsongLi left a comment
Can you do the refactor in a separate PR?
paimon-python/pypaimon/read/split.py
Outdated
    _row_count: int
    _file_size: int
    shard_file_idx_map: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # file_name -> (start_idx, end_idx)
    sample_file_idx_map: Dict[str, List[int]] = field(default_factory=dict)  # file_name -> [sample_indexes]
We need to merge sample_file_idx_map into shard_file_idx_map, maybe just introduce shard_file_idx_map: Dict[str, List[Range]].
Sure.
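One possible reading of the suggestion above, as a minimal sketch: both the contiguous shard slice and the scattered sample indexes can be expressed as lists of row ranges under a single map. The `Range` dataclass, the half-open `[start, end)` convention, and the helper names `ranges_from_indexes` and `merge_idx_maps` are assumptions made for illustration; they are not taken from the PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Range:
    """Hypothetical half-open row range [start, end)."""
    start: int
    end: int


def ranges_from_indexes(indexes: List[int]) -> List[Range]:
    """Collapse row indexes into sorted, non-overlapping ranges."""
    ranges: List[Range] = []
    for idx in sorted(set(indexes)):
        if ranges and ranges[-1].end == idx:
            # Extend the previous range when the index is adjacent to it.
            ranges[-1] = Range(ranges[-1].start, idx + 1)
        else:
            ranges.append(Range(idx, idx + 1))
    return ranges


def merge_idx_maps(
    shard_map: Dict[str, Tuple[int, int]],
    sample_map: Dict[str, List[int]],
) -> Dict[str, List[Range]]:
    """Build one file_name -> ranges map from the two original maps."""
    merged: Dict[str, List[Range]] = {
        name: [Range(start, end)] for name, (start, end) in shard_map.items()
    }
    for name, indexes in sample_map.items():
        # Assumption: a file carries either a shard slice or samples, not both.
        merged[name] = ranges_from_indexes(indexes)
    return merged
```

With this shape, a shard is just the one-element case of a sampled file, so downstream readers would only need to handle `List[Range]`.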
    if isinstance(self.reader.format_reader, FormatBlobReader):
        # For blob reader, pass begin_idx and end_idx parameters
        self.sample_idx += 1
        return self.reader.read_arrow_batch(start_idx=self.sample_idx - 1, end_idx=self.sample_idx)
Should this use sample_idx or self.sample_positions[self.sample_idx]?
self.sample_positions[self.sample_idx] is right.
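The distinction in this exchange can be sketched as follows: `sample_idx` is a cursor into the list of sampled positions, while the row to read is the position stored at that cursor. `FakeBlobReader`, `SampleReader`, and `read_next` are hypothetical stand-ins, and the half-open `[start_idx, end_idx)` semantics of `read_arrow_batch` are an assumption inferred from the quoted hunk.

```python
from typing import List, Optional


class FakeBlobReader:
    """Stand-in for a blob format reader; not the pypaimon class."""

    def __init__(self, rows: List[str]):
        self.rows = rows

    def read_arrow_batch(self, start_idx: int, end_idx: int) -> List[str]:
        # Returns rows in the half-open range [start_idx, end_idx).
        return self.rows[start_idx:end_idx]


class SampleReader:
    def __init__(self, reader: FakeBlobReader, sample_positions: List[int]):
        self.reader = reader
        self.sample_positions = sample_positions
        self.sample_idx = 0  # cursor into sample_positions, not a row index

    def read_next(self) -> Optional[List[str]]:
        if self.sample_idx >= len(self.sample_positions):
            return None  # all sampled rows have been read
        # Look up the actual row position before advancing the cursor.
        pos = self.sample_positions[self.sample_idx]
        self.sample_idx += 1
        return self.reader.read_arrow_batch(start_idx=pos, end_idx=pos + 1)
```

Passing `self.sample_idx` directly as the row index would read rows 0, 1, 2, ... regardless of which positions were sampled.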
1775a35 to c8e841a
JingsongLi left a comment
Please refactor full_starting_scanner.py first. It is too complicated.
👌
c8e841a to a42a2ce
    for bunch in fields_files:
        if bunch.row_count() != row_count:
            raise ValueError("All files in a field merge split should have the same row count.")
When sampling, only some of the blob files for a data file are filtered out together, so the row counts differ.
    if take_idxes:
        return batch.take(take_idxes)
    else:  # batch is outside the desired range
        return self.read_arrow_batch()
I'm afraid a RecursionError will be raised in production when many consecutive batches fall outside the sample range, e.g. with large files and sparse sampling.
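A minimal sketch of one way to address this concern: replace the recursive call with a loop, so skipping N empty batches costs N iterations instead of N stack frames (CPython's default recursion limit is 1000). The class `SampledBatchReader` and its fields are hypothetical, and plain Python lists stand in for Arrow record batches so the example runs without pyarrow.

```python
from typing import Iterable, List, Optional


class SampledBatchReader:
    """Hypothetical reader that yields only sampled rows from a batch stream."""

    def __init__(self, batches: Iterable[List[int]], sample_positions: List[int]):
        self._batches = iter(batches)
        self._wanted = set(sample_positions)
        self._offset = 0  # global row index of the next batch's first row

    def read_arrow_batch(self) -> Optional[List[int]]:
        # Loop rather than recurse while batches contain no sampled rows.
        for batch in self._batches:
            start = self._offset
            self._offset += len(batch)
            take_idxes = [i for i in range(len(batch)) if start + i in self._wanted]
            if take_idxes:
                # Stands in for batch.take(take_idxes) on an Arrow batch.
                return [batch[i] for i in take_idxes]
        return None  # input exhausted


# Sparse sampling: 5000 single-row batches, only the last row is sampled.
reader = SampledBatchReader(([i] for i in range(5000)), [4999])
first = reader.read_arrow_batch()
```

Here the reader skips 4999 batches before finding a match, which would exceed the default recursion limit if each skip were a nested call.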
I am rather doubtful that the sampling we provide meets the business requirements. I will do more research.
👌