[python] Support random sample for append table by discivigour · Pull Request #7014 · apache/paimon

discivigour · 2026-01-12T14:06:25Z

Purpose

Support random sample for append table.
The current FullStartingScanner class is a bit bloated. I will refactor it in the next PR.

Tests

RESTSimpleTest.test_with_sample()
DataBlobWriterTest.test_data_blob_writer_with_sample()

API and Format

TableScan.with_sample()

Documentation

JingsongLi

Can you do the refactor in a separate PR?

JingsongLi · 2026-01-13T05:25:01Z

paimon-python/pypaimon/read/split.py

    _row_count: int
    _file_size: int
    shard_file_idx_map: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # file_name -> (start_idx, end_idx)
+    sample_file_idx_map: Dict[str, List[int]] = field(default_factory=dict)  # file_name -> [sample_indexes]


We need to merge sample_file_idx_map into shard_file_idx_map; maybe just introduce shard_file_idx_map: Dict[str, List[Range]].
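
One possible reading of this suggestion, as a minimal sketch: both a shard's contiguous slice and a sample's scattered row indexes can be described as a list of row ranges per file. The `Range` class, the `shard_to_ranges` and `sample_to_ranges` helpers, and the file names below are hypothetical and not part of pypaimon. The sketch also assumes half-open ranges, i.e. that `end_idx` is exclusive, which the thread does not state.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Range:
    """Hypothetical half-open row range [start, end) within one file."""
    start: int
    end: int


def shard_to_ranges(shard: Tuple[int, int]) -> List[Range]:
    # A shard is already one contiguous slice: (start_idx, end_idx).
    # Assumption: end_idx is exclusive.
    start, end = shard
    return [Range(start, end)]


def sample_to_ranges(indexes: List[int]) -> List[Range]:
    # Collapse sorted, de-duplicated row indexes into contiguous ranges.
    ranges: List[Range] = []
    for idx in sorted(set(indexes)):
        if ranges and ranges[-1].end == idx:
            # idx directly follows the previous range, so extend it by one row.
            ranges[-1] = Range(ranges[-1].start, idx + 1)
        else:
            ranges.append(Range(idx, idx + 1))
    return ranges


# A single map can then cover both cases: file_name -> [Range, ...]
file_idx_map: Dict[str, List[Range]] = {
    "data-0.parquet": shard_to_ranges((0, 100)),
    "data-1.blob": sample_to_ranges([7, 3, 4, 5, 20]),
}
```

With this shape, a reader would only need one code path that iterates over ranges, whether they came from sharding or from sampling.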

discivigour · 2026-01-13T05:42:09Z

Can you do the refactor in a separate PR?

Sure.

XiaoHongbo-Hope · 2026-01-13T05:58:22Z

paimon-python/pypaimon/read/reader/shard_batch_reader.py

+        if isinstance(self.reader.format_reader, FormatBlobReader):
+            # For blob reader, pass begin_idx and end_idx parameters
+            self.sample_idx += 1
+            return self.reader.read_arrow_batch(start_idx=self.sample_idx - 1, end_idx=self.sample_idx)


Should this be sample_idx or self.sample_positions[self.sample_idx]?

self.sample_positions[self.sample_idx] is right.
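
To illustrate the distinction, here is a minimal sketch in which `sample_idx` is only a cursor into the list of sampled positions, and the row actually read is `sample_positions[sample_idx]`. The `SampleCursor` class and its `next_row_range` method are hypothetical names; only `sample_idx`, `sample_positions`, `start_idx`, and `end_idx` come from the thread.

```python
from typing import List, Optional, Tuple


class SampleCursor:
    """Hypothetical helper: walks sampled row positions one at a time."""

    def __init__(self, sample_positions: List[int]):
        self.sample_positions = sorted(sample_positions)
        self.sample_idx = 0  # cursor into sample_positions, not a row number

    def next_row_range(self) -> Optional[Tuple[int, int]]:
        # Returns (start_idx, end_idx) for the next sampled row, or None when done.
        if self.sample_idx >= len(self.sample_positions):
            return None
        row = self.sample_positions[self.sample_idx]  # the actual row to read
        self.sample_idx += 1
        return row, row + 1


cursor = SampleCursor([2, 9, 40])
ranges = []
row_range = cursor.next_row_range()
while row_range is not None:
    ranges.append(row_range)
    row_range = cursor.next_row_range()
```

Using the cursor itself as the row number would read rows 0, 1, 2 here instead of rows 2, 9, 40.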

JingsongLi

Please refactor full_starting_scanner.py first. It is too complicated.

discivigour · 2026-01-13T12:07:08Z

Please refactor full_starting_scanner.py first. It is too complicated.

👌

# Conflicts: # paimon-python/pypaimon/read/scanner/full_starting_scanner.py # paimon-python/pypaimon/read/split.py # paimon-python/pypaimon/read/split_read.py

XiaoHongbo-Hope · 2026-01-16T10:51:20Z

paimon-python/pypaimon/read/split_read.py

-
-        for bunch in fields_files:
-            if bunch.row_count() != row_count:
-                raise ValueError("All files in a field merge split should have the same row count.")


why remove this?

When sampling, only part of the blob files for a data file are filtered out together, so the row counts differ across files.

XiaoHongbo-Hope · 2026-01-16T11:17:46Z

paimon-python/pypaimon/read/reader/sample_batch_reader.py

+            if take_idxes:
+                return batch.take(take_idxes)
+            else:  # batch is outside the desired range
+                return self.read_arrow_batch()


I'm afraid a RecursionError will be raised when many batches fall outside the sample range, e.g. in production with large files and sparse sampling.
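
One way to address this concern, as a minimal sketch, is to skip non-matching batches in a loop rather than by a recursive call. The `IterativeSampleReader` class below is a hypothetical stand-in, not the code in this PR: batches are plain Python lists instead of Arrow batches, and the list comprehension stands in for `batch.take(take_idxes)`.

```python
from typing import Iterator, List, Optional


class IterativeSampleReader:
    """Hypothetical stand-in for the reader under review; batches are plain lists."""

    def __init__(self, batches: Iterator[List[int]], sample_positions: List[int]):
        self.batches = batches
        self.sample_positions = sorted(sample_positions)
        self.sample_idx = 0   # next sampled position not yet emitted
        self.row_offset = 0   # global row number of the next batch's first row

    def read_batch(self) -> Optional[List[int]]:
        # Loop instead of recursing, so skipped batches do not grow the call stack.
        while self.sample_idx < len(self.sample_positions):
            batch = next(self.batches, None)
            if batch is None:
                return None  # input exhausted
            start, end = self.row_offset, self.row_offset + len(batch)
            self.row_offset = end
            take_idxes = []
            # Collect every sampled position that falls inside this batch.
            while (self.sample_idx < len(self.sample_positions)
                   and self.sample_positions[self.sample_idx] < end):
                take_idxes.append(self.sample_positions[self.sample_idx] - start)
                self.sample_idx += 1
            if take_idxes:
                return [batch[i] for i in take_idxes]
            # Otherwise the batch holds no sampled row; continue with the next one.
        return None  # every sampled position has been emitted


# 5000 batches of 10 rows each, with a single sampled row near the end.
batches = ([b * 10 + i for i in range(10)] for b in range(5000))
reader = IterativeSampleReader(batches, sample_positions=[49995])
result = reader.read_batch()
```

Here 4999 batches are skipped before the sampled row is found, which would exceed CPython's default recursion limit of 1000 if each skip were a nested call.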

JingsongLi · 2026-01-19T03:02:45Z

I am rather doubtful about whether the sampling we provided meets the business requirements. I will do more research.

discivigour · 2026-01-19T03:07:30Z

I am rather doubtful about whether the sampling we provided meets the business requirements. I will do more research.

👌

discivigour force-pushed the randomSample branch from 2b7ac1c to 1dbd073 Compare January 13, 2026 03:23

discivigour closed this Jan 13, 2026

discivigour reopened this Jan 13, 2026

discivigour marked this pull request as ready for review January 13, 2026 03:44

JingsongLi reviewed Jan 13, 2026

XiaoHongbo-Hope reviewed Jan 13, 2026

discivigour force-pushed the randomSample branch from 1775a35 to c8e841a Compare January 13, 2026 06:52

JingsongLi requested changes Jan 13, 2026

umi added 6 commits January 15, 2026 11:21

noRefactor

ffd7556

# Conflicts: # paimon-python/pypaimon/read/scanner/full_starting_scanner.py # paimon-python/pypaimon/read/split.py # paimon-python/pypaimon/read/split_read.py

ver1

9b54f1c

ver1

e432b54

bitmap

5032579

fix

86262e4

fix

a42a2ce

discivigour force-pushed the randomSample branch from c8e841a to a42a2ce Compare January 16, 2026 06:11

add

d0fa741

discivigour marked this pull request as draft January 16, 2026 06:52

FIX

d3f5d3b

discivigour marked this pull request as ready for review January 16, 2026 08:22

lint

bce0491

XiaoHongbo-Hope reviewed Jan 16, 2026

iter

acc80be

discivigour requested a review from JingsongLi January 19, 2026 02:32
