[python] Fix Ray read_paimon dropping nested projection (reads nested leaves as NULL) #8269
Merged
JingsongLi merged 2 commits on Jun 19, 2026
… leaves as NULL) The Ray datasource rebuilds a worker-side TableRead from the split provider's (table, read_type, predicate, limit) but not its nested_name_paths, so a nested-leaf projection such as read_paimon(..., projection=['payload.a']) read every projected leaf as NULL: the worker treated the flattened leaf name as a missing top-level column. SplitProvider now exposes nested_name_paths() (resolved by the read builder for CatalogSplitProvider, carried from the source TableRead for PreResolvedSplitProvider), and RayDatasource forwards it into the per-task TableRead. Adds a Ray nested-projection regression test.
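The failure mode described in this commit can be illustrated with a toy model. This is a minimal sketch, not pypaimon code: project_row, its parameters, and the sample row are hypothetical names introduced only for illustration. It assumes that a projection is resolved either by plain top-level lookup (no path context) or by walking a list of field-name paths.

```python
from typing import Any, Dict, Optional, Sequence

Row = Dict[str, Any]


def project_row(
    row: Row,
    flat_names: Sequence[str],
    nested_name_paths: Optional[Sequence[Sequence[str]]] = None,
) -> Row:
    """Toy projection (hypothetical): resolve each output column from a row."""
    out: Row = {}
    for i, flat_name in enumerate(flat_names):
        if nested_name_paths is None:
            # No path context: the flattened name is looked up as a
            # top-level column, so a nested leaf appears to be missing.
            out[flat_name] = row.get(flat_name)
            continue
        # With path context: walk the struct one field at a time.
        value: Any = row
        for part in nested_name_paths[i]:
            value = value.get(part) if isinstance(value, dict) else None
        out[flat_name] = value
    return out


stored = {"id": 1, "payload": {"a": "x", "b": "y"}}
flat = ["id", "payload_a"]           # flattened output names
paths = [["id"], ["payload", "a"]]   # one path per output column

without_paths = project_row(stored, flat)
with_paths = project_row(stored, flat, paths)
print(without_paths)  # {'id': 1, 'payload_a': None}
print(with_paths)     # {'id': 1, 'payload_a': 'x'}
```

In this model the only difference between the two calls is whether the path list reaches the reader, which mirrors the difference between the worker-side read before and after the fix.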
QuakeWang
reviewed
Jun 18, 2026
…w-up) Cover the second Ray read entry point fixed in this PR: TableRead.to_ray() uses PreResolvedSplitProvider, which must also forward nested_name_paths to the worker TableRead. Mirrors the read_paimon() nested-projection test, so both provider paths are guarded against reading a projected nested leaf as NULL.
008bba1 to 56da93b
Contributor
Reviewed the Ray nested projection fix. The new nested_name_paths plumbing is carried through both CatalogSplitProvider/read_paimon and PreResolvedSplitProvider/TableRead.to_ray, so the worker-side TableRead now has the same nested projection context as the normal read path. The added tests cover both Ray entry points. LGTM.
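The plumbing through both provider paths can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's code: every Toy* class, build_worker_read, and the way paths are derived from dotted names are hypothetical. Only the idea is taken from the PR: one provider resolves the paths itself, the other carries them from an existing read, and the worker-side read receives them either way.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol

Paths = Optional[List[List[str]]]


@dataclass
class ToyTableRead:
    """Hypothetical stand-in for a read object; not pypaimon's TableRead."""
    read_type: List[str]
    nested_name_paths: Paths = None


class ToySplitProvider(Protocol):
    """Hypothetical provider interface exposing the projection context."""
    def read_type(self) -> List[str]: ...
    def nested_name_paths(self) -> Paths: ...


@dataclass
class ToyCatalogProvider:
    """Resolves the paths itself from a dotted projection (assumption)."""
    projection: List[str]

    def read_type(self) -> List[str]:
        return [p.replace(".", "_") for p in self.projection]

    def nested_name_paths(self) -> Paths:
        paths = [p.split(".") for p in self.projection]
        # Top-level-only projections carry no nested context.
        return paths if any(len(p) > 1 for p in paths) else None


@dataclass
class ToyPreResolvedProvider:
    """Carries the paths over from an already-built read."""
    source_read: ToyTableRead

    def read_type(self) -> List[str]:
        return self.source_read.read_type

    def nested_name_paths(self) -> Paths:
        return self.source_read.nested_name_paths


def build_worker_read(provider: ToySplitProvider) -> ToyTableRead:
    # Forward the nested paths together with the read type.
    return ToyTableRead(provider.read_type(), provider.nested_name_paths())


worker_read = build_worker_read(ToyCatalogProvider(["id", "payload.a"]))
carried_read = build_worker_read(ToyPreResolvedProvider(worker_read))
top_level_read = build_worker_read(ToyCatalogProvider(["id"]))
```

In this sketch a top-level-only projection yields None for the paths, so forwarding them changes nothing for such reads.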
Purpose
RayDatasource rebuilds the worker-side TableRead from the split provider's (table, read_type, predicate, limit) but not its nested_name_paths. As a result, read_paimon(..., projection=['payload.a']) (a nested-leaf projection) reads every projected leaf as NULL: the worker treats the flattened leaf name (e.g. payload_a) as a missing top-level column. The non-Ray read path is unaffected because ReadBuilder.new_read() already forwards nested_name_paths to TableRead.
Fix
SplitProvider exposes nested_name_paths(): resolved via the read builder for CatalogSplitProvider, carried from the source TableRead for PreResolvedSplitProvider (TableRead.to_ray). RayDatasource forwards it into the per-task worker TableRead.
The change is a no-op for non-nested / top-level-only projections (nested_name_paths is None there), so existing reads are unaffected.
Tests
Adds RayIntegrationTest.test_read_paimon_with_nested_projection, asserting a ['id', 'payload.a'] projection returns the real leaf values instead of NULL.
Does this PR introduce a user-facing change?
No.
Documentation
No documentation change needed.
Generative AI disclosure: drafted with AI assistance and reviewed by the author.