[python] Fix Ray read_paimon dropping nested projection (reads nested leaves as NULL) #8269

Merged
JingsongLi merged 2 commits into apache:master from TheR1sing3un:fix-ray-nested-projection
Jun 19, 2026

@TheR1sing3un

Member

Purpose

RayDatasource rebuilds the worker-side TableRead from the split provider's
(table, read_type, predicate, limit) but not its nested_name_paths. As a
result read_paimon(..., projection=['payload.a']) (a nested-leaf projection)
reads every projected leaf as NULL — the worker treats the flattened leaf
name (e.g. payload_a) as a missing top-level column. The non-Ray read path is
unaffected because ReadBuilder.new_read() already forwards nested_name_paths
to TableRead.
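
To make the failure mode concrete, here is a minimal, self-contained sketch that does not use pypaimon or Ray. `toy_read` and the list-of-paths shape of `nested_name_paths` are hypothetical stand-ins chosen for illustration, not the library's actual API or data layout. It only shows why a flattened leaf name such as `payload_a` resolves to NULL when the path information is missing.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def toy_read(
    rows: List[Row],
    flat_names: List[str],
    nested_name_paths: Optional[List[List[str]]] = None,
) -> List[Row]:
    """Hypothetical reader: project `flat_names` out of each row.

    With `nested_name_paths`, entry i is the path to walk for flat_names[i].
    Without it, each flat name is looked up as a top-level column.
    """
    out = []
    for row in rows:
        projected = {}
        for i, name in enumerate(flat_names):
            if nested_name_paths is None:
                # No path info: "payload_a" is not a top-level key, so None.
                projected[name] = row.get(name)
            else:
                value: Any = row
                for part in nested_name_paths[i]:
                    value = value.get(part) if isinstance(value, dict) else None
                projected[name] = value
        out.append(projected)
    return out


rows = [
    {"id": 1, "payload": {"a": 10, "b": "x"}},
    {"id": 2, "payload": {"a": 20, "b": "y"}},
]
flat_names = ["id", "payload_a"]           # flattened read type
paths = [["id"], ["payload", "a"]]         # assumed shape of the nested paths

with_paths = toy_read(rows, flat_names, paths)   # leaf values are resolved
without_paths = toy_read(rows, flat_names)       # leaf reads as None (NULL)
```

In this toy model, dropping `paths` is the analogue of the worker-side read being rebuilt without `nested_name_paths`.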

Fix

  • SplitProvider exposes nested_name_paths(): resolved via the read builder
    for CatalogSplitProvider, carried from the source TableRead for
    PreResolvedSplitProvider (TableRead.to_ray).
  • RayDatasource forwards it into the per-task worker TableRead.

The change is a no-op for non-nested / top-level-only projections
(nested_name_paths is None there), so existing reads are unaffected.
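
The plumbing described above can be sketched roughly as follows. All `Toy*` classes and `build_worker_read` are hypothetical stand-ins written for this illustration. They mirror the names in this PR but are not the pypaimon implementation, and the list-of-paths representation is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional

Paths = Optional[List[List[str]]]


@dataclass
class ToyTableRead:
    read_type: List[str]
    nested_name_paths: Paths = None


class ToyReadBuilder:
    """Hypothetical builder that knows the user's projection."""

    def __init__(self, projection: List[str]):
        self.projection = projection

    def read_type(self) -> List[str]:
        return [p.replace(".", "_") for p in self.projection]

    def nested_name_paths(self) -> Paths:
        # Top-level-only projections carry no nested paths (None).
        if not any("." in p for p in self.projection):
            return None
        return [p.split(".") for p in self.projection]

    def new_read(self) -> ToyTableRead:
        return ToyTableRead(self.read_type(), self.nested_name_paths())


class ToySplitProvider(ABC):
    @abstractmethod
    def read_type(self) -> List[str]:
        ...

    @abstractmethod
    def nested_name_paths(self) -> Paths:
        ...


class ToyCatalogSplitProvider(ToySplitProvider):
    """Resolves the paths through the read builder."""

    def __init__(self, builder: ToyReadBuilder):
        self._builder = builder

    def read_type(self) -> List[str]:
        return self._builder.read_type()

    def nested_name_paths(self) -> Paths:
        return self._builder.nested_name_paths()


class ToyPreResolvedSplitProvider(ToySplitProvider):
    """Carries the paths over from an existing source read."""

    def __init__(self, table_read: ToyTableRead):
        self._read = table_read

    def read_type(self) -> List[str]:
        return self._read.read_type

    def nested_name_paths(self) -> Paths:
        return self._read.nested_name_paths


def build_worker_read(provider: ToySplitProvider) -> ToyTableRead:
    # The forwarding step: pass the paths along with the read type.
    return ToyTableRead(provider.read_type(), provider.nested_name_paths())


nested = ToyReadBuilder(["id", "payload.a"])
worker_read = build_worker_read(ToyCatalogSplitProvider(nested))
```

In this sketch both provider flavours yield a worker read equal to `nested.new_read()`, and a top-level-only projection leaves `nested_name_paths` as `None`, which matches the no-op claim above.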

Tests

Adds RayIntegrationTest.test_read_paimon_with_nested_projection, asserting a
['id', 'payload.a'] projection returns the real leaf values instead of NULL.

Does this PR introduce a user-facing change?

No.

Documentation

No documentation change needed.


Generative AI disclosure: drafted with AI assistance and reviewed by the author.

… leaves as NULL)

The Ray datasource rebuilds a worker-side TableRead from the split
provider's (table, read_type, predicate, limit) but not its
nested_name_paths, so a nested-leaf projection such as
read_paimon(..., projection=['payload.a']) read every projected leaf as
NULL: the worker treated the flattened leaf name as a missing top-level
column.

SplitProvider now exposes nested_name_paths() (resolved by the read
builder for CatalogSplitProvider, carried from the source TableRead for
PreResolvedSplitProvider), and RayDatasource forwards it into the
per-task TableRead. Adds a Ray nested-projection regression test.
Comment thread paimon-python/pypaimon/tests/ray_integration_test.py
…w-up)

Cover the second Ray read entry point fixed in this PR: TableRead.to_ray()
uses PreResolvedSplitProvider, which must also forward nested_name_paths to
the worker TableRead. Mirrors the read_paimon() nested-projection test, so
both provider paths are guarded against reading a projected nested leaf as
NULL.
@TheR1sing3un TheR1sing3un force-pushed the fix-ray-nested-projection branch from 008bba1 to 56da93b June 18, 2026 15:34
@TheR1sing3un TheR1sing3un requested a review from QuakeWang June 18, 2026 15:43
@JingsongLi

Contributor

Reviewed the Ray nested projection fix. The new nested_name_paths plumbing is carried through both CatalogSplitProvider/read_paimon and PreResolvedSplitProvider/TableRead.to_ray, so the worker-side TableRead now has the same nested projection context as the normal read path. The added tests cover both Ray entry points. LGTM.

@QuakeWang QuakeWang left a comment

Member

+1

@JingsongLi JingsongLi merged commit 54545a9 into apache:master Jun 19, 2026
6 checks passed