[fix](variant) Keep empty key sparse during compaction#64641
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Fixes a Variant compaction edge case where the empty JSON key path ("", which can also collide with the Variant root path representation) could be incorrectly materialized as a regular subcolumn during compaction—causing Tags[''] values to be lost (read back as NULL) when sparse hash sharding is enabled and default_variant_max_subcolumns_count = 0.
Changes:
- Update Variant compaction path-selection helpers to always keep empty paths in the sparse path set (never materialize as subcolumns).
- Add a regression test covering the sparse-bucket empty-key scenario and expected query output.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| regression-test/suites/variant_p0/test_variant_empty_key_sparse_bucket.groovy | Adds a non-concurrent regression suite that reproduces the empty-key loss after cumulative compaction with sparse hash sharding. |
| regression-test/data/variant_p0/test_variant_empty_key_sparse_bucket.out | Adds the expected result set for the new regression suite query on Tags['']. |
| be/src/exec/common/variant_util.cpp | Ensures empty paths are routed to sparse_path_set across compaction schema/path selection helpers to prevent incorrect subcolumn materialization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } else { | ||
| // Apply all paths as subcolumns | ||
| for (const auto& [path, _] : stats) { |
|
/review |
There was a problem hiding this comment.
I reviewed the full PR diff and the surrounding variant compaction/read-write path. I did not find a blocking correctness issue.
Code-review checkpoint conclusions:
- Goal/test: The patch keeps the empty JSON key path out of materialized subcolumn selection during compaction and preserves it through sparse storage; the new regression exercises the sparse-bucket/max-subcolumns-0 case after cumulative compaction.
- Scope: The code change is small and focused on the compaction path-selection helpers that can otherwise place
""insub_path_set. - Concurrency/lifecycle: No new shared state, locks, atomics, lifecycle ownership, or static initialization concerns are introduced.
- Configuration/compatibility: No new config items or FE-BE protocol/storage format changes are introduced. The change affects how compaction chooses the temporary output schema and remains compatible with existing rowsets.
- Parallel paths: I checked the path stats, typed path, subpath, and data-type materialization routes; empty paths are routed away from materialized subcolumns in the relevant compaction helpers. The existing Copilot thread already covers the stale comment in
get_subpaths, so I am not duplicating that comment. - Tests/results: The regression follows the Doris regression conventions: drop before use, hardcoded table name, ordered query output, generated
.out, and coverage of the empty-key sparse-bucket case. I did not run the regression locally in this review environment. - Observability/performance: No new observability appears necessary for this narrow compaction selection fix; the added checks are constant-time and not a meaningful hot-path regression.
- Data correctness: The downstream sparse merge reader uses
sparse_path_setto pull old materialized empty-key subcolumns back into sparse storage andsub_path_setto suppress materialized paths. The patch removes the empty key from materialization/suppression and keeps sparse values visible after compaction.
User focus: no additional focus points were provided, so I performed a full PR review. CI note: the failed macOS BE UT job appears unrelated to this patch; it exits before build/test because the runner reports JDK 25 instead of the required JDK 17.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
2c75246 to
a967c66
Compare
|
run buildall |
|
/review |
|
run check_coverage |
There was a problem hiding this comment.
No blocking findings from this review.
I traced the change through VariantCompactionUtil::get_extended_compaction_schema, the empty-key/root distinction in PathInData, sparse merge reader consumption of PathsSetInfo, and the added BE/regression coverage. The updated helpers consistently keep the empty variant path out of materialized compaction subcolumns and route it to sparse output, including the variant_max_subcolumns_count = 0 + sparse bucket case covered by the new regression. The existing Copilot thread about the stale Apply all paths as subcolumns comment is already known; I did not duplicate it.
Critical checkpoint conclusions:
- Goal and proof: addresses empty-key data loss during variant compaction; covered by focused BE unit assertions and an end-to-end regression that checks
Tags['']after cumulative compaction. - Scope: small, focused, and localized to variant compaction path selection plus tests.
- Concurrency/lifecycle: no new concurrency, shared mutable state, locks, atomics, or lifecycle/static-init behavior.
- Config/compatibility/protocol: no new config item, storage format, or FE/BE protocol change; existing variant options are used.
- Parallel paths: limited top-N selection, unlimited selection, typed paths, subpath materialization, and data-type-driven materialization all handle the empty path consistently.
- Data correctness/transactions: compaction transaction and visible-version behavior are unchanged; the empty key is preserved by keeping it in sparse output instead of colliding with the root/materialized subcolumn path.
- Tests/results: added regression output is ordered by the selected value and the BE helper tests cover the changed branches. I did not run the test suite during this review.
- Observability/performance: no new observability requirement or material performance concern found.
- BE exec-specific checks: no pipeline dependency, memory reservation/spill, dependency concurrency, or atomic behavior is touched.
- BE test-specific checks: access-control bypass note is not relevant to these unit changes.
- User focus: no additional user-provided focus points.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29609 ms |
TPC-H: Total hot run time: 28796 ms |
TPC-DS: Total hot run time: 175766 ms |
ClickBench: Total hot run time: 25.29 s |
What problem does this PR solve?
Issue Number: DORIS-26505
Related PR: N/A
Problem Summary:
On master, variant compaction can materialize the empty JSON key path as a regular subcolumn when
default_variant_max_subcolumns_count = 0and sparse hash sharding is enabled. The empty path also represents the variant root path, so after cumulative compaction the values fromTags['']can be lost and read back as NULL.This PR keeps empty paths in the sparse path set instead of materializing them as subcolumns in all compaction path selection helpers, and adds a regression that reproduces the sparse-bucket empty-key case.
Release note
None
Check List (For Author)
Manual test / local verification:
Built master BE/FE with
BUILD_TYPE=ASAN USE_MEM_TRACKER=ON bash build.sh --be --fe.Reproduced on unmodified master with
default_variant_max_subcolumns_count = 0,default_variant_enable_doc_mode = false,use_v3_storage_format = false,default_variant_enable_typed_paths_to_sparse = false,default_variant_sparse_hash_shard_count = 3; before compactionTags['']returned the inserted empty-key values, after cumulative compaction all rows read as NULL.Rebuilt BE after the fix with
BUILD_TYPE=ASAN USE_MEM_TRACKER=ON bash build.sh --be.Verified the same manual repro after cumulative compaction preserves the empty-key values.
./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s test_variant_empty_key_sparse_bucket -forceGenOut./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s test_variant_empty_key_sparse_bucket./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s regression_test_variant_column_name./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s test_variant_compaction_empty_path_bugBehavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)