[fix](variant) Keep empty key sparse during compaction #64641

Open
eldenmoon wants to merge 1 commit into apache:master from eldenmoon:branch-doris-26505-master

@eldenmoon

Member

What problem does this PR solve?

Issue Number: DORIS-26505

Related PR: N/A

Problem Summary:

On master, variant compaction can materialize the empty JSON key path as a regular subcolumn when default_variant_max_subcolumns_count = 0 and sparse hash sharding is enabled. The empty path also represents the variant root path, so after cumulative compaction the values from Tags[''] can be lost and read back as NULL.

This PR makes all compaction path selection helpers keep empty paths in the sparse path set instead of materializing them as subcolumns, and adds a regression test that reproduces the sparse-bucket empty-key case.
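The routing rule can be sketched roughly as follows. This is a minimal illustration, not the code in variant_util.cpp: `PathSets`, `route_all_paths`, the plain `std::string` path keys, and the shape of `stats` are hypothetical. Only the names `sub_path_set` and `sparse_path_set` are taken from the review discussion on this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the path sets chosen for the compaction output.
struct PathSets {
    std::set<std::string> sub_path_set;     // paths materialized as subcolumns
    std::set<std::string> sparse_path_set;  // paths kept in sparse storage
};

// Sketch of the branch that materializes every path: each path becomes a
// subcolumn, except the empty path, which is kept sparse because it would
// otherwise share its representation with the variant root.
// `stats` maps a path to a per-path statistic (an assumed shape).
PathSets route_all_paths(const std::map<std::string, std::size_t>& stats) {
    PathSets out;
    for (const auto& entry : stats) {
        const std::string& path = entry.first;
        if (path.empty()) {
            out.sparse_path_set.insert(path);
        } else {
            out.sub_path_set.insert(path);
        }
    }
    // Postcondition of the sketch: the empty path is never materialized.
    assert(out.sub_path_set.count("") == 0);
    return out;
}
```

Calling `route_all_paths` with stats for `""`, `"a"`, and `"a.b"` would place `""` in `sparse_path_set` and the other two paths in `sub_path_set`.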

Release note

None

Check List (For Author)

  • Test
    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

Manual test / local verification:

  • Built master BE/FE with BUILD_TYPE=ASAN USE_MEM_TRACKER=ON bash build.sh --be --fe.

  • Reproduced on unmodified master with default_variant_max_subcolumns_count = 0, default_variant_enable_doc_mode = false, use_v3_storage_format = false, default_variant_enable_typed_paths_to_sparse = false, default_variant_sparse_hash_shard_count = 3; before compaction Tags[''] returned the inserted empty-key values, after cumulative compaction all rows read as NULL.

  • Rebuilt BE after the fix with BUILD_TYPE=ASAN USE_MEM_TRACKER=ON bash build.sh --be.

  • Verified the same manual repro after cumulative compaction preserves the empty-key values.

  • ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s test_variant_empty_key_sparse_bucket -forceGenOut

  • ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s test_variant_empty_key_sparse_bucket

  • ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s regression_test_variant_column_name

  • ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s test_variant_compaction_empty_path_bug

  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings June 18, 2026 10:59
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon

Member Author

run buildall

Copilot AI left a comment

Contributor

Pull request overview

Fixes a Variant compaction edge case where the empty JSON key path ("", which can also collide with the Variant root path representation) could be incorrectly materialized as a regular subcolumn during compaction—causing Tags[''] values to be lost (read back as NULL) when sparse hash sharding is enabled and default_variant_max_subcolumns_count = 0.

Changes:

  • Update Variant compaction path-selection helpers to always keep empty paths in the sparse path set (never materialize as subcolumns).
  • Add a regression test covering the sparse-bucket empty-key scenario and expected query output.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| regression-test/suites/variant_p0/test_variant_empty_key_sparse_bucket.groovy | Adds a non-concurrent regression suite that reproduces the empty-key loss after cumulative compaction with sparse hash sharding. |
| regression-test/data/variant_p0/test_variant_empty_key_sparse_bucket.out | Adds the expected result set for the new regression suite query on Tags['']. |
| be/src/exec/common/variant_util.cpp | Ensures empty paths are routed to sparse_path_set across compaction schema/path selection helpers to prevent incorrect subcolumn materialization. |

Comment on lines 1014 to 1016
    } else {
        // Apply all paths as subcolumns
        for (const auto& [path, _] : stats) {
@eldenmoon

Member Author

/review

@github-actions (Bot) left a comment

Contributor

I reviewed the full PR diff and the surrounding variant compaction/read-write path. I did not find a blocking correctness issue.

Code-review checkpoint conclusions:

  • Goal/test: The patch keeps the empty JSON key path out of materialized subcolumn selection during compaction and preserves it through sparse storage; the new regression exercises the sparse-bucket/max-subcolumns-0 case after cumulative compaction.
  • Scope: The code change is small and focused on the compaction path-selection helpers that can otherwise place "" in sub_path_set.
  • Concurrency/lifecycle: No new shared state, locks, atomics, lifecycle ownership, or static initialization concerns are introduced.
  • Configuration/compatibility: No new config items or FE-BE protocol/storage format changes are introduced. The change affects how compaction chooses the temporary output schema and remains compatible with existing rowsets.
  • Parallel paths: I checked the path stats, typed path, subpath, and data-type materialization routes; empty paths are routed away from materialized subcolumns in the relevant compaction helpers. The existing Copilot thread already covers the stale comment in get_subpaths, so I am not duplicating that comment.
  • Tests/results: The regression follows the Doris regression conventions: drop before use, hardcoded table name, ordered query output, generated .out, and coverage of the empty-key sparse-bucket case. I did not run the regression locally in this review environment.
  • Observability/performance: No new observability appears necessary for this narrow compaction selection fix; the added checks are constant-time and not a meaningful hot-path regression.
  • Data correctness: The downstream sparse merge reader uses sparse_path_set to pull old materialized empty-key subcolumns back into sparse storage and sub_path_set to suppress materialized paths. The patch removes the empty key from materialization/suppression and keeps sparse values visible after compaction.
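One plausible reading of the data-correctness point above, sketched for a single row. This is an assumption-laden illustration and not the Doris sparse merge reader: `Row`, `merge_sparse_row`, and the string-valued cells are hypothetical, and the real reader works on columns rather than per-row maps.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical row shape: path -> serialized value.
using Row = std::map<std::string, std::string>;

// Builds the sparse part of one output row.
// - Existing sparse entries are dropped when the output materializes that
//   path as a subcolumn (the path is in sub_path_set).
// - Values from old materialized subcolumns are pulled back into sparse
//   storage when the path is in sparse_path_set.
Row merge_sparse_row(const Row& old_subcolumns, const Row& old_sparse,
                     const std::set<std::string>& sub_path_set,
                     const std::set<std::string>& sparse_path_set) {
    // Precondition of the sketch: a path is materialized or sparse, not both.
    for (const auto& path : sub_path_set) {
        assert(sparse_path_set.count(path) == 0);
    }
    Row out;
    for (const auto& kv : old_sparse) {
        if (sub_path_set.count(kv.first) == 0) {
            out.insert(kv);
        }
    }
    for (const auto& kv : old_subcolumns) {
        if (sparse_path_set.count(kv.first) != 0) {
            out.insert(kv);
        }
    }
    return out;
}
```

Under this model, placing `""` in `sub_path_set` suppresses the sparse value stored for `Tags['']`, while placing it in `sparse_path_set` keeps the value in the sparse output.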

User focus: no additional focus points were provided, so I performed a full PR review. CI note: the failed macOS BE UT job appears unrelated to this patch; it exits before build/test because the runner reports JDK 25 instead of the required JDK 17.

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 25.00% (5/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.42% (21361/39252)
Line Coverage 38.03% (204136/536840)
Region Coverage 34.01% (160099/470742)
Branch Coverage 35.04% (70155/200207)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 40.00% (8/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.08% (28392/38326)
Line Coverage 57.97% (309372/533688)
Region Coverage 54.79% (259109/472885)
Branch Coverage 56.11% (112487/200462)

@eldenmoon force-pushed the branch-doris-26505-master branch from 2c75246 to a967c66 on June 18, 2026 13:58
@eldenmoon

Member Author

run buildall

@eldenmoon

Member Author

/review

@eldenmoon

Member Author

run check_coverage

@github-actions (Bot) left a comment

Contributor

No blocking findings from this review.

I traced the change through VariantCompactionUtil::get_extended_compaction_schema, the empty-key/root distinction in PathInData, sparse merge reader consumption of PathsSetInfo, and the added BE/regression coverage. The updated helpers consistently keep the empty variant path out of materialized compaction subcolumns and route it to sparse output, including the variant_max_subcolumns_count = 0 + sparse bucket case covered by the new regression. The existing Copilot thread about the stale "Apply all paths as subcolumns" comment is already known; I did not duplicate it.

Critical checkpoint conclusions:

  • Goal and proof: addresses empty-key data loss during variant compaction; covered by focused BE unit assertions and an end-to-end regression that checks Tags[''] after cumulative compaction.
  • Scope: small, focused, and localized to variant compaction path selection plus tests.
  • Concurrency/lifecycle: no new concurrency, shared mutable state, locks, atomics, or lifecycle/static-init behavior.
  • Config/compatibility/protocol: no new config item, storage format, or FE/BE protocol change; existing variant options are used.
  • Parallel paths: limited top-N selection, unlimited selection, typed paths, subpath materialization, and data-type-driven materialization all handle the empty path consistently.
  • Data correctness/transactions: compaction transaction and visible-version behavior are unchanged; the empty key is preserved by keeping it in sparse output instead of colliding with the root/materialized subcolumn path.
  • Tests/results: added regression output is ordered by the selected value and the BE helper tests cover the changed branches. I did not run the test suite during this review.
  • Observability/performance: no new observability requirement or material performance concern found.
  • BE exec-specific checks: no pipeline dependency, memory reservation/spill, dependency concurrency, or atomic behavior is touched.
  • BE test-specific checks: access-control bypass note is not relevant to these unit changes.
  • User focus: no additional user-provided focus points.
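The limited top-N selection mentioned under "Parallel paths" could look roughly like the sketch below. It is illustrative only: `PathSets`, `select_top_n`, the ranking by a per-path count, and the tie-break by path name are assumptions, and the sketch covers only the case where a positive subcolumn limit applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the path sets chosen for the compaction output.
struct PathSets {
    std::set<std::string> sub_path_set;     // paths materialized as subcolumns
    std::set<std::string> sparse_path_set;  // paths kept in sparse storage
};

// Picks at most `max_subcolumns` paths to materialize, ranked by an assumed
// per-path count. The empty path is never a candidate and always stays sparse,
// no matter how high its count is.
PathSets select_top_n(const std::map<std::string, std::size_t>& stats,
                      std::size_t max_subcolumns) {
    typedef std::pair<std::string, std::size_t> Candidate;
    PathSets out;
    std::vector<Candidate> candidates;
    for (const auto& entry : stats) {
        if (entry.first.empty()) {
            out.sparse_path_set.insert(entry.first);
        } else {
            candidates.emplace_back(entry.first, entry.second);
        }
    }
    // Highest count first; ties broken by path name so the result is stable.
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) -> bool {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (i < max_subcolumns) {
            out.sub_path_set.insert(candidates[i].first);
        } else {
            out.sparse_path_set.insert(candidates[i].first);
        }
    }
    // Postcondition of the sketch: the empty path is never materialized.
    assert(out.sub_path_set.count("") == 0);
    return out;
}
```

Excluding the empty path before ranking, rather than filtering it out afterwards, means it cannot take one of the N materialized slots away from another path.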

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.43% (21365/39252)
Line Coverage 38.07% (204371/536840)
Region Coverage 34.06% (160348/470742)
Branch Coverage 35.06% (70198/200207)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.05% (28380/38326)
Line Coverage 58.04% (309734/533688)
Region Coverage 54.94% (259791/472885)
Branch Coverage 56.15% (112566/200462)

@hello-stephen

Contributor
TPC-H: Total hot run time: 29609 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a967c66b3aea2e5fdc1439dc418dd9b69ab8f1c6, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17811	4318	4266	4266
q2	2009	330	195	195
q3	10276	1446	853	853
q4	4678	471	342	342
q5	7500	874	576	576
q6	196	176	141	141
q7	777	850	605	605
q8	9355	1804	1673	1673
q9	5920	4506	4522	4506
q10	6746	1798	1516	1516
q11	435	279	250	250
q12	621	425	314	314
q13	18184	3414	2843	2843
q14	272	268	255	255
q15	q16	790	784	709	709
q17	986	847	1037	847
q18	7145	5909	5617	5617
q19	1341	1259	1044	1044
q20	505	403	274	274
q21	5867	2586	2472	2472
q22	433	370	311	311
Total cold run time: 101847 ms
Total hot run time: 29609 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4581	4517	4492	4492
q2	345	381	238	238
q3	4688	4993	4409	4409
q4	2083	2165	1394	1394
q5	4430	4352	4339	4339
q6	228	175	132	132
q7	2061	2200	1727	1727
q8	2716	2327	2344	2327
q9	8462	8178	8127	8127
q10	4795	4811	4349	4349
q11	601	464	430	430
q12	776	773	552	552
q13	3392	3646	2972	2972
q14	305	313	282	282
q15	q16	724	754	675	675
q17	1390	1378	1546	1378
q18	7960	7515	7471	7471
q19	1194	1105	1136	1105
q20	2234	2238	1968	1968
q21	5377	4725	4602	4602
q22	516	479	423	423
Total cold run time: 58858 ms
Total hot run time: 53392 ms

@hello-stephen

Contributor
TPC-H: Total hot run time: 28796 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a967c66b3aea2e5fdc1439dc418dd9b69ab8f1c6, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17895	4007	4055	4007
q2	2020	303	182	182
q3	10314	1335	825	825
q4	4683	463	336	336
q5	7490	840	570	570
q6	207	186	142	142
q7	775	833	624	624
q8	9994	1480	1522	1480
q9	6428	4424	4449	4424
q10	6870	1782	1512	1512
q11	431	275	248	248
q12	662	415	304	304
q13	18200	3433	2733	2733
q14	271	262	246	246
q15	q16	794	783	703	703
q17	1042	992	969	969
q18	7036	5667	5566	5566
q19	1404	1369	1118	1118
q20	495	406	266	266
q21	5795	2670	2242	2242
q22	419	362	299	299
Total cold run time: 103225 ms
Total hot run time: 28796 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4545	4409	4370	4370
q2	332	355	229	229
q3	4660	4947	4426	4426
q4	2100	2179	1370	1370
q5	4470	4354	4345	4345
q6	231	176	131	131
q7	1757	2139	1794	1794
q8	2532	2254	2215	2215
q9	8079	8006	7954	7954
q10	4802	4749	4316	4316
q11	627	525	408	408
q12	773	744	537	537
q13	3356	3573	3071	3071
q14	313	289	265	265
q15	q16	714	732	632	632
q17	1370	1396	1380	1380
q18	7866	7292	6865	6865
q19	1145	1123	1103	1103
q20	2223	2209	1953	1953
q21	5335	4785	4484	4484
q22	527	461	412	412
Total cold run time: 57757 ms
Total hot run time: 52260 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 175766 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a967c66b3aea2e5fdc1439dc418dd9b69ab8f1c6, data reload: false

query5	4318	631	485	485
query6	423	186	166	166
query7	4859	570	312	312
query8	369	212	198	198
query9	8762	4110	4102	4102
query10	439	317	259	259
query11	5924	2338	2125	2125
query12	154	103	106	103
query13	1317	624	455	455
query14	6441	5371	5045	5045
query14_1	4399	4506	4412	4412
query15	204	195	181	181
query16	997	491	440	440
query17	1133	717	587	587
query18	2738	474	353	353
query19	224	189	152	152
query20	117	112	105	105
query21	221	141	120	120
query22	13585	13645	13435	13435
query23	17311	16424	16031	16031
query23_1	16356	16227	16202	16202
query24	7593	1773	1308	1308
query24_1	1333	1331	1331	1331
query25	575	461	397	397
query26	1313	326	168	168
query27	2690	540	330	330
query28	4448	2050	2054	2050
query29	1083	627	523	523
query30	318	238	197	197
query31	1110	1076	958	958
query32	110	63	60	60
query33	531	330	263	263
query34	1186	1171	686	686
query35	766	854	692	692
query36	1446	1463	1330	1330
query37	168	123	85	85
query38	3268	3197	3110	3110
query39	953	943	924	924
query39_1	905	865	887	865
query40	215	118	95	95
query41	63	60	59	59
query42	89	92	99	92
query43	321	330	273	273
query44	1494	784	783	783
query45	195	184	177	177
query46	1079	1205	710	710
query47	2329	2300	2261	2261
query48	420	402	261	261
query49	623	484	342	342
query50	970	371	265	265
query51	4335	4329	4349	4329
query52	85	86	75	75
query53	238	269	189	189
query54	254	209	195	195
query55	78	75	72	72
query56	224	224	217	217
query57	1412	1442	1298	1298
query58	241	206	209	206
query59	1564	1621	1418	1418
query60	273	239	233	233
query61	146	142	152	142
query62	689	657	588	588
query63	231	193	193	193
query64	2487	756	620	620
query65	4880	4815	4801	4801
query66	1727	443	336	336
query67	29830	29653	29580	29580
query68	3270	1652	969	969
query69	410	303	261	261
query70	1078	999	986	986
query71	293	235	213	213
query72	2931	2672	2328	2328
query73	843	816	450	450
query74	5085	5013	4733	4733
query75	2608	2696	2217	2217
query76	2311	1174	809	809
query77	356	376	288	288
query78	12369	12518	11853	11853
query79	2453	1197	798	798
query80	1715	510	384	384
query81	528	276	238	238
query82	603	165	120	120
query83	314	266	251	251
query84	253	145	115	115
query85	883	517	421	421
query86	436	310	290	290
query87	3384	3331	3187	3187
query88	3681	2780	2778	2778
query89	415	375	337	337
query90	1937	183	180	180
query91	174	159	131	131
query92	64	62	58	58
query93	1678	1456	856	856
query94	702	331	303	303
query95	667	384	427	384
query96	1066	816	343	343
query97	2725	2680	2577	2577
query98	212	207	199	199
query99	1160	1170	1049	1049
Total cold run time: 263594 ms
Total hot run time: 175766 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a967c66b3aea2e5fdc1439dc418dd9b69ab8f1c6, data reload: false

query1	0.01	0.01	0.01
query2	0.11	0.05	0.04
query3	0.26	0.15	0.14
query4	1.61	0.14	0.14
query5	0.26	0.23	0.22
query6	1.18	1.08	1.09
query7	0.04	0.01	0.00
query8	0.10	0.04	0.03
query9	0.39	0.31	0.33
query10	0.57	0.57	0.56
query11	0.19	0.14	0.15
query12	0.20	0.14	0.15
query13	0.48	0.49	0.49
query14	1.01	1.02	1.01
query15	0.62	0.59	0.58
query16	0.31	0.33	0.33
query17	1.14	1.16	1.06
query18	0.22	0.21	0.22
query19	2.05	1.97	1.90
query20	0.02	0.01	0.02
query21	15.43	0.22	0.14
query22	4.80	0.05	0.06
query23	16.13	0.31	0.12
query24	2.85	0.43	0.32
query25	0.12	0.05	0.05
query26	0.72	0.21	0.15
query27	0.05	0.04	0.04
query28	3.54	0.90	0.54
query29	12.47	4.35	3.49
query30	0.27	0.14	0.16
query31	2.77	0.62	0.31
query32	3.22	0.60	0.49
query33	3.19	3.22	3.28
query34	15.57	4.23	3.53
query35	3.54	3.58	3.53
query36	0.54	0.45	0.41
query37	0.10	0.07	0.06
query38	0.05	0.04	0.04
query39	0.04	0.04	0.03
query40	0.18	0.16	0.14
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.52 s
Total hot run time: 25.29 s

3 participants