[paimon] Add bitmap global index #8276
47a0532 to 9d2ca55
9d2ca55 to 9f23064
leaves12138
left a comment
Thanks for the update. I reviewed the bitmap global index storage format and query paths in detail.
The file layout looks sound to me: the fixed footer carries magic/version and block offsets, null/non-null row sets are separate Roaring64 bitmaps, value bitmaps are addressed through dictionary entries, and the dictionary block index keeps point lookup lazy. The newly added manifest-level min/max/null metadata is also a good optimization because it prunes non-candidate index files before opening the bitmap file, and the fallback scan budget is now applied only to selected files.
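To illustrate why the manifest-level min/max/null metadata helps, here is a minimal sketch of file pruning before any bitmap file is opened. The class, field, and method names are hypothetical and not taken from the PR, and it assumes long keys and that every file holds at least one non-null key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-file metadata. Field and method names are assumptions for
// illustration and are not taken from Paimon's SortedIndexFileMeta.
final class IndexFileMetaSketch {
    final String fileName;
    final long minKey;      // smallest non-null key stored in the file
    final long maxKey;      // largest non-null key stored in the file
    final boolean hasNulls; // true if at least one row in the file has a null key

    IndexFileMetaSketch(String fileName, long minKey, long maxKey, boolean hasNulls) {
        this.fileName = fileName;
        this.minKey = minKey;
        this.maxKey = maxKey;
        this.hasNulls = hasNulls;
    }

    // col = literal: a file can only match if the literal lies inside [minKey, maxKey].
    static List<String> selectForEquals(List<IndexFileMetaSketch> files, long literal) {
        List<String> selected = new ArrayList<>();
        for (IndexFileMetaSketch f : files) {
            if (f.minKey <= literal && literal <= f.maxKey) {
                selected.add(f.fileName);
            }
        }
        return selected;
    }

    // col IS NULL: only files that recorded null rows need to be opened.
    static List<String> selectForIsNull(List<IndexFileMetaSketch> files) {
        List<String> selected = new ArrayList<>();
        for (IndexFileMetaSketch f : files) {
            if (f.hasNulls) {
                selected.add(f.fileName);
            }
        }
        return selected;
    }
}
```

Under this sketch, a fallback scan budget would only need to account for the files returned by the selector, not for every index file in the manifest.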
I also checked the correctness-sensitive paths:
- equality / IN use serialized-key ordering consistently for dictionary block lookup;
- range predicates deserialize keys and use the logical comparator instead of serialized byte order;
- NOT / NOT IN complement only against each file's non-null bitmap, which keeps SQL null semantics and multi-file unions correct;
- relative row ids are preserved in the index file and offset by the outer range reader;
- string prefix lookup is safe with the current string key serializer, which stores raw UTF-8 bytes.
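The distinction between serialized-key order and logical order in the first two points can be shown with a small sketch. It assumes a big-endian two's-complement int encoding purely for illustration; the actual Paimon key serializers may encode integers differently, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Illustration only: the encoding below is an assumption, not Paimon's KeySerializer.
final class KeyOrderSketch {
    // Big-endian two's-complement encoding of an int key.
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    static int deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    // Unsigned lexicographic byte order, usable for exact-match dictionary lookup.
    static int compareSerialized(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Logical value order, which is what a range predicate has to respect.
    static int compareLogical(byte[] a, byte[] b) {
        return Integer.compare(deserialize(a), deserialize(b));
    }
}
```

With this encoding, -1 serializes to FF FF FF FF and sorts after 1 in byte order while sorting before it logically. Equality and IN only need a consistent order, so byte order is fine there, but a range predicate evaluated in byte order would return wrong rows for negative keys.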
I ran the focused tests locally on the latest head, plus a temporary check of equality and range lookups over negative numeric keys and multiple dictionary blocks:
mvn -pl paimon-common -Pfast-build -DfailIfNoTests=false -Dtest=BitmapGlobalIndexReaderTest test
mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=BitmapGlobalIndexTableTest test
Both passed. Non-blocking thought: for high-cardinality columns each distinct value still becomes one bitmap block plus one dictionary entry, so the docs' guidance to prefer BTree for high-cardinality/range-heavy workloads is important. With the default shard size this is acceptable for the intended enum/tag style use case.
2eac3e8 to 78f1b46
leaves12138
left a comment
Thanks for the update. I re-reviewed the latest head (78f1b46527263b280fa63338522deb8746832cf7) with focus on the bitmap storage format and the shared sorted-index abstractions.
The bitmap file layout still looks sound to me: the footer keeps magic/version and block offsets, null/non-null row bitmaps are stored separately, value bitmaps are referenced through dictionary entries, and the dictionary block index keeps point lookup lazy. The refactor to SortedIndexFileMeta, SortedFileMetaSelector, and ParallelFileGlobalIndexReader also looks reasonable and preserves the previous BTree behavior/compatibility from what I checked.
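For readers unfamiliar with fixed-footer formats, the following is a minimal sketch of how such a footer could be written and validated. The magic value, the number and meaning of the block handles, and the field order are all assumptions made for illustration; the PR's actual byte layout is not reproduced here.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-size footer: three (offset, length) block handles,
// a value count, a version, and a magic number at the very end of the file.
final class FooterSketch {
    static final int MAGIC = 0x424D4958; // arbitrary constant chosen for this sketch
    static final int VERSION = 1;
    static final int HANDLES = 3;        // assumed: null rows, non-null rows, dictionary block index
    static final int SIZE =
            HANDLES * (Long.BYTES + Integer.BYTES) + Long.BYTES + 2 * Integer.BYTES;

    final long[] offsets;
    final int[] lengths;
    final long valueCount;

    FooterSketch(long[] offsets, int[] lengths, long valueCount) {
        if (offsets.length != HANDLES || lengths.length != HANDLES) {
            throw new IllegalArgumentException("expected " + HANDLES + " block handles");
        }
        this.offsets = offsets;
        this.lengths = lengths;
        this.valueCount = valueCount;
    }

    byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(SIZE);
        for (int i = 0; i < HANDLES; i++) {
            buf.putLong(offsets[i]);
            buf.putInt(lengths[i]);
        }
        buf.putLong(valueCount);
        buf.putInt(VERSION);
        buf.putInt(MAGIC);
        return buf.array();
    }

    // Reads the last SIZE bytes of the file and rejects unknown magic or version.
    static FooterSketch decode(byte[] file) {
        if (file.length < SIZE) {
            throw new IllegalArgumentException("file is shorter than the footer");
        }
        ByteBuffer buf = ByteBuffer.wrap(file, file.length - SIZE, SIZE);
        long[] offsets = new long[HANDLES];
        int[] lengths = new int[HANDLES];
        for (int i = 0; i < HANDLES; i++) {
            offsets[i] = buf.getLong();
            lengths[i] = buf.getInt();
        }
        long valueCount = buf.getLong();
        int version = buf.getInt();
        int magic = buf.getInt();
        if (magic != MAGIC) {
            throw new IllegalStateException("bad magic");
        }
        if (version != VERSION) {
            throw new IllegalStateException("unsupported version " + version);
        }
        return new FooterSketch(offsets, lengths, valueCount);
    }
}
```

Because the footer has a fixed size, a reader can locate it from the file length alone and then seek directly to any block, which is what keeps point lookup lazy.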
I also ran the focused common tests locally (BitmapGlobalIndexReaderTest, SortedIndexFileMetaTest, BTreeFileMetaSelectorTest, BTreeThreadSafetyTest, and LazyFilteredBTreeIndexReaderTest), and they passed.
However, the current CI is failing on Checkstyle:
src/test/java/org/apache/paimon/globalindex/btree/BTreeGlobalIndexBuilderTest.java:[30] (imports) ImportOrder: Import org.apache.paimon.globalindex.KeySerializer appears after other imports that it should precede
Please fix the import ordering in BTreeGlobalIndexBuilderTest so the PR can go green.
leaves12138
left a comment
Thanks for the update. I pushed a small follow-up commit to fix the Checkstyle import ordering issue in BTreeGlobalIndexBuilderTest.
I re-checked the latest head and the only change after my previous review is the import reorder, so the previous storage-format review still stands: the bitmap file layout and the shared sorted-index abstractions look good to me.
3c9f655 to e6a6427
e6a6427 to 4de9f22
leaves12138
left a comment
Thanks for the update. I re-reviewed the latest head (2bce2202fdba0abc40a8d7f9add2b5a9c4fe6dd5).
The new split between SortedFileGlobalIndexReader, SortedFileMetaSelector, BitmapIndexReader, and LazyFilteredBitmapReader looks good to me. The bitmap storage layout is still sound, and the new compression only applies to dictionary/index blocks with per-block trailer/CRC, while the row bitmaps remain directly addressable. Manifest pruning, null semantics, fallback-scan budget, and multi-file complement behavior also look covered.
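The null semantics and multi-file complement behavior can be sketched as follows. This is a simplified model, not the PR's code: it uses java.util.BitSet with int row ids instead of 64-bit Roaring bitmaps, and the FileIndex class and its fields are hypothetical.

```java
import java.util.BitSet;
import java.util.List;

// Simplified model of evaluating "col != literal" over several index files.
final class NotEqualsSketch {
    // One index file; row ids inside it are relative to rowIdOffset.
    static final class FileIndex {
        final int rowIdOffset;
        final BitSet nonNullRows;  // relative ids of rows whose key is not null
        final BitSet matchingRows; // relative ids of rows whose key equals the literal

        FileIndex(int rowIdOffset, BitSet nonNullRows, BitSet matchingRows) {
            this.rowIdOffset = rowIdOffset;
            this.nonNullRows = nonNullRows;
            this.matchingRows = matchingRows;
        }
    }

    // Helper to build a BitSet from a list of row ids.
    static BitSet bits(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    // Per file: complement the match against that file's non-null rows only,
    // then shift by the file's offset and union into the global result.
    static BitSet notEquals(List<FileIndex> files) {
        BitSet result = new BitSet();
        for (FileIndex f : files) {
            BitSet local = (BitSet) f.nonNullRows.clone();
            local.andNot(f.matchingRows);
            for (int i = local.nextSetBit(0); i >= 0; i = local.nextSetBit(i + 1)) {
                result.set(f.rowIdOffset + i);
            }
        }
        return result;
    }
}
```

Complementing per file against the non-null set means null rows never appear in a NOT result, and rows covered by other files are never pulled in by accident, which a single global complement could not guarantee.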
I ran focused tests locally:
mvn -pl paimon-common -Pfast-build -DfailIfNoTests=false -Dtest=LazyFilteredBitmapIndexReaderTest,SortedFileMetaSelectorTest,SortedIndexFileMetaTest,LazyFilteredBTreeIndexReaderTest,BTreeThreadSafetyTest test
# Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=BitmapGlobalIndexTableTest test
# Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
No further blockers from my side.
Summary
Add a new bitmap global index type for enum/tag-style scalar predicates. The index stores exact Roaring 64-bit row-id bitmaps per value, keeps per-file null/non-null row sets for correct complement semantics across multiple index files, and reuses the same sorted-index pruning/parallel-reading framework as BTree.
Changes
- […] GlobalIndexerFactory.
- […] value -> RoaringNavigableMap64(relativeRowId) with separate nullRows and nonNullRows bitmaps.
- […] 1 with Roaring bitmap blocks, compressible dictionary blocks, a compressible dictionary block index, and a fixed footer containing block handles, value count, version, and magic.
- […] bitmap-index.compression and bitmap-index.compression-level; bitmap value blocks keep Roaring's native serialization.
- […] 1.
- […] KeySerializer, SortedIndexFileMeta, SortedFileMetaSelector, ParallelFileGlobalIndexReader, and SortedFileGlobalIndexReader.
- […] IN, null checks, prefix predicates, range predicates, BETWEEN, NOT BETWEEN, and compound AND/OR predicates.
- […] bitmap-index.fallback-scan-max-size and btree-index.fallback-scan-max-size both defaulting to 256 mb.
- […] bitmap-index.dictionary-block-size to control dictionary block sizing; default is 16 kb.
- […] =, IN, string startsWith / LIKE 'prefix%', !=, NOT IN, IS NULL, and IS NOT NULL.
- […] endsWith, contains, general LIKE, range predicates, BETWEEN, and NOT BETWEEN.
- […] = NULL, != NULL, IN (..., NULL), and NOT IN (..., NULL).
Testing
mvn -pl paimon-common spotless:apply
mvn -pl paimon-common -Pfast-build -DfailIfNoTests=false -Dtest=SortedFileMetaSelectorTest test
mvn -pl paimon-common -Pfast-build -DfailIfNoTests=false -Dtest=LazyFilteredBitmapIndexReaderTest test
mvn -pl paimon-common -Pfast-build -DfailIfNoTests=false -Dtest=LazyFilteredBTreeIndexReaderTest test
mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=BitmapGlobalIndexTableTest test
git diff --check
~/.nvm/versions/node/v22.22.3/bin/node ./node_modules/.bin/docusaurus build (build generated successfully; existing site-wide /docs/master/concepts/overview broken-link warnings remain)