
[spark] Record the write operation type in snapshot properties#8236

Open: Zouxxyy wants to merge 4 commits into apache:master from Zouxxyy:xinyu/paimon-operation


@Zouxxyy (Contributor) commented Jun 14, 2026
Purpose

A Paimon snapshot only records the physical CommitKind (APPEND/COMPACT/OVERWRITE/...), not the logical operation that produced it — so an APPEND from INSERT INTO cannot be told apart from one produced by MERGE INTO.

This PR records the logical operation type in the snapshot properties map under the key operation. No format change — Snapshot already has a properties: Map<String, String> field.
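To illustrate the point, here is a minimal sketch of how a reader could tell two APPEND snapshots apart once the `operation` property is present. `SnapshotSketch` is a hypothetical stand-in written for this example and is not Paimon's actual `Snapshot` class; only the `operation` key and the `WRITE`/`MERGE` values come from this PR.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a snapshot; NOT Paimon's actual Snapshot class. */
final class SnapshotSketch {
    /** Property key named in this PR description. */
    static final String OPERATION_KEY = "operation";

    final String commitKind;              // physical kind, e.g. "APPEND"
    final Map<String, String> properties; // may be null on snapshots without properties

    SnapshotSketch(String commitKind, Map<String, String> properties) {
        this.commitKind = commitKind;
        this.properties = properties;
    }

    /** Returns the logical operation, or empty when none was recorded. */
    Optional<String> operation() {
        if (properties == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(properties.get(OPERATION_KEY));
    }

    public static void main(String[] args) {
        SnapshotSketch fromInsert = new SnapshotSketch("APPEND", Map.of(OPERATION_KEY, "WRITE"));
        SnapshotSketch fromMerge = new SnapshotSketch("APPEND", Map.of(OPERATION_KEY, "MERGE"));
        SnapshotSketch legacy = new SnapshotSketch("APPEND", null);

        // Same physical commit kind, different logical operation.
        for (SnapshotSketch s : new SnapshotSketch[] {fromInsert, fromMerge, legacy}) {
            System.out.println(s.commitKind + " / " + s.operation().orElse("(not recorded)"));
        }
    }
}
```

Snapshots written before this change carry no `operation` entry, so the sketch returns an empty `Optional` for them rather than guessing.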

Core: add InnerTableCommit#withCommitProperties(...), applied in TableCommitImpl so that the properties land on every snapshot the commit generates. This covers both the append and overwrite paths, because FileStoreCommitImpl sources snapshot properties from committable.properties().

Spark (both v1 and v2 write paths):

| SQL | operation |
| --- | --- |
| INSERT INTO | WRITE |
| INSERT OVERWRITE | OVERWRITE |
| DELETE | DELETE |
| UPDATE | UPDATE |
| MERGE INTO | MERGE |
| CREATE TABLE AS SELECT | CREATE TABLE AS SELECT |
| (CREATE OR) REPLACE TABLE AS SELECT | REPLACE TABLE AS SELECT / CREATE OR REPLACE TABLE AS SELECT |
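The mapping above can be sketched as a small lookup. This is an illustration only: `OperationLabelSketch`, the `SqlStatement` enum, and both method names are hypothetical and do not reflect how the PR implements the mapping in the Spark write paths. The label strings and the `operation` key are taken from the table and description.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical lookup from SQL statement kind to the recorded operation label. */
final class OperationLabelSketch {

    /** Statement kinds listed in the table; the enum itself is an assumption. */
    enum SqlStatement {
        INSERT_INTO,
        INSERT_OVERWRITE,
        DELETE,
        UPDATE,
        MERGE_INTO,
        CREATE_TABLE_AS_SELECT,
        REPLACE_TABLE_AS_SELECT,
        CREATE_OR_REPLACE_TABLE_AS_SELECT
    }

    /** Returns the label from the table for the given statement kind. */
    static String operationFor(SqlStatement statement) {
        switch (statement) {
            case INSERT_INTO:
                return "WRITE";
            case INSERT_OVERWRITE:
                return "OVERWRITE";
            case DELETE:
                return "DELETE";
            case UPDATE:
                return "UPDATE";
            case MERGE_INTO:
                return "MERGE";
            case CREATE_TABLE_AS_SELECT:
                return "CREATE TABLE AS SELECT";
            case REPLACE_TABLE_AS_SELECT:
                return "REPLACE TABLE AS SELECT";
            case CREATE_OR_REPLACE_TABLE_AS_SELECT:
                return "CREATE OR REPLACE TABLE AS SELECT";
            default:
                throw new IllegalArgumentException("Unknown statement: " + statement);
        }
    }

    /** Builds the single-entry properties map that would accompany a commit. */
    static Map<String, String> commitProperties(SqlStatement statement) {
        return Collections.singletonMap("operation", operationFor(statement));
    }

    public static void main(String[] args) {
        for (SqlStatement statement : SqlStatement.values()) {
            System.out.println(statement + " -> " + commitProperties(statement));
        }
    }
}
```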

Tests

Added SnapshotOperationTest (paimon-spark-ut) asserting the recorded operation for INSERT/OVERWRITE/UPDATE/DELETE/MERGE under both spark.paimon.write.use-v2-write=true and false, plus CTAS/RTAS.

Comment thread paimon-core/src/main/java/org/apache/paimon/table/sink/BatchTableCommit.java Outdated
…ELETE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JingsongLi (Contributor)

I think adding operation as a dedicated nullable field in Snapshot is a better direction than storing it in properties.
The parsing overhead should be negligible. Snapshot metadata is already read and deserialized as JSON, so one additional nullable string/enum field will not have meaningful performance impact compared with filesystem IO and manifest planning. With @JsonInclude(NON_NULL), old snapshots and snapshots without operation will not carry extra JSON size either.

Compatibility should also be fine:

  • Old snapshots do not have this field, so the new reader can treat it as null.
  • Older readers should ignore the new field because Snapshot already uses @JsonIgnoreProperties(ignoreUnknown = true).

I would suggest modeling it as a first-class nullable enum or string field, for example Snapshot.Operation, rather than putting it into properties. commitKind describes the physical snapshot change, while operation describes the logical user operation, so both feel like core snapshot metadata.

This would also avoid introducing a generic withCommitProperties API just for one standard field, and avoids potential conflicts around the "operation" property key.
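As a rough sketch of that suggestion, a dedicated nullable field could look like the following. Everything here is hypothetical: the class name, the `parseOperation` helper, and the subset of enum constants are assumptions made for illustration, and JSON handling is left out entirely so the example stays self-contained.

```java
/** Hypothetical snapshot with a first-class, nullable logical operation. */
final class SnapshotWithOperationSketch {

    /** Assumed subset of logical operations; the real set is up to the PR. */
    enum Operation { WRITE, OVERWRITE, DELETE, UPDATE, MERGE }

    final String commitKind;   // physical change, e.g. "APPEND"
    final Operation operation; // logical operation; null when not recorded

    SnapshotWithOperationSketch(String commitKind, Operation operation) {
        this.commitKind = commitKind;
        this.operation = operation;
    }

    /** Converts a raw, possibly absent value to the enum; null stays null. */
    static Operation parseOperation(String raw) {
        return raw == null ? null : Operation.valueOf(raw);
    }

    public static void main(String[] args) {
        // An old snapshot has no operation value, so the field is simply null.
        SnapshotWithOperationSketch old =
                new SnapshotWithOperationSketch("APPEND", parseOperation(null));
        SnapshotWithOperationSketch merged =
                new SnapshotWithOperationSketch("APPEND", parseOperation("MERGE"));

        System.out.println(old.commitKind + " / " + old.operation);
        System.out.println(merged.commitKind + " / " + merged.operation);
    }
}
```

Compared with a string entry in properties, a typed field cannot collide with a user-supplied "operation" key, at the cost of adding a field to the snapshot class.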
