Skip to content

feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes#3663

Open
bdchatham wants to merge 3 commits into
mainfrom
feat/shadow-replay-consensus
Open

feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes#3663
bdchatham wants to merge 3 commits into
mainfrom
feat/shadow-replay-consensus

Conversation

@bdchatham

@bdchatham bdchatham commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Makes the mock_chain_validation replay build a first-class, reviewed capability, plus two prod-reachable memIAVL state-sync restore fixes it depends on. The build replays real chain history (e.g. an AppHash-breaking memIAVL→flatKV migration) as a non-validating node, checked by an out-of-band logical-digest comparator instead of consensus.

Three self-contained commits: feat(consensus) (the policy), fix(memiavl) (the restore fixes), ci(ecr) (the image build).

Production behavior is unchanged

ConsensusPolicy is a build-tag-selected type whose swallowing variant compiles only under the mock tags. The default build returns every error unchanged, so every consensus halt is byte-for-byte preserved — confirmed at link level and by the test suites.

What the replay build relaxes

Under mock_chain_validation the policy swallows the consensus checks a replayed/migrated chain can't reproduce (app-state and validator-set/commit hashes). Peer-supplied block content still HALTS — the transaction merkle root and evidence integrity — so a malicious peer can't poison the audit input. A build-failing test guard enforces that boundary.

Two changes that warrant your explicit sign-off:

  1. ErrLastCommitVerify is swallowed, so buildLastCommitInfo builds best-effort commit info (approximate per-index pairing). LastCommitInfo feeds staking rewards/downtime, never EVM state.
  2. mock_block_validation widened to also swallow ErrUpgradeBeforeTrigger and the state-sync appHash check.

memIAVL restore (prod-reachable, not tag-gated)

Re-bootstrapping from an S3 snapshot exposed two restore bugs, both fixed: an interrupted/re-offered restore crashlooped (restore is now idempotent — stale-tmp cleanup, adopt-an-existing-snapshot, reject non-directories), and Close() leaked the import lock on error (now released on every path).

Review & test

Blinded multi-specialist review with an assigned dissenter, iterated to unanimous RESOLVED; all automated-review threads (Cursor, Codex, seidroid, ai-review) resolved. go build + go test pass under all three configs — default, mock_chain_validation, mock_block_validation — verified by a full sweep.

@cursor

cursor Bot commented Jun 29, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Changes consensus validation behavior under mock build tags and touches commit-info, blocksync, state-sync, and upgrade halt paths; production default policy is pass-through but mistakes in build tags or allowlists could weaken replay safety or leave prod paths altered.

Overview
Introduces a mock_chain_validation replay path where validation failures go through DefaultConsensusPolicy().HandleError so only mock-tagged binaries swallow selected sentinels; default builds still halt on every error.

Consensus / upgrade: New ErrUpgradeBeforeTrigger; upgrade BeginBlocker panics only when the policy returns the error. mock_chain_validation now uses an explicit swallow allowlist (includes ErrLastCommitVerify and upgrade-before-trigger) while peer content checks (ErrDataHash, evidence) still halt; mock_block_validation also swallows upgrade-before-trigger. buildLastCommitInfo no longer always panics on commit/valset size mismatch—it routes through the policy and builds best-effort vote info when swallowed. Blocksync commit verify, state-sync appHash verify, and related sites use the same HandleError pattern.

memIAVL: State-sync import/rewrite paths are idempotent on re-offered snapshots (stale tmp cleanup, adopt existing snapshot-<h>, clear stale current-tmp); MultiTreeImporter.Close always releases the import lock via defer.

Tests / CI: Halt-only tests moved behind //go:build constraints; mock-specific coverage added. ECR mock_chain_validation image drops mock_balances from build tags.

Reviewed by Cursor Bugbot for commit b88231b. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 29, 2026, 11:54 PM

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 42.10526% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.49%. Comparing base (9944128) to head (b88231b).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
sei-db/state_db/sc/memiavl/import.go 26.08% 13 Missing and 4 partials ⚠️
sei-db/state_db/sc/memiavl/db.go 0.00% 10 Missing and 1 partial ⚠️
sei-tendermint/internal/state/execution.go 50.00% 5 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3663      +/-   ##
==========================================
- Coverage   59.15%   58.49%   -0.66%     
==========================================
  Files        2262     2196      -66     
  Lines      187031   180333    -6698     
==========================================
- Hits       110632   105485    -5147     
+ Misses      66455    65474     -981     
+ Partials     9944     9374     -570     
Flag Coverage Δ
sei-chain-pr 74.92% <78.26%> (?)
sei-db 70.41% <ø> (ø)
sei-db-state-db ?
sei-db-state-db-pr 75.25% <17.64%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-cosmos/x/upgrade/abci.go 80.88% <100.00%> (+0.28%) ⬆️
sei-tendermint/internal/blocksync/reactor.go 68.01% <100.00%> (+0.17%) ⬆️
sei-tendermint/internal/statesync/syncer.go 68.40% <100.00%> (+0.37%) ⬆️
sei-tendermint/types/consensus_policy.go 100.00% <100.00%> (ø)
sei-tendermint/internal/state/execution.go 81.79% <50.00%> (-0.12%) ⬇️
sei-db/state_db/sc/memiavl/db.go 65.71% <0.00%> (-0.62%) ⬇️
sei-db/state_db/sc/memiavl/import.go 69.10% <26.08%> (-4.76%) ⬇️

... and 71 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-isolated, build-tag-gated ConsensusPolicy that leaves the default/production build byte-for-byte unchanged, plus two prod-reachable memIAVL state-sync restore fixes that look correct. No production blockers; a few shadow-build-scoped notes and a missing unit test are worth addressing.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Missing unit test for buildLastCommitInfo's newly-reachable best-effort path (commit/valset size mismatch under mock_chain_validation). The PR acknowledges this as a reviewer-sanctioned deferred follow-up; it's shadow-tag-only and doesn't touch the EVM oracle, but the bounds-checked loop is exactly the kind of off-by-one that a small table test would lock down.
  • Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output. Codex's review was present and its two findings are incorporated below.
  • mock_block_validation relaxation surface is widened by this PR in two places. ErrUpgradeBeforeTrigger is explicitly and intentionally added (documented). However the syncer.go ErrAppHash routing also silently extends mock_block_validation to swallow state-sync snapshot appHash mismatches — an undocumented side effect (see inline comment).
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-tendermint/internal/statesync/syncer.go
Comment thread sei-tendermint/types/consensus_policy_mock_chain_validation_test.go Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any bugs, but this PR introduces a new build-tag-selected ConsensusPolicy framework, swallows ErrLastCommitVerify under mock_chain_validation, gates buildLastCommitInfo's panic invariant on TolerateLastCommitMismatch(), and routes x/upgrade + state-sync + blocksync through the policy — even with default-build isolation, the consensus-safety surface and the two flagged semantic changes warrant a human review.

Extended reasoning...

Overview

The PR touches 16 files across consensus-critical and state-sync subsystems: a new build-tag-selected ConsensusPolicy framework in sei-tendermint/types (default / mock_block_validation / mock_chain_validation variants), routing call sites in x/upgrade BeginBlocker, blocksync reactor's last-commit verify, statesync syncer's appHash check, and execution.go's buildLastCommitInfo; plus prod-reachable memIAVL state-sync restore fixes (idempotent restore, defer-release of import flock). New sentinel ErrUpgradeBeforeTrigger added.

Security risks

The PR is consensus-relaxing by design under shadow tags. The default-build isolation is the load-bearing safety guarantee: ConsensusPolicy.HandleError returns every error unchanged, TolerateLastCommitMismatch() is false, and the swallow code is not linked into a default binary. Two production-reachable changes still need scrutiny: (1) buildLastCommitInfo no longer unconditionally panics on commit/valset size mismatch — the panic is now gated by a policy call (still always-panics in default, but the invariant moved from a hard guard to a policy contract); (2) the memIAVL restore fixes affect every node doing state-sync, not just shadow builds. No injection / auth / crypto surface changes.

Level of scrutiny

High. This is consensus-critical and security-sensitive code. Even with airtight build-tag isolation, the policy framework introduces a new contract (HandleError + TolerateLastCommitMismatch) that future call sites must use correctly, and the swallow-allowlist semantics under mock_chain_validation determine what a replay-shadow can mask. The author themselves flagged two semantic changes for explicit sign-off and acknowledged a deferred unit test for the newly-reachable best-effort path in buildLastCommitInfo.

Other factors

The PR has thorough test coverage for the policy matrix across all three build tags, a per-sentinel swallow-matrix test, wrapped-error and label-mapping tests, and the memIAVL restore changes are accompanied by file-existence/ENOTEMPTY handling and defer-based lock release. The description notes an independent multi-specialist review converged to unanimous resolution. Only cursor[bot] has commented; no human review yet. Given the scope and the consensus-relaxing semantics, a human consensus/security reviewer should give explicit sign-off.

@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from 165c9e3 to 641b257 Compare June 29, 2026 20:17
@bdchatham

Copy link
Copy Markdown
Contributor Author

Thanks — addressed the non-blocking items in the latest push (force-update to 641b2577d):

  • Missing buildLastCommitInfo test → added TestBuildLastCommitInfo_ToleratesCommitValSetMismatch (build-tagged mock_chain_validation): a commit shorter than the validator set builds best-effort CommitInfo (sized by the valset, present sig applied, absent slots not-signed) without panicking. Locks down the bounds-checked loop.
  • SwallowMatrix comment/messaging nit → the test now asserts HandleError matches the allowlist exactly with an explicit semantic guard that ErrDataHash/ErrEvidenceHash/ErrPerEvidenceValidateBasic are never in it; a not-allowlisted sentinel now fails with "want HALT (not in the swallow allowlist)", aligning the message with the halt-by-default design.
  • mock_block_validation widening → the PR body now explicitly calls out both surfaces: ErrUpgradeBeforeTrigger and the statesync/syncer.go ErrAppHash routing (consistent with mock_block's existing AppHash relaxation, flagged for transparency).

The empty cursor-review.md was a tooling artifact on your side (no output produced), not a finding.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-structured, thoroughly-tested change that makes the build-tag-gated mock_chain_validation replay-shadow policy first-class and lands two prod-reachable memIAVL state-sync restore fixes. Default/production behavior is provably unchanged (zero-sized, tag-selected ConsensusPolicy); the only substantive note is a defensive gap in the idempotent-restore path that adopts an existing snapshot-<h> path without confirming it is a directory.

Findings: 0 blocking | 6 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Inconsistency between the two memIAVL fixes: RewriteSnapshot (db.go) keeps os.Rename and only adopts the existing target on fs.ErrExist/ENOTEMPTY, so renaming onto a regular file (ENOTDIR) still falls through to the error/cleanup path — the safe behavior. Close (import.go) instead stat-checks first and unconditionally adopts any existing path. Consider aligning the two so a non-directory target is treated as an error in both, or at least document why the divergence is intentional.
  • REVIEW_GUIDELINES.md (pulled from the base branch) is empty/blank, so no repo-specific review standards could be applied in this pass.
  • The Cursor second-opinion file (cursor-review.md) is empty — that review pass produced no output. The OpenAI Codex pass did produce output (the import.go IsDir concern), which is incorporated here.
  • Codex noted it could not run tests because the Go 1.25.6 toolchain download was network-blocked in its environment; tests were not independently re-run here either. The PR claims go build/go test pass for default and both mock tags, plus memIAVL package tests — worth confirming CI shows green for all three build configurations.
  • Minor: the deferred buildLastCommitInfo best-effort test the PR description lists as a follow-up appears to already be included (execution_lastcommit_shadow_test.go); the description's 'deferred follow-up' note is now stale and could be updated.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/state_db/sc/memiavl/import.go Outdated
Comment thread sei-tendermint/types/consensus_policy_mock_block_validation.go
Comment thread .github/workflows/ecr.yml
@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from 641b257 to 7fe93ad Compare June 29, 2026 21:03

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-isolated change that makes the mock_chain_validation replay-shadow build first-class and lands two prod-reachable memIAVL state-sync restore fixes. Default-build behavior is preserved at every routed call site; the only notable finding is a shadow-build-only signature-attribution concern in buildLastCommitInfo that does not affect production or the audit oracle.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • REVIEW_GUIDELINES.md was empty/missing, so no repo-specific review standards could be applied.
  • The Cursor second-opinion review (cursor-review.md) was empty — that pass produced no output.
  • The PR explicitly defers a unit test for buildLastCommitInfo's best-effort path beyond the size-mismatch case (e.g. the by-address attribution behavior). Reviewers classified this low-risk/cuttable; reasonable to track as follow-up.
  • Default-build isolation, the load-bearing safety guarantee, rests on link-level exclusion of the swallow path and the build-tag test suites. Consider a CI check asserting the default binary does not link the swallowing ConsensusPolicy variants, to keep this guarantee from silently regressing.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-tendermint/internal/state/execution.go
Comment thread sei-db/state_db/sc/memiavl/db.go Outdated
Comment thread sei-tendermint/internal/state/execution.go Outdated
Comment thread sei-tendermint/types/consensus_policy.go Outdated
Comment thread sei-tendermint/internal/state/execution.go Outdated
@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from 7fe93ad to 74035bd Compare June 29, 2026 21:30
@bdchatham bdchatham changed the title feat(consensus): mock_chain_validation replay-shadow build + memIAVL state-sync restore fixes feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes Jun 29, 2026
Comment thread sei-db/state_db/sc/memiavl/db.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A carefully-scoped change that routes a fixed allowlist of consensus-validation failures through a build-tag-selected ConsensusPolicy (swallowed only under mock_chain_validation/mock_block_validation) plus two prod-reachable memIAVL state-sync restore fixes. The load-bearing guarantee — byte-for-byte unchanged production behavior — holds: every routed default-build call site retains its original halt. No blockers; only minor non-blocking notes.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • cursor-review.md is empty (Cursor produced no output) and codex-review.md reports no material findings; REVIEW_GUIDELINES.md is empty, so no repo-specific standards were applied. Noting per the review process.
  • Behavioral change to the mock_block_validation build: ErrUpgradeBeforeTrigger is now swallowed there too (consensus_policy_mock_block_validation.go). This is intentional per the PR description but widens that tag's relaxation set — worth explicit maintainer sign-off.
  • Good defensive testing: the mock_chain_validation swallow-matrix test asserts ErrDataHash/ErrEvidenceHash/ErrPerEvidenceValidateBasic are never in the allowlist, and the allowlist is halt-by-default for future sentinels.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-tendermint/internal/state/execution.go Outdated
@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from 74035bd to cb5fe23 Compare June 29, 2026 21:35

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-scoped, well-tested change that promotes the mock_chain_validation replay build to a reviewed capability and lands two prod-reachable memIAVL state-sync restore fixes; production default-build behavior is preserved (all routed call sites still halt) and the swallow allowlist is halt-by-default with a build-failing guard protecting peer-content integrity checks. No blocking issues found.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Both second-opinion passes produced no findings to merge: the Codex review states "No material findings" (and notes it could not run tests due to a Go 1.24 vs required 1.25.6 mismatch with no network access), and the Cursor review file is empty. REVIEW_GUIDELINES.md is also empty, so no repo-specific standards were applied.
  • memIAVL Close()/RewriteSnapshot adopt an existing snapshot- directory as complete without validating its contents. This is sound given the import flock serialization + atomic-rename invariant (the dir only exists if a prior Close fully renamed it), but it relies entirely on that invariant holding; worth a brief note for future maintainers who might reach these paths outside the flock-protected restore flow.
  • Under mock_chain_validation, ErrTooMuchEvidence is in the swallow allowlist while ErrEvidenceHash/ErrPerEvidenceValidateBasic correctly still halt. Swallowing the evidence-quantity bound is acceptable for a non-validating replay build whose oracle is the out-of-band digest comparator, but it is the one allowlist entry not strictly tied to migration/validator-set drift — confirm this is intentional.

@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from cb5fe23 to 4952ead Compare June 29, 2026 21:58

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4952ead. Configure here.

Comment thread sei-db/state_db/sc/memiavl/import.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A carefully scoped, well-tested change that makes the mock_chain_validation replay build first-class and lands two prod-reachable memIAVL state-sync restore fixes. Production behavior is byte-for-byte unchanged (the swallowing ConsensusPolicy compiles only under mock tags), and the memIAVL idempotency/lock-release fixes are correct. No blocking issues found.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Second-opinion passes produced no usable output: Cursor's review file (cursor-review.md) is empty, and Codex (codex-review.md) reported no findings but explicitly could not run the targeted Go tests because the sandbox couldn't download Go 1.25.6. This review therefore rests on the PR's own CI run and manual analysis; consider confirming the green CI run (default + both mock tags) before merge.
  • Behavioral widening worth an explicit sign-off (already called out in the PR description): routing the state-sync appHash check in syncer.go through HandleError means the mock_block_validation tag now also swallows ErrAppHash mismatches during state-sync verifyApp, not just block validation. This is consistent with that tag's existing ErrAppHash relaxation and affects only a mock build, but it is a real expansion of that tag's scope.
  • On the swallow path in statesync/syncer.go, a swallowed appHash mismatch continues silently with no log line (logging only happens on the halting branch). This is intentional per the design comment, but a single debug/info breadcrumb on the swallow path would aid post-mortem of a replay that diverged.

@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch 2 times, most recently from 08eaf03 to c9f9b42 Compare June 29, 2026 22:10
seidroid[bot]
seidroid Bot previously requested changes Jun 29, 2026

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock_chain_validation policy refactor and memIAVL restore fixes are well-structured and keep production behavior unchanged, but routing the state-sync appHash check through the policy leaves the untagged TestSyncer_verifyApp "invalid hash" case failing under both mock tags, and a safety-critical package doc is now stale.

Findings: 1 blocking | 3 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.
  • 1 blocking issue(s) flagged inline on specific lines.

Non-blocking

  • Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output, so its perspective is absent from this synthesis.
  • PR description claims go test passes for default, mock_chain_validation, and mock_block_validation, but the statesync package's TestSyncer_verifyApp would fail under both mock tags (see inline). Either the tagged statesync tests weren't actually run, or the claim should be scoped to compilation only — worth clarifying.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-tendermint/internal/statesync/syncer.go
Comment thread sei-tendermint/types/consensus_policy.go Outdated
Comment thread sei-cosmos/x/upgrade/abci.go Outdated
Comment thread sei-tendermint/internal/blocksync/reactor.go Outdated
Comment thread sei-tendermint/internal/statesync/syncer.go
Comment thread sei-tendermint/types/consensus_policy.go Outdated
@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from c9f9b42 to 1707513 Compare June 29, 2026 23:18

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-engineered, thoroughly-tested change that makes the mock_chain_validation replay build a first-class capability and lands two prod-reachable memIAVL state-sync restore fixes; default-build (production) behavior is preserved byte-for-byte at every routed call site, and no correctness or security issues were found. Verdict is neutral only because the two second-opinion passes added no usable signal (Cursor empty; Codex could not run tests) and a couple of minor non-blocking suggestions remain.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Cursor's second-opinion review file (cursor-review.md) is empty — that pass produced no output, so only this review plus Codex's (which also found nothing) cover the change.
  • Codex's review reported "No material findings" but explicitly could NOT run go test: its sandbox has Go 1.24.13 while the module requires 1.25.6 and the toolchain download was blocked. So Codex's pass did not exercise the test suites; rely on CI for the default + both-tag runs.
  • Two deliberate semantic relaxations are called out in the PR body and warrant explicit maintainer sign-off (not code defects): (a) ErrLastCommitVerify is now swallowed under mock_chain_validation, making buildLastCommitInfo build positionally-approximate LastCommitInfo; (b) mock_block_validation now also relaxes ErrUpgradeBeforeTrigger and the state-sync ErrAppHash routing.
  • Minor: the DefaultConsensusPolicy() constructor name is slightly misleading — under mock build tags it returns the swallowing policy, not a "default/production" one. Pre-existing naming, but worth a doc note since this PR adds more call sites.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/state_db/sc/memiavl/db.go
@seidroid seidroid Bot dismissed their stale review June 29, 2026 23:23

Superseded: latest AI review found no blocking issues.

@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch 2 times, most recently from aad1310 to 47eae02 Compare June 29, 2026 23:40

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR cleanly gates all consensus-relaxing behavior behind build tags — the default (production) build's HandleError provably returns every error unchanged, so prod halting semantics are byte-for-byte preserved, and tests are well-scoped per build. The one substantive finding is a narrow but prod-reachable idempotency gap in the memIAVL restore path (stale current-tmp not cleared), confirming Codex's P1.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Restore idempotency is still incomplete for the symlink swap step (Codex P1, valid): updateCurrentSymlink (db.go:1247) creates current-tmp via os.Symlink and never clears a stale one. If a prior restore crashed after creating current-tmp but before the atomic rename to current, a re-offer now adopts the existing snapshot-<h> dir and then fails in updateCurrentSymlink with EEXIST — re-introducing a crashloop the PR aims to eliminate. Window is narrow, but the path is prod-reachable (not tag-gated). Suggest os.Remove(currentTmpPath(dir)) before the symlink, or clearing it alongside the tmp-dir cleanup on entry. Affects both import.go Close() and db.go RewriteSnapshot adopt arms.
  • Cursor second-opinion review (cursor-review.md) produced no output; only Codex's single P1 was available to merge.
  • mock_block_validation's swallow set is widened to include ErrUpgradeBeforeTrigger, changing that existing tag's behavior. Intentional and documented, and it does not affect the default build, but worth an explicit reviewer sign-off since it alters an established build's semantics.
  • Confirmed no production behavior change: default ConsensusPolicy.HandleError returns err unchanged, and buildLastCommitInfo is behavior-identical when commit size matches the validator set (the normal path), so the rewrite only diverges under mock builds.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/state_db/sc/memiavl/import.go
Comment thread sei-tendermint/types/consensus_policy.go Outdated
@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from 47eae02 to 17a6e19 Compare June 29, 2026 23:51
…ay builds

Adds a build-tag-gated ConsensusPolicy ("mock_chain_validation") so a
non-validating node can replay real chain history despite the consensus
divergence inherent to replaying an AppHash-breaking storage migration
(memIAVL->flatKV) against a validator set it cannot reproduce bit-for-bit.

The default build is unchanged: DefaultConsensusPolicy().HandleError preserves
every panic/halt, verified byte-identical.

Under the mock_chain_validation tag the policy swallows an explicit ALLOWLIST of
the sentinels that drift for migration / validator-set / commit reasons. Peer-
supplied block-content integrity still HALTS -- ErrDataHash (the tx merkle root
feeding the EVM execution this build compares), ErrEvidenceHash,
ErrPerEvidenceValidateBasic -- so a malformed/lying peer cannot poison the audit
input. Allowlist => a sentinel added later halts by default.

Previously-unguarded checks route through the policy (inert in the default build),
all via the single HandleError(err) verb:
- blocksync/reactor.go: VerifyCommitLight -> policy.HandleError
- state/execution.go: buildLastCommitInfo routes a commit/validator-set size
  mismatch through the policy (default panics; mock_chain_validation builds
  best-effort commit info, which only feeds staking rewards/downtime)
- statesync/syncer.go: state-sync snapshot appHash check -> policy (also extends
  mock_block_validation, which already relaxes ErrAppHash)
- x/upgrade/abci.go: binary-updated-before-trigger panic -> policy

Tests: SwallowMatrix pins the allowlist + the never-swallow content-integrity
guard; TestBlockValidateBasic probes the active policy; a build-tagged test pins
buildLastCommitInfo's best-effort path (commit/valset size mismatch, no panic).
…every path

Re-bootstrapping a node from an S3 state-sync snapshot exposed two issues:
restore re-entry after an interrupted import crashlooped (now stat-first
idempotent -- clears the stale tmp dir on entry, adopts an already-completed
snapshot-<h>, and rejects a non-directory at that path), and Close released the
import flock only on the success path, so an error leaked it and a same-process
re-offer failed with ErrLocked (Close now releases via defer on every path). The
import flock serializes restore against any rewrite, so the existing-dir arms are
defensive idempotency, not race handling.
The mock_chain_validation image backs the migrate_evm replay node, which must
execute real chain history faithfully; mock_balances stubs balances (and panics
on pacific-1), corrupting execution. The standalone mock_balances image
(mock-<tag>) remains for benchmark chains that submit synthetic transactions.
@bdchatham bdchatham force-pushed the feat/shadow-replay-consensus branch from 17a6e19 to b88231b Compare June 29, 2026 23:53

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A carefully-scoped PR that makes the build-tag-gated mock_chain_validation replay policy first-class and lands two prod-reachable memIAVL state-sync restore fixes. Production (default-build) behavior is unchanged at the link level and the changes are well-tested across all three build configs; no blocking issues found.

Findings: 0 blocking | 4 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Second-opinion passes produced no actionable output: the Codex review reported "No material findings" (and noted it could not run the Go tests due to a Go 1.24 vs required 1.25.6 toolchain mismatch in its sandbox), and the Cursor review file was empty. Neither surfaced findings to merge.
  • buildLastCommitInfo's best-effort path under mock_chain_validation (swallowing ErrLastCommitVerify) produces positionally-approximate Signatures↔Validators pairing. This is documented and only affects the non-production replay build (LastCommitInfo feeds staking rewards/downtime, not EVM state), but it is a genuine semantic relaxation worth explicit reviewer sign-off as the PR itself flags.
  • The mock_block_validation swallow set is widened (ErrUpgradeBeforeTrigger + the state-sync ErrAppHash routing). Only affects mock builds, but it changes that tag's long-standing behavior — confirm this widening is intended for all consumers of mock_block_validation images.
  • In RewriteSnapshot/Close adoption arms, an already-present snapshot- directory is adopted without validating its contents; this is safe given os.Rename atomicity and import-flock serialization (a snapshot- dir only exists from a completed atomic rename), but relies on those invariants holding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant