
fix(relay): multi-pod subscription coherence (one access-gated fan-out path + cross-pod cache invalidation + REQ/COUNT DB guard) #1261

Merged: tlongwell-block merged 4 commits into main from relay-multipod-coherence on Jun 25, 2026

Conversation

@tlongwell-block

Collaborator

Relay multi-pod subscription coherence

After scaling the relay 1→2 pods, agents intermittently weren't subscribed to rooms they'd just been added to (stopping/respawning fixed it). Root cause: each pod keeps in-memory (moka) membership / accessible-channels / visibility caches with only a 10s TTL and no cross-pod invalidation. A membership change applied on the writer pod left the other pod serving stale is_member / accessible-channels for up to the TTL — denying valid subscriptions and returning empty create-channel readbacks.

Design doc: RESEARCH/RELAY_MULTIPOD_SUBSCRIPTION_GAP.md. One PR, three commits, in order.

Commit 1 — invariant: one access-gated EVENT fan-out path (53ee54e)

Establishes the load-bearing invariant for the whole fix: no relay-local channel-scoped EVENT delivery can bypass filter_fanout_by_access. A single helper (fan_out_event_to_local_subscribers) does fan_out → filter_fanout_by_access → serialize → send loop → drop-count. 8 raw callsites routed through it (net −61 lines: consolidated duplicate loops).

Audit of every production fan_out → send_to(EVENT) callsite (base 8568159):

| # | callsite | channel_id | kind(s) | before | action |
|---|----------|------------|---------|--------|--------|
| 1 | event.rs dispatch_persistent_event | Some/None | persistent | GATED ✓ | left (extra per-recipient DM-visibility gate) |
| 2 | event.rs fan_out_pubsub_event | Some/None | persistent | GATED ✓ | left (skips local echoes) |
| 3 | event.rs:600 ephemeral channel | Some | ephemeral | BYPASS ✗ | routed thru helper |
| 4 | event.rs:637 ephemeral global | None | 24134 | global no-op | routed thru helper (uniformity) |
| 5 | event.rs:829 agent-observer | None | 24200 | global no-op | routed thru helper |
| 6 | audio/handler.rs:793 | Some | audio lifecycle | BYPASS ✗ | routed thru helper |
| 7 | side_effects.rs:608 membership notif | None | custom #p | global no-op | routed thru helper |
| 8 | side_effects.rs:2153 ref-state | None | 30618 | global no-op | routed thru helper |
| 9 | side_effects.rs:2242 NIP-43 announce | None | 8000 #p | global no-op | routed thru helper |
| 10 | mesh_signaling.rs:319 call-me-now | None | 24622 #p | global no-op | routed thru helper |
| 11 | transport.rs:1116 ref-state git | None | 30618 | global no-op | routed thru helper |

req.rs historical delivery is REQ-response, not live fan-out — already gated by accessible_channels; handled by commit #3, not here. Safety: AUTHOR_ONLY_KINDS = [30300] only; none of the global callsite kinds are 30300, so routing them through is a true no-op today (verified buzz-core/src/kind.rs).
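As a rough illustration of the commit 1 pipeline, the sketch below composes subscription matching, the access filter, a single serialization, and a send loop that counts drops. Only the names fan_out, filter_fanout_by_access, fan_out_event_to_local_subscribers, and AUTHOR_ONLY_KINDS = [30300] come from this PR. The Relay and StoredEvent types, the integer ids, the plain membership set, the frame format, and the send callback are assumptions made to keep the example self-contained; the real filter very likely checks more than membership.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins; the relay's real types are not shown in this PR.
type ChannelId = u64;
type UserId = u32;

// From the PR: the only author-only kind today.
const AUTHOR_ONLY_KINDS: [u32; 1] = [30300];

struct StoredEvent {
    kind: u32,
    author: UserId,
    channel_id: Option<ChannelId>,
}

struct Relay {
    /// Live subscriptions as (subscriber, channel filter) pairs.
    subscriptions: Vec<(UserId, Option<ChannelId>)>,
    /// Current membership as (user, channel) pairs (assumed access model).
    members: HashSet<(UserId, ChannelId)>,
}

impl Relay {
    /// Subscription match only; performs no access check.
    fn fan_out(&self, ev: &StoredEvent) -> Vec<UserId> {
        self.subscriptions
            .iter()
            .filter(|(_, ch)| *ch == ev.channel_id)
            .map(|(user, _)| *user)
            .collect()
    }

    /// Channel-scoped events need current membership; channel-less events
    /// only pass through the author-only-kind gate.
    fn filter_fanout_by_access(&self, ev: &StoredEvent, recipients: Vec<UserId>) -> Vec<UserId> {
        recipients
            .into_iter()
            .filter(|user| match ev.channel_id {
                Some(ch) => self.members.contains(&(*user, ch)),
                None => !AUTHOR_ONLY_KINDS.contains(&ev.kind) || *user == ev.author,
            })
            .collect()
    }

    /// The single send path: match, gate, serialize once, send, count drops.
    /// `send` returns false when a frame could not be queued.
    fn fan_out_event_to_local_subscribers(
        &self,
        ev: &StoredEvent,
        mut send: impl FnMut(UserId, &str) -> bool,
    ) -> usize {
        let recipients = self.filter_fanout_by_access(ev, self.fan_out(ev));
        let frame = format!("[\"EVENT\",{{\"kind\":{}}}]", ev.kind);
        let mut dropped = 0;
        for user in recipients {
            if !send(user, &frame) {
                dropped += 1;
            }
        }
        dropped
    }
}
```

With this shape, a subscriber whose subscription outlived a membership removal still matches in fan_out but is filtered out before any frame is sent, which is the property the audit table is checking for each callsite.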

Commit 2 — coherence: cross-pod cache-key invalidation over Redis pub/sub (eb2b1ed)

A dedicated buzz:cache-invalidate Redis topic carries each cache-key drop to every pod immediately. The message is a pure cache-key drop, never an "evict these subscriptions" payload — the commit-1 access gate is the universal delivery-enforcement point, so dropping the stale key is sufficient (next read re-fetches authoritative DB state).

  • buzz-pubsub: CacheInvalidation enum (one variant per invalidate_* op), publish_cache_invalidation, subscribe_cache_invalidations, reconnecting subscriber loop mirroring run_subscriber.
  • state.rs: each public invalidate_* does the local moka drop and fire-and-forget spawns the matching publish — all ~13 call sites untouched. The cross-pod consumer calls private *_local drop variants via apply_cache_invalidation, so a received drop is never re-published (no fan-out loop).
  • main.rs: spawn the subscriber + a consumer loop mirroring the multi-node event fan-out consumer.

A missed publish degrades to the ≤10s TTL wait, backstopped by commit #3 — never a leak.
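The split between public invalidate_* methods and private *_local variants can be sketched roughly as follows. This is a simplified model, not the relay's code: a std mpsc Sender stands in for publishing to the buzz:cache-invalidate Redis topic, HashMaps stand in for the moka caches, only two hypothetical enum variants are shown, and the Pod type, field names, and id types are assumptions.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

// Hypothetical id types for illustration.
type ChannelId = u64;
type UserId = u32;

/// One variant per invalidate_* operation (only two assumed variants shown).
#[derive(Clone, Debug, PartialEq)]
enum CacheInvalidation {
    Membership { channel: ChannelId, user: UserId },
    AccessibleChannels { user: UserId },
}

struct Pod {
    membership: HashMap<(ChannelId, UserId), bool>,
    accessible: HashMap<UserId, Vec<ChannelId>>,
    /// Stand-in for the publish side of the Redis topic.
    bus: Sender<CacheInvalidation>,
}

impl Pod {
    fn new(bus: Sender<CacheInvalidation>) -> Self {
        Pod { membership: HashMap::new(), accessible: HashMap::new(), bus }
    }

    /// Public entry point: local drop, then fire-and-forget publish.
    pub fn invalidate_membership(&mut self, channel: ChannelId, user: UserId) {
        self.invalidate_membership_local(channel, user);
        // A failed publish is ignored; the TTL is the fallback.
        let _ = self.bus.send(CacheInvalidation::Membership { channel, user });
    }

    pub fn invalidate_accessible_channels(&mut self, user: UserId) {
        self.invalidate_accessible_channels_local(user);
        let _ = self.bus.send(CacheInvalidation::AccessibleChannels { user });
    }

    /// Private variants: drop the key and nothing else.
    fn invalidate_membership_local(&mut self, channel: ChannelId, user: UserId) {
        self.membership.remove(&(channel, user));
    }

    fn invalidate_accessible_channels_local(&mut self, user: UserId) {
        self.accessible.remove(&user);
    }

    /// Cross-pod consumer path: only *_local is called, so a received
    /// drop is never published again.
    fn apply_cache_invalidation(&mut self, msg: &CacheInvalidation) {
        match *msg {
            CacheInvalidation::Membership { channel, user } => {
                self.invalidate_membership_local(channel, user)
            }
            CacheInvalidation::AccessibleChannels { user } => {
                self.invalidate_accessible_channels_local(user)
            }
        }
    }
}
```

Because existing callers keep calling the public methods, they pick up the publish without changes, while the consumer loop can only reach the non-publishing variants.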

Commit 3 — guard: denial-path DB confirmation in REQ/COUNT (3011abd)

The backstop for the brief window before a TTL expires or an invalidation lands. accessible_channels is a per-request Vec built once from the cache and reused for subscription registration, historical delivery, search scope, and COUNT. On a cache-negative for the targeted channel, confirm membership against the DB uncached; on a verified positive, push ch_id into the Vec via the pure helper resolve_request_local_access. The confirmation is request-local-authoritative: one repair, and registration + historical + COUNT all see it — a stale negative can't stay sticky for the rest of the request.

Truth table (unit-tested, all three cases):

  • cache contains ch_id → allowed, no DB, no repair
  • cache-miss + DB member → allowed, ch_id pushed (repair)
  • cache-miss + DB non-member → denied, vector unchanged

Mirrored in count.rs through the same shared helper.
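A minimal sketch of the pure helper, following the parameter shape given in the commit message further down (`&mut Vec`, `ch_id`, `token_allows`, `Option<bool>`). The bool return type, the u64 channel id, and the gate_channel wrapper are assumptions; the wrapper uses a synchronous closure in place of the handler's async `db.is_member` call.

```rust
// Hypothetical stand-in for the relay's channel id type.
type ChannelId = u64;

/// Pure decision function. `db_is_member` is None when no DB lookup was made.
/// Returns whether the request may proceed for `ch_id` (return type assumed).
fn resolve_request_local_access(
    accessible: &mut Vec<ChannelId>,
    ch_id: ChannelId,
    token_allows: bool,
    db_is_member: Option<bool>,
) -> bool {
    // A scoped token is an upper bound: a DB-positive cannot widen it.
    if !token_allows {
        return false;
    }
    // Cache already contains the channel: allowed, no repair.
    if accessible.contains(&ch_id) {
        return true;
    }
    match db_is_member {
        // Verified member: repair the request-local vector once.
        Some(true) => {
            accessible.push(ch_id);
            true
        }
        // Verified non-member, or no confirmation: denied, vector unchanged.
        _ => false,
    }
}

/// Hypothetical caller: consults the DB only on a token-allowed cache miss.
fn gate_channel(
    accessible: &mut Vec<ChannelId>,
    ch_id: ChannelId,
    token_allows: bool,
    db_is_member: impl FnOnce() -> bool,
) -> bool {
    let needs_db = token_allows && !accessible.contains(&ch_id);
    let confirmed = if needs_db { Some(db_is_member()) } else { None };
    resolve_request_local_access(accessible, ch_id, token_allows, confirmed)
}
```

Since the repair mutates the same vector that registration, historical delivery, search scope, and COUNT read later in the request, one confirmed positive is visible to all of them.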

Verification

  • cargo build --workspace clean; cargo clippy -p buzz-pubsub -p buzz-relay clean.
  • cargo test -p buzz-relay -- --test-threads=1 → 367 passed (364 baseline + 3 new helper tests).
  • cargo test -p buzz-pubsub -- --ignored against live Redis → 6 passed, including test_cache_invalidation_roundtrip (publish on one manager, receive the exact CacheInvalidation on another's subscriber — actual cross-pod propagation).
  • DB truth the guard relies on (is_member after add_member) is covered by existing #[sqlx::test] membership tests in buzz-db/channel.rs.

Notes / out of scope

  • End-to-end handler harness: this crate has no AppState+DB test harness (all relay tests are in-memory; AppState::new needs Db+Redis+audit+pubsub+auth+search+workflow+keypair+S3 GitStore and spawns workers). Standing one up was disproportionate to a ~30-line diff and a fragile harness is its own risk. The request-local-repair invariant is proven by the pure-helper tests; DB truth by the existing buzz-db tests. (Reviewed with Perci.)
  • #1255 (fix(channels): poll relay read-back after create/update to fix metadata race) is noted as out of scope.
  • Pre-existing test-isolation debt (not introduced here): the 3 fanout_access tests can fail intermittently under a parallel run because config.rs tests set_var bogus BUZZ_BIND_ADDR/BUZZ_GIT_REPO_PATH that a concurrent Config::from_env() reads. Full suite passes single-threaded (367/367) and the scoped fanout tests pass parallel. Flagging, not fixing, in this PR.

npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 2 commits June 24, 2026 20:25
…d path

Establish the invariant that no relay-local EVENT delivery can bypass the
membership/access gate: a registered subscription is never sufficient for
delivery — delivery always revalidates access on the sending pod.

Introduce fan_out_event_to_local_subscribers(state, stored), which composes
fan_out() -> filter_fanout_by_access() -> send EVENT frames, and route every
live local fan-out callsite through it:

- ephemeral channel events (handle_ephemeral, was an ungated bypass)
- ephemeral channel-less / global events
- agent-observer frames (kind 24200)
- audio lifecycle events (was an ungated bypass)
- membership-notification, ref-state (30618), and NIP-43 (8000) side effects
- mesh call-me-now (24622)
- git push ref-state (30618) in the HTTP transport

The two previously-gated paths keep their inline filter call rather than the
helper: dispatch_persistent_event layers a per-recipient DM-visibility-owner
gate on top of the shared filter, and fan_out_pubsub_event additionally skips
local echoes. Both are equivalent to the helper plus their own extra step.

The ephemeral and audio paths were genuine pre-existing access-gate holes even
single-pod: a subscription surviving an open->private flip or a membership
removal could receive private-channel events. The global callsites are no-ops
through the gate today (filter_fanout_by_access only applies the author-only-
kind gate when channel_id is None, and none of these are author-only kinds),
routed through it so the single send path stays the universal enforcement point
and future channel-scoped paths cannot bypass it by accident.

filter_fanout_by_access and the new helper take &AppState rather than
&Arc<AppState> so the audio handler (which holds &AppState) can call the
helper; existing &Arc<AppState> callers deref-coerce with no other change.
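
The signature change relies on ordinary Rust deref coercion. A small standalone illustration, with a placeholder AppState and hypothetical function names that are not the relay's:

```rust
use std::sync::Arc;

// Placeholder; the real AppState holds far more than this.
struct AppState {
    name: &'static str,
}

/// Takes a plain reference, like the helper after this commit.
fn helper(state: &AppState) -> &'static str {
    state.name
}

/// A caller that only holds &Arc<AppState>: the argument deref-coerces
/// to &AppState at the call site, so no call-site edit is required.
fn caller_with_arc(state: &Arc<AppState>) -> &'static str {
    helper(state)
}

/// A caller that only holds &AppState, like the audio handler.
fn caller_with_ref(state: &AppState) -> &'static str {
    helper(state)
}
```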

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>

Each pod keeps in-memory (moka) membership / accessible-channels /
visibility caches dropped only on the pod that processed a write; other
pods relied on the 10s TTL to expire stale entries. After scaling 1->2
pods this surfaced as agents intermittently not subscribed to rooms they
were just added to (stale is_member / accessible-channels on the
non-writer pod).

Carry the same key drops to every pod immediately over a dedicated
'buzz:cache-invalidate' Redis pub/sub topic:

- buzz-pubsub: CacheInvalidation enum (one variant per invalidate_* op),
  publish_cache_invalidation, subscribe_cache_invalidations, and a
  reconnecting subscriber loop mirroring run_subscriber.
- state.rs: each public invalidate_* now does the local moka drop AND
  fire-and-forget spawns the matching publish, so all ~13 call sites stay
  untouched. The cross-pod consumer calls private *_local drop variants
  via apply_cache_invalidation, so a received drop is never re-published
  (no fan-out loop).
- main.rs: spawn the subscriber and a consumer loop mirroring the
  multi-node event fan-out consumer.

The message is a pure cache-key drop, never an 'evict these
subscriptions' payload: the per-event access gate from commit #1 is the
universal delivery-enforcement point, so dropping the stale key is
sufficient (next read re-fetches authoritative state from the DB). A
missed publish degrades to the <=10s TTL wait, backstopped by the REQ
denial-path DB confirmation in commit #3 -- never a leak.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
@tlongwell-block force-pushed the relay-multipod-coherence branch from 3011abd to d3be43d on June 25, 2026 00:56
…n REQ/COUNT

A REQ or COUNT targeting a specific channel gates on `accessible_channels`,
a per-request Vec built once from the 10s membership cache. On a multi-pod
relay this Vec can be stale on a non-writer pod: a member just added on the
pod that processed the write sees a cache-negative until the TTL expires or
the cross-pod invalidation (commit #2) lands. That manifested as the
create-channel readback coming back empty and agents not subscribed to
rooms they were just added to.

On a cache-negative, confirm membership against the DB uncached. On a
verified positive, repair the request-local Vec by pushing `ch_id` once,
via the pure helper `resolve_request_local_access`. The same Vec gates
subscription registration, historical delivery, search scope, and COUNT —
repairing it once makes all of them see the confirmed membership, not just
the denial branch. A stale negative can no longer stay sticky for the rest
of the request.

The repair runs UP FRONT in req.rs, right after the subscription channel_id
is extracted and before the NIP-50 search early-return — not in a late
denial branch. A search scoped to `#h=<just-added>` would otherwise be
scoped against the stale vector and false-miss; running the repair first
means `handle_search_req` sees the repaired vector too.

The helper takes a `token_allows` upper bound so a DB-positive can never
push a channel back in past a narrower scoped token: a token scoped to
channel A must not reach channel B merely because the user is a DB member
of B. Both call sites compute it from the token's `channel_ids`.

- req.rs: `resolve_request_local_access(&mut Vec, ch_id, token_allows, Option<bool>)`
  encodes the truth table (token-denies → denied no DB; cache-hit → allowed
  no DB; miss+DB-true → allowed & pushed; miss+DB-false → denied &
  unchanged) with unit tests for all four. The handler does the async
  `db.is_member` lookup only on a token-allowed miss.
- count.rs: mirrors the same flow through the shared helper, gated by the
  same token bound, and applies the scoped-token `retain` that REQ does
  (after `get_accessible_channel_ids_cached`) so a scoped token cannot COUNT
  out-of-scope channels via the no-channel-filter SQL pushdown either.
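
The token bound described here might look like the following, assuming (hypothetically) that a token carries an optional list of channel ids, with None meaning an unscoped token. The function names token_allows_channel and apply_token_scope and the u64 id type are illustrative only and do not come from the relay.

```rust
// Hypothetical stand-in for the relay's channel id type.
type ChannelId = u64;

/// Upper bound for one targeted channel: an unscoped token allows any
/// channel, a scoped token only the channels it lists.
fn token_allows_channel(token_channel_ids: Option<&[ChannelId]>, ch_id: ChannelId) -> bool {
    token_channel_ids.map_or(true, |scope| scope.contains(&ch_id))
}

/// Narrow the cached accessible set to the token's scope, so a request
/// without a channel filter cannot reach out-of-scope channels.
fn apply_token_scope(accessible: &mut Vec<ChannelId>, token_channel_ids: Option<&[ChannelId]>) {
    if let Some(scope) = token_channel_ids {
        accessible.retain(|id| scope.contains(id));
    }
}
```

Computing the bound separately from the DB confirmation keeps the ordering explicit: the token is checked first, so membership in channel B never matters for a token scoped to channel A.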

DB truth (`is_member` after `add_member`) is covered by the existing
#[sqlx::test] membership tests in buzz-db/channel.rs. End-to-end handler
coverage is noted in the PR: this crate has no AppState+DB test harness and
standing one up was out of scope; the request-local repair invariant is
proven by the pure-helper tests instead.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
@tlongwell-block force-pushed the relay-multipod-coherence branch from d3be43d to 2241648 on June 25, 2026 01:06
Co-authored-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
@tlongwell-block merged commit 6284454 into main on Jun 25, 2026
30 checks passed
@tlongwell-block deleted the relay-multipod-coherence branch on June 25, 2026 02:17