Skip to content

fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615

Open
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-missing-blocks-sg-active-wait
Open

fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-missing-blocks-sg-active-wait

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

De-flakes TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, observed on arm64): the first happy-path query — before any blocks are deleted — intermittently failed with server_error: server error: 500 at querier_test.go:367. This is a test-only change: one additional readiness wait plus a CHANGELOG.md line. No production code is touched, and the test's deliberate post-deletion 500 assertions are unchanged.

Root cause (decoded from the CI logs)

The querier logs the 500 response body gzipped inside the Response: field of its request log line. Unescaping the Go-quoted bytes and gunzipping them yields the same body in both failing runs (modulo block ULID):

{"status":"error","errorType":"server_error","error":"expanding series: failed to get store-gateway replication set owning the block 01KT4J55J7SX01JD4N5RD8F3WP: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095"}
  • Run 26832378915 (2026-06-02): 500 at 16:20:11.513, server latency 5.19ms, block 01KT4J55J7SX01JD4N5RD8F3WP.
  • Run 26632776611 (2026-05-30): 500 at 01:15:25.864, server latency 4.13ms, block 01KSV76AWZPKY9H51G4JCGTRYX.

The 4–5ms latency means the error is querier-local: the query never reached the store-gateway. The mechanism:

  1. The store-gateway registers in the ring with all 512 tokens in JOINING state (pkg/storegateway/gateway.go:461, OnRingInstanceRegister), runs InitialSync (gateway.go:325 — this is what drives cortex_bucket_store_blocks_loaded to 1), and only then CASes itself to ACTIVE (gateway.go:333).
  2. All three of the test's readiness waits are satisfiable while the querier still sees the store-gateway in JOINING: ring tokens are registered (and counted by cortex_ring_tokens_total) already at JOINING, and the other two waits are store-gateway-local metrics.
  3. The querier's BlocksRead ring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55); with IgnoreUnhealthyInstancesReplicationStrategy (pkg/querier/blocks_store_queryable.go:236) and the single store-gateway this test runs, a JOINING instance yields exactly at least 1 healthy replica required, could only find 0 - unhealthy instances: ... (pkg/ring/replication_strategy.go:93), wrapped at pkg/querier/blocks_store_replicated_set.go:127 into the decoded message above → HTTP 500.
  4. The race window is structural, not just "slow CI": the querier's consul WatchKey poll is rate-limited to 1 req/s by default (pkg/ring/kv/consul/client.go:79), so after consuming the JOINING registration update the querier learns about the ACTIVE flip ≥ ~1s later (longer under arm64 runner contention), while the test's three metric waits can all pass on their first poll (300–600ms backoff).

This is not a production bug: JOINING exists precisely to keep BlocksRead away from store-gateways that have not completed their initial blocks sync. The gap is purely in the test's readiness conditions — they assert producer-side (store-gateway) progress but never the consumer-side state (the querier's ring view) that the query path actually consults.

Correcting the original triage in #7605

The ResourceExhausted ... exceeded series limit ... limit 1 violated (got 2) store-gateway log lines quoted in the issue were misattributed: they appear at 16:20:23 — after this test had already failed (16:20:11) — and belong to the next test, TestQuerierWithBlocksStorageLimits, which deliberately sets -querier.max-fetched-series-per-query=1 (and passed). This test sets no series limit. Two more negatives ruled out while decoding: the "reuse grpc buffer in querier's store-gateway stream" change is unrelated (the failing query never reaches the store-gateway stream), and cortex_bucket_store_blocks_loaded == 1 is the correct expected count (cortex_ingester_shipper_uploads_total == 1 is asserted earlier in the test).

The fix

One wait inserted after the existing ring/blocks waits and before the first query, so the test waits until the querier's own view of the store-gateway ring shows 1 ACTIVE member:

	// Wait until the querier observes the store-gateway as ACTIVE in its view of the store-gateway
	// ring: the store-gateway registers as JOINING and switches to ACTIVE only after the initial
	// blocks sync, so the waits above can all pass while queries would still fail with
	// "at least 1 healthy replica required, could only find 0" (500). Keep after the tokens wait.
	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"),
	)))

Why this deterministically closes the window rather than just shrinking it:

  • The awaited cortex_ring_members{name="store-gateway-client",state="ACTIVE"} gauge is computed by updateRingMetrics from the same mutex-guarded ringDesc that GetClientsFor → ring.Get → Filter consults (pkg/ring/ring.go), so the wait condition is the exact negation of the failure condition.
  • ACTIVE is monotonic within this test (no store-gateway restarts/leaves), and the heartbeat period (15s) is far below the 1m heartbeat timeout, so the instance cannot become unhealthy between the wait and the query.
  • The wait sits after the cortex_ring_tokens_total == 512*2 wait, which guarantees the querier has already processed a non-empty store-gateway ring snapshot; updateRingMetrics zero-fills the per-state series (including ACTIVE) on that first snapshot (pkg/ring/ring.go:670-681), so the bare matcher form (no e2e.WaitMissingMetrics) cannot hit "metric not found" here — and if the waits are ever reordered, it fails loudly on the first poll instead of reintroducing a flake (hence the ordering note in the comment).

Precedent: commit 32bd46f (#5975) fixed this same store-gateway-not-yet-ACTIVE race with the identical inline wait in integration/backward_compatibility_test.go.

Why not other approaches

  • Shared helper in integration/util.go: this race is being fixed in two independent PRs (see Scope below); a helper introduced by either would couple their merge order. Three one-line call sites don't yet earn the abstraction, and the in-tree precedent (Fix flaky TestNewDistributorsCanPushToOldIngestersWithReplication test due SG not ready #5975) is inline. A helper can be promoted in a later sweep.
  • e2e.WaitMissingMetrics: provably unnecessary after the tokens wait (the ACTIVE series exists, zero-filled, from the first snapshot onward), and it would trade an immediate loud failure on misplacement for silent busy-polling.
  • Query retry / timeout bumps in the e2e client: would mask genuine first-query 500s — the rejected shape of Fix flaky integration tests: increase e2e timeouts and retry limits #7306.
  • Production changes (admit JOINING in BlocksRead, querier retry on 0 healthy replicas, register tokens only at ACTIVE): the observed behavior is the designed cold-start guarantee — changing it for a test race would weaken the consistency model during rollouts.

Scope

Which issue(s) this PR fixes

Fixes #7605

Checklist

  • Tests updated (the change is itself a test fix)
  • Documentation added — N/A (no flags or config changed; make doc not required)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan

  • gofmt -l / goimports -local github.com/cortexproject/cortex -l integration/querier_test.go — clean.
  • go vet -tags "slicelabels,integration,integration_querier" ./integration/ — clean.
  • go test -c -tags "slicelabels,integration,integration_querier" -o /dev/null ./integration/ — compiles.
  • The Docker-based run is exercised by CI (integration_querier job). Since the flake is a race, a flake-count run can't prove absence; the determinism argument above (the waited gauge and the query path read the same lock-guarded ring state) is the stability proof. Reviewer repro recipe (not run here; requires Docker): temporarily add -consul.watch-rate-limit=0.2 to the querier's flags to widen the JOINING-view window to ~5s — the unpatched test should then fail near-deterministically while the patched test passes. The knob is intentionally not part of this PR.

AI assistance disclosure (per GENAI_POLICY.md): the bulk of this contribution — CI log decoding, root-cause analysis, the test change, and this description — was produced with AI assistance and reviewed by the contributor.

🤖 Generated with Claude Code

…estQuerierWithBlocksStorageOnMissingBlocksFromStorage

The first happy-path query in this integration test intermittently
failed with a 500 on arm64 CI (issue cortexproject#7605). Decoding the gzipped 500
response body logged by the querier in both failing runs shows the
failure is querier-local and the query never reached the store-gateway:

  expanding series: failed to get store-gateway replication set owning
  the block <ULID>: at least 1 healthy replica required, could only
  find 0 - unhealthy instances: 172.18.0.8:9095

The store-gateway registers in the ring with all its tokens in JOINING
state (pkg/storegateway/gateway.go:461), runs the initial blocks sync
(which is what drives cortex_bucket_store_blocks_loaded to 1), and only
then switches to ACTIVE (gateway.go:333). The test's existing waits
(querier ring tokens 512*2, store-gateway ring tokens 512,
blocks_loaded 1) are therefore all satisfiable while the querier's view
of the store-gateway ring still holds the instance in JOINING, and the
BlocksRead ring operation admits ACTIVE instances only
(pkg/storegateway/gateway_ring.go:49-55), so the first query fails with
the error above (pkg/ring/replication_strategy.go:93, wrapped at
pkg/querier/blocks_store_replicated_set.go:127). The window is
structural: the querier's consul watch is rate-limited to 1 req/s
(pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip can
reach the querier's ring view over a second after the store-gateway
CASed it, while all three metric waits can pass on their first poll.

Close the race by additionally waiting until the querier exposes
cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1
before the first query - the same inline readiness wait used to fix the
identical race in backward_compatibility_test.go (cortexproject#5975, 32bd46f).
The waited gauge is computed from the same mutex-guarded ring
descriptor that the querier's GetClientsFor consults, so the wait
condition is the exact negation of the failure condition. The
deliberate post-deletion 500 assertions are unchanged.

Fixes cortexproject#7605

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/flaky-querier-missing-blocks-sg-active-wait branch from 73bc245 to 9101b6f Compare June 11, 2026 06:55
Comment thread CHANGELOG.md Outdated
…-wait

Signed-off-by: Charlie Le <charlie_le@apple.com>
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 23, 2026
@CharlieTLe CharlieTLe enabled auto-merge (squash) June 23, 2026 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/querier lgtm This PR has been approved by a maintainer size/XS type/flaky-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, arm64) — transient 500 on the pre-deletion query

3 participants