Skip to content

fix(detection): evict stale same-type findings after each scan#86

Merged
bakayolo merged 3 commits into
block:mainfrom
awolden:awolden/evict-stale-findings
Jun 10, 2026
Merged

fix(detection): evict stale same-type findings after each scan#86
bakayolo merged 3 commits into
block:mainfrom
awolden:awolden/evict-stale-findings

Conversation

@awolden

@awolden awolden commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Why

Resources deleted from AWS keep reappearing because the collector's worker-local store never evicts. SaveFindings upserts by ResourceID and nothing deletes, so a resource that drops out of the Wiz report stays in every snapshot until the pod restarts. Prod runs a pinned image with a long-lived pod, so lambdas deleted on 06-05 were re-emitted daily until the pod bounced.

What

  • Adds Store.ReplaceFindings(resourceType, findings): a scan's output atomically becomes the type's full set, evicting entries that dropped out of the inventory. Replaces the never-called timestamp-based DeleteStaleFindings, which would have coupled eviction correctness to an implicit clock contract between the activity and store implementations
  • StoreInput gains ResourceType, passed through from the detection workflow
  • An empty batch empties the type, so a fleet that drains to zero converges instead of re-emitting its last findings forever
  • A zero ResourceType (inputs recorded by in-flight workflows before the field existed) falls back to plain upserts, keeping today's behavior

@awolden awolden force-pushed the awolden/evict-stale-findings branch from f074001 to cd0f8b1 Compare June 10, 2026 21:12
@awolden awolden closed this Jun 10, 2026
@awolden awolden reopened this Jun 10, 2026
The worker-local store only upserted by ResourceID, so resources deleted
from the cloud persisted in every snapshot until the pod restarted, and
long-lived workers kept re-emitting findings for deleted resources.

- Add Store.ReplaceFindings(resourceType, findings): a scan's output
  atomically becomes the type's full set under one lock. Drops the
  never-called timestamp-based DeleteStaleFindings, which would have
  coupled eviction to an implicit clock contract between the activity
  and store implementations.
- StoreInput gains ResourceType, passed from the detection workflow.
- An empty batch empties the type, so a fleet that drains to zero
  converges instead of re-emitting its last findings forever.
- A zero ResourceType (inputs recorded by in-flight workflows before
  the field existed) falls back to plain upserts.
@awolden awolden force-pushed the awolden/evict-stale-findings branch from cd0f8b1 to da802d0 Compare June 10, 2026 21:13
@awolden awolden marked this pull request as ready for review June 10, 2026 21:17
@awolden awolden requested a review from a team as a code owner June 10, 2026 21:17
@bakayolo

Copy link
Copy Markdown
Collaborator

Added the single-scan safeguard on top of this safer per-resource-type eviction fix.

Additional behavior now in #86:

  • manual/HTTP scans use singleton workflow ID version-guard-active-scan
  • Temporal returns an already-running error for overlapping manual scans
  • HTTP maps that to 409 Conflict
  • schedules use the same workflow ID and explicit overlap SKIP
  • startup CLI hint now includes the singleton workflow ID

Verification:

  • go test ./pkg/scan ./pkg/schedule ./pkg/workflow/detection ./pkg/store/memory
  • make test
  • docker compose with live Wiz creds from .env: first full scan returned 202, immediate second scan returned 409, scan completed and persisted snapshot 26b2c2a2-3e82-4abc-bc9d-ff15146fd798 with 28,232 findings across 10 resource types.

@bakayolo

Copy link
Copy Markdown
Collaborator

Final singleton hardening added after the previous comment:

  • OrchestratorWorkflow now rejects executions whose workflow ID is not version-guard-active-scan, so direct Temporal starts with a random workflow ID fail before scanning.
  • HTTP/schedule paths already use version-guard-active-scan; HTTP overlap returns 409 Conflict, schedule overlap remains SKIP.

Final verification:

  • go test ./pkg/workflow/orchestrator ./pkg/scan ./pkg/schedule ./pkg/workflow/detection ./pkg/store/memory
  • make test
  • docker compose with live Wiz creds from .env: first full scan returned 202, immediate second scan returned 409, scan completed and persisted snapshot a3b54470-3cb4-434c-8a46-fdf1c5ad14f0 with 28,232 findings across 10 resource types.

@bakayolo bakayolo merged commit 3440507 into block:main Jun 10, 2026
6 of 7 checks passed
@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/workflow/detection/activities.go 66.66% 1 Missing and 2 partials ⚠️
Files with missing lines Coverage Δ
pkg/scan/handler.go 100.00% <100.00%> (ø)
pkg/scan/scan.go 100.00% <100.00%> (ø)
pkg/schedule/schedule.go 79.79% <100.00%> (ø)
pkg/store/memory/store.go 94.59% <100.00%> (ø)
pkg/workflow/detection/workflow.go 84.52% <100.00%> (ø)
pkg/workflow/orchestrator/workflow.go 80.46% <100.00%> (ø)
pkg/workflow/detection/activities.go 88.05% <66.66%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants