[ab-advisor] Experiment campaign for architecture-guardian: A/B test sub_agent_strategy #39062

@github-actions

🧪 Experiment Campaign: architecture-guardian

Workflow file: .github/workflows/architecture-guardian.md
Selected dimension: sub_agent_strategy
Triggered by: ab-testing-advisor on 2026-06-13


Background

architecture-guardian runs every weekday to detect structural violations (oversized files, large functions, excessive exports, import cycles) in recently changed Go and JavaScript files. Metrics are pre-computed in a shell pre-step and written to /tmp/gh-aw/agent/arch-metrics.json; the main agent then delegates threshold-comparison to a dedicated violation-classifier sub-agent (model: small) which returns a typed JSON categorization. Since the classification task is purely mechanical — comparing numeric fields against fixed thresholds already present in the pre-computed JSON — inlining this logic in the main prompt is a strong candidate for reducing AI credit spend without sacrificing detection quality.

Hypothesis

H0 (null): The single_agent variant does not change ai_credits_spent compared to the sub_agents baseline.

H1 (alternative): The single_agent variant reduces ai_credits_spent by ≥15% by eliminating the sub-agent model-call overhead, with no degradation in violation detection accuracy or run reliability.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter (after tracker-id:):

experiments:
  sub_agent_strategy:
    variants: [sub_agents, single_agent]
    description: "Test whether inlining violation classification in the main agent reduces AI credit spend without sacrificing detection accuracy"
    hypothesis: "H0: single_agent does not change ai_credits_spent vs sub_agents. H1: single_agent reduces ai_credits_spent by ≥15% by eliminating sub-agent model-call overhead"
    metric: ai_credits_spent
    secondary_metrics: [run_duration_ms, violation_count_delta]
    guardrail_metrics:
      - name: run_failure_rate
        direction: min
        threshold: 0.10
      - name: empty_output_rate
        direction: min
        threshold: 0.05
    min_samples: 30
    weight: [50, 50]
    start_date: "2026-06-13"
    issue: <fill-in-after-creation>

Variant descriptions:

  • sub_agents (baseline / current behavior): Main agent calls the violation-classifier sub-agent (model: small), which reads the metrics JSON, applies thresholds, and returns a typed JSON categorization.
  • single_agent: Main agent applies threshold rules directly, classifying violations inline without spawning a sub-agent. The violation-classifier agent definition remains in the file but is not invoked.

Workflow Changes Required

Only Step 2 of the main prompt body needs to change. Wrap the existing sub-agent call with a handlebars value-comparison conditional. The correct syntax is {{#if experiments.sub_agent_strategy == "<variant>" }}. Always compare against a specific variant value — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax.

Before (current Step 2):

 ## Step 2: Classify Violations by Severity

-Use the `violation-classifier` agent to read `/tmp/gh-aw/agent/arch-metrics.json` and return the categorized violation list. If it returns `{"noop": true}`, skip to the noop call in Step 3.

After (Step 2 with experiment conditional):

 ## Step 2: Classify Violations by Severity

+{{#if experiments.sub_agent_strategy == "single_agent" }}
+Read the metrics JSON already loaded in Step 1. Apply the following rules using the `thresholds` values directly:
+
+- **BLOCKER**: `import_cycles` non-empty → import cycle; `files[].lines > thresholds.file_lines_blocker` → oversized file
+- **WARNING**: `files[].lines > thresholds.file_lines_warning` → near-limit file; Go `func_data` entries with line count > `thresholds.function_lines` → oversized function
+- **INFO**: `files[].export_count > thresholds.max_exports` → excessive exports
+
+Build `blockers`, `warnings`, and `infos` arrays from this analysis and proceed to Step 3.
+{{else}}
 Use the `violation-classifier` agent to read `/tmp/gh-aw/agent/arch-metrics.json` and return the categorized violation list. If it returns `{"noop": true}`, skip to the noop call in Step 3.
+{{/if}}
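
The inline rules in the single_agent branch amount to a few numeric comparisons. A minimal Python sketch of that logic follows, for illustration only: the workflow expresses these rules as prompt text, and the `path`, `name`, and per-file `func_data` layout used here are assumptions about arch-metrics.json that this issue does not specify. The sketch also assumes that a file over the blocker threshold is reported once, as a blocker, rather than again as a warning.

```python
from typing import Any, Dict, List


def classify(metrics: Dict[str, Any]) -> Dict[str, List[str]]:
    """Apply the Step 2 threshold rules to the pre-computed metrics."""
    t = metrics["thresholds"]
    blockers: List[str] = []
    warnings: List[str] = []
    infos: List[str] = []

    # BLOCKER: any entry in import_cycles is an import cycle.
    for cycle in metrics.get("import_cycles", []):
        blockers.append(f"import cycle: {cycle}")

    for f in metrics.get("files", []):
        path = f.get("path", "<unknown>")  # hypothetical field name
        lines = f["lines"]

        # BLOCKER takes precedence over WARNING for the same file.
        if lines > t["file_lines_blocker"]:
            blockers.append(f"oversized file: {path} ({lines} lines)")
        elif lines > t["file_lines_warning"]:
            warnings.append(f"near-limit file: {path} ({lines} lines)")

        # WARNING: oversized functions (assumed to be nested per file).
        for fn in f.get("func_data", []):
            if fn["lines"] > t["function_lines"]:
                name = fn.get("name", "<anonymous>")  # hypothetical field name
                warnings.append(f"oversized function: {path}:{name}")

        # INFO: excessive exports.
        if f.get("export_count", 0) > t["max_exports"]:
            infos.append(f"excessive exports: {path} ({f['export_count']})")

    return {"blockers": blockers, "warnings": warnings, "infos": infos}
```

Because every rule is a comparison against a value already present in the JSON, the output should match between variants whenever the main agent follows the rules faithfully, which is what the violation_count_delta metric is meant to check.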

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| ai_credits_spent | Primary | ≥15% reduction in single_agent vs sub_agents |
| run_duration_ms | Secondary | Expected decrease (no sub-agent turn RTT) |
| violation_count_delta | Secondary | ≤5% difference between variants (accuracy parity) |
| run_failure_rate | Guardrail | Must not exceed 10% |
| empty_output_rate | Guardrail | Must not exceed 5% |
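
The targets above can be checked mechanically once per-variant aggregates are available. The sketch below is one possible reading, not the gh-aw runtime's own logic: the dictionary keys are hypothetical, and it assumes violation_count_delta is the relative difference in mean violation count between variants, which this issue does not define.

```python
from typing import Dict


def relative_change(baseline: float, variant: float) -> float:
    """Signed relative change of variant vs baseline (negative = reduction)."""
    if baseline == 0:
        return 0.0 if variant == 0 else float("inf")
    return (variant - baseline) / baseline


def check_targets(base: Dict[str, float], var: Dict[str, float]) -> Dict[str, bool]:
    """Compare single_agent aggregates (var) with sub_agents aggregates (base)."""
    credit_change = relative_change(base["mean_credits"], var["mean_credits"])
    violation_change = relative_change(base["mean_violations"], var["mean_violations"])
    return {
        # Primary: at least a 15% reduction in AI credits.
        "primary_met": credit_change <= -0.15,
        # Secondary: violation counts within 5% of each other.
        "accuracy_parity": abs(violation_change) <= 0.05,
        # Guardrails: thresholds taken from the table above.
        "failure_guardrail_ok": var["run_failure_rate"] <= 0.10,
        "empty_guardrail_ok": var["empty_output_rate"] <= 0.05,
    }
```

A comparison of means like this is only descriptive; whether the credit reduction is statistically significant is a separate question for the Mann-Whitney U test named in the statistical design.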

Statistical Design

  • Variants: sub_agents (baseline) vs single_agent
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 30
  • Expected run frequency: ~5 per week (weekdays, 14:00 UTC schedule)
  • Expected experiment duration: ~12 weeks (30 samples per variant ÷ ~2.5 non-noop runs/variant/week)
  • Analysis approach: Mann-Whitney U test (non-parametric; robust to skewed AI-credit distributions)
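
For the analysis step, a one-sided Mann-Whitney U test can be sketched with the Python standard library alone. This is an illustrative implementation using the normal approximation with tie and continuity corrections, not the analysis code used by the gh-aw runtime; the function name and argument order are assumptions. Where SciPy is available, `scipy.stats.mannwhitneyu(x, y, alternative="less")` performs the same test.

```python
from statistics import NormalDist
from typing import Sequence, Tuple


def mann_whitney_less(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U for x, one-sided p-value for 'x tends to be smaller than y')."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2

    # Pool the samples, remembering which group each value came from.
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign average ranks to tied values and accumulate the tie term.
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # U for x: number of (x, y) pairs where x is larger, ties counted as half.
    r1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # all values identical: no evidence either way
        return u1, 1.0

    # Lower-tail p-value with a 0.5 continuity correction.
    z = (u1 - mean_u + 0.5) / var_u**0.5
    return u1, NormalDist().cdf(z)
```

To test H1, pass the single_agent credit values as `x` and the sub_agents values as `y`; a small p-value then supports the claim that single_agent runs tend to spend fewer credits. The normal approximation is generally considered adequate at around 30 observations per group.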

Power analysis details

Assumptions:

  • Baseline AI credits ≈ 2,000–4,000 per run (main agent + small sub-agent call)
  • Coefficient of variation ≈ 30%
  • Minimum detectable effect: 15% reduction in mean credits
  • α = 0.05, power = 80%

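Under a normal approximation, a 15% reduction at a coefficient of variation of about 30% is a standardized effect of d ≈ 0.5, which needs roughly 63 observations per group for a two-sided test at α = 0.05 and 80% power; the Mann-Whitney U test needs slightly more. The planned 30 runs per variant therefore gives 80% power only for reductions of roughly 22% or larger, so treat 30 as a minimum and extend the campaign if the observed effect is closer to 15%. Given the weekday-only schedule and the possibility of noop runs (days with no Go/JS changes), instrument the experiment to count only non-noop runs toward the sample size.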
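
The sample-size arithmetic can be cross-checked with a short sketch. It uses the standard two-sample normal-approximation formula as a stand-in for the Mann-Whitney U test (an assumption; the rank test typically needs a few percent more observations) and a two-sided α. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(effect: float, cv: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Observations per group needed to detect a relative mean shift `effect`."""
    d = effect / cv  # standardized effect size (shift in units of the SD)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z / d) ** 2)


def detectable_effect(n: int, cv: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest relative mean shift detectable with `n` observations per group."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return cv * z * sqrt(2 / n)


required = n_per_group(effect=0.15, cv=0.30)
floor_at_30 = detectable_effect(n=30, cv=0.30)
print(f"runs per variant for a 15% effect: {required}")
print(f"detectable reduction at 30 runs per variant: {floor_at_30:.1%}")
```

Under these assumptions the first call gives 63 runs per variant, and the second shows that 30 runs per variant corresponds to a detectable reduction of about 22%. Any gap between these figures and the planned sample size indicates how much larger than 15% the true effect must be for the campaign to reach significance on schedule.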

Implementation Steps

  • Add experiments: section to frontmatter (after tracker-id:)
  • Add conditional block to Step 2 using {{#if experiments.sub_agent_strategy == "single_agent" }} (value-comparison form — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
  • Fill in the issue: field with this issue's number
  • Run gh aw compile architecture-guardian to regenerate lock file
  • Monitor the experiment state artifact (/tmp/gh-aw/agent/experiments/state.json) uploaded on each run
  • After 30 non-noop runs per variant (~12 weeks), analyze variant distribution via workflow run artifacts
  • Document findings and promote winning variant; remove losing conditional branch

Infrastructure Status

Experiment infrastructure is complete. A field-presence-checker run on 2026-06-13 confirmed analysis_type, tags, and notify are all present in both pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs and surfaced in run step summaries. No infrastructure sub-issue is needed.

Note: the notify field is schema-complete and rendered in step summaries, but actual notification dispatch (posting to discussions/issues when significance is reached) is not yet wired — this is a known gap that can be addressed separately if needed.

Generated by 🧪 Daily A/B Testing Advisor · 262.4 AIC · ⌖ 29.8 AIC · ⊞ 22.4K

  • expires on Jun 27, 2026, 3:16 AM UTC-08:00
