🧪 Experiment Campaign: architecture-guardian
Workflow file: .github/workflows/architecture-guardian.md
Selected dimension: sub_agent_strategy
Triggered by: ab-testing-advisor on 2026-06-13
Background
architecture-guardian runs every weekday to detect structural violations (oversized files, large functions, excessive exports, import cycles) in recently changed Go and JavaScript files. Metrics are pre-computed in a shell pre-step and written to /tmp/gh-aw/agent/arch-metrics.json; the main agent then delegates threshold-comparison to a dedicated violation-classifier sub-agent (model: small) which returns a typed JSON categorization. Since the classification task is purely mechanical — comparing numeric fields against fixed thresholds already present in the pre-computed JSON — inlining this logic in the main prompt is a strong candidate for reducing AI credit spend without sacrificing detection quality.
Hypothesis
H0 (null): The single_agent variant does not change ai_credits_spent compared to the sub_agents baseline.
H1 (alternative): The single_agent variant reduces ai_credits_spent by ≥15% by eliminating the sub-agent model-call overhead, with no degradation in violation detection accuracy or run reliability.
Experiment Configuration
Add the following experiments: block to the workflow frontmatter (after tracker-id:):
experiments:
  sub_agent_strategy:
    variants: [sub_agents, single_agent]
    description: "Test whether inlining violation classification in the main agent reduces AI credit spend without sacrificing detection accuracy"
    hypothesis: "H0: single_agent does not change ai_credits_spent vs sub_agents. H1: single_agent reduces ai_credits_spent by ≥15% by eliminating sub-agent model-call overhead"
    metric: ai_credits_spent
    secondary_metrics: [run_duration_ms, violation_count_delta]
    guardrail_metrics:
      - name: run_failure_rate
        direction: min
        threshold: 0.10
      - name: empty_output_rate
        direction: min
        threshold: 0.05
    min_samples: 30
    weight: [50, 50]
    start_date: "2026-06-13"
    issue: <fill-in-after-creation>
Variant descriptions:
sub_agents (baseline / current behavior): Main agent calls the violation-classifier sub-agent (model: small), which reads the metrics JSON, applies thresholds, and returns a typed JSON categorization.
single_agent: Main agent applies threshold rules directly, classifying violations inline without spawning a sub-agent. The violation-classifier agent definition remains in the file but is not invoked.
Workflow Changes Required
Only Step 2 of the main prompt body needs to change. Wrap the existing sub-agent call with a handlebars value-comparison conditional. The correct syntax is {{#if experiments.sub_agent_strategy == "<variant>" }}. Always compare against a specific variant value — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax.
Before (current Step 2):
## Step 2: Classify Violations by Severity
-Use the `violation-classifier` agent to read `/tmp/gh-aw/agent/arch-metrics.json` and return the categorized violation list. If it returns `{"noop": true}`, skip to the noop call in Step 3.
After (Step 2 with experiment conditional):
## Step 2: Classify Violations by Severity
+{{#if experiments.sub_agent_strategy == "single_agent" }}
+Read the metrics JSON already loaded in Step 1. Apply the following rules using the `thresholds` values directly:
+
+- **BLOCKER**: `import_cycles` non-empty → import cycle; `files[].lines > thresholds.file_lines_blocker` → oversized file
+- **WARNING**: `files[].lines > thresholds.file_lines_warning` → near-limit file; Go `func_data` entries with line count > `thresholds.function_lines` → oversized function
+- **INFO**: `files[].export_count > thresholds.max_exports` → excessive exports
+
+Build `blockers`, `warnings`, and `infos` arrays from this analysis and proceed to Step 3.
+{{else}}
Use the `violation-classifier` agent to read `/tmp/gh-aw/agent/arch-metrics.json` and return the categorized violation list. If it returns `{"noop": true}`, skip to the noop call in Step 3.
+{{/if}}
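To make the inline rules concrete, here is a minimal Python sketch of the same threshold logic. It is illustrative only: in the single_agent variant the main agent applies these rules from the prompt, not from code. The `path` key, the per-file placement and shape of `func_data` (a list of entries with `name` and `lines`), the choice to report a file only at its highest severity, and the `{"noop": true}` result when nothing is found are all assumptions, not confirmed details of arch-metrics.json.

```python
from typing import Any


def classify(metrics: dict[str, Any]) -> dict[str, Any]:
    """Apply the Step 2 threshold rules to an already-parsed metrics document."""
    t = metrics["thresholds"]
    blockers: list[dict[str, Any]] = []
    warnings: list[dict[str, Any]] = []
    infos: list[dict[str, Any]] = []

    # BLOCKER: every detected import cycle.
    for cycle in metrics.get("import_cycles", []):
        blockers.append({"kind": "import_cycle", "detail": cycle})

    for f in metrics.get("files", []):
        path, lines = f["path"], f["lines"]  # "path" is an assumed key

        # BLOCKER or WARNING on file size; assumption: report only the
        # highest severity so one file is not listed twice.
        if lines > t["file_lines_blocker"]:
            blockers.append({"kind": "oversized_file", "path": path, "lines": lines})
        elif lines > t["file_lines_warning"]:
            warnings.append({"kind": "near_limit_file", "path": path, "lines": lines})

        # WARNING: oversized functions (assumed per-file func_data entries).
        for fn in f.get("func_data", []):
            if fn["lines"] > t["function_lines"]:
                warnings.append(
                    {"kind": "oversized_function", "path": path,
                     "name": fn["name"], "lines": fn["lines"]}
                )

        # INFO: excessive exports.
        if f.get("export_count", 0) > t["max_exports"]:
            infos.append({"kind": "excessive_exports", "path": path,
                          "export_count": f["export_count"]})

    # Assumption: mirror the sub-agent's noop contract when nothing is found.
    if not (blockers or warnings or infos):
        return {"noop": True}
    return {"blockers": blockers, "warnings": warnings, "infos": infos}
```

Comparing the output of a function like this against both variants on the same metrics file is one way to spot-check accuracy parity during the experiment.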
Success Metrics
| Metric | Type | Target |
| --- | --- | --- |
| ai_credits_spent | Primary | ≥15% reduction in single_agent vs sub_agents |
| run_duration_ms | Secondary | Expected decrease (no sub-agent turn RTT) |
| violation_count_delta | Secondary | ≤5% difference between variants (accuracy parity) |
| run_failure_rate | Guardrail | Must not exceed 10% |
| empty_output_rate | Guardrail | Must not exceed 5% |
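The targets in the table can be checked mechanically once per-variant aggregates exist. The sketch below is one possible reading, not the gh-aw runtime's own evaluation: the `VariantStats` type and its fields are hypothetical, and treating violation_count_delta as the relative difference in mean violation counts is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Hypothetical per-variant aggregates collected over non-noop runs."""
    mean_credits: float
    mean_violations: float
    runs: int
    failed_runs: int
    empty_runs: int


def evaluate(baseline: VariantStats, candidate: VariantStats) -> dict[str, bool]:
    """Compare single_agent (candidate) against sub_agents (baseline)."""
    # Primary: relative reduction in mean AI credits.
    reduction = (baseline.mean_credits - candidate.mean_credits) / baseline.mean_credits

    # Secondary: relative difference in mean violation counts (assumed definition).
    if baseline.mean_violations == 0:
        violation_delta = 0.0 if candidate.mean_violations == 0 else float("inf")
    else:
        violation_delta = (
            abs(candidate.mean_violations - baseline.mean_violations)
            / baseline.mean_violations
        )

    return {
        "primary_met": reduction >= 0.15,
        "accuracy_parity": violation_delta <= 0.05,
        "failure_guardrail_ok": candidate.failed_runs / candidate.runs <= 0.10,
        "empty_guardrail_ok": candidate.empty_runs / candidate.runs <= 0.05,
    }
```

A point-estimate check like this only says whether the observed numbers clear the targets; whether the credit reduction is statistically significant is a separate question for the analysis step.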
Statistical Design
- Variants: sub_agents (baseline) vs single_agent
- Assignment: Round-robin via gh-aw experiments runtime (cache-based)
- Minimum runs per variant: 30
- Expected run frequency: ~5 per week (weekdays, 14:00 UTC schedule)
- Expected experiment duration: ~12 weeks (30 samples per variant ÷ ~2.5 non-noop runs/variant/week)
- Analysis approach: Mann-Whitney U test (non-parametric; robust to skewed AI-credit distributions)
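For reference, the Mann-Whitney U comparison named above can be sketched with the standard library alone. This version uses the normal approximation with tie and continuity corrections and returns a two-sided p-value; it is a teaching sketch, and an established implementation such as SciPy's `scipy.stats.mannwhitneyu` would be the safer choice for the real analysis. The function name and argument names are illustrative.

```python
import math


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (U statistic for x, two-sided p-value) via normal approximation."""
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign midranks so tied values share the average of their ranks.
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u1, 1.0  # all observations tied: no evidence of a difference

    # Continuity-corrected z score, clamped at zero.
    z = max(abs(u1 - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_two_sided = math.erfc(z / math.sqrt(2))
    return u1, p_two_sided
```

Here `x` would hold ai_credits_spent for the single_agent runs and `y` the values for the sub_agents runs, counting non-noop runs only.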
Power analysis details
Assumptions:
- Baseline AI credits ≈ 2,000–4,000 per run (main agent + small sub-agent call)
- Coefficient of variation ≈ 30%
- Minimum detectable effect: 15% reduction in mean credits
- α = 0.05, power = 80%
A two-sample Mann-Whitney U test requires approximately 28–32 observations per group at these parameters → 30 runs per variant is the conservative target. Given the weekday-only schedule and the possibility of noop runs (days with no Go/JS changes), instrument the experiment to count only non-noop runs toward the sample size.
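As a cross-check on the sample-size figure, the sketch below applies the usual normal-approximation formula for a two-sample t-test and then divides by 3/π, the asymptotic relative efficiency of the Mann-Whitney test under normality. Treating credits as roughly normal is itself an assumption (the test was chosen because the data may be skewed), and the function name and parameters are illustrative.

```python
import math
from statistics import NormalDist


def samples_per_group(
    effect_fraction: float,
    cv: float,
    alpha: float = 0.05,
    power: float = 0.80,
    two_sided: bool = True,
    are: float = 3 / math.pi,  # Mann-Whitney efficiency vs t-test under normality
) -> int:
    """Approximate observations needed per group to detect a mean shift."""
    # Standardized effect: shift as a fraction of the mean, over the CV.
    d = effect_fraction / cv
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_beta = std_normal.inv_cdf(power)
    n_t_test = 2 * (z_alpha + z_beta) ** 2 / d**2
    return math.ceil(n_t_test / are)
```

With the listed assumptions (15% effect, 30% coefficient of variation, α = 0.05, 80% power), this approximation gives 66 per group two-sided and 52 one-sided, noticeably above the 28–32 stated above. The stated figure may rest on inputs not listed here, such as a larger effect or lower variance, so it is worth re-deriving before committing to min_samples: 30.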
Implementation Steps
- Add the experiments: section to the frontmatter (after tracker-id:)
- Wrap Step 2 in {{#if experiments.sub_agent_strategy == "single_agent" }} (value-comparison form; never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
- Fill in the issue: field with this issue's number
- Run gh aw compile architecture-guardian to regenerate the lock file
- … /tmp/gh-aw/agent/experiments/state.json
Infrastructure Status
✅ Experiment infrastructure is complete. A field-presence-checker run on 2026-06-13 confirmed analysis_type, tags, and notify are all present in both pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs and surfaced in run step summaries. No infrastructure sub-issue is needed.
Note: the notify field is schema-complete and rendered in step summaries, but actual notification dispatch (posting to discussions/issues when significance is reached) is not yet wired. This is a known gap that can be addressed separately if needed.
References
- .github/workflows/architecture-guardian.md
Generated by 🧪 Daily A/B Testing Advisor · 262.4 AIC · ⌖ 29.8 AIC · ⊞ 22.4K · ◷