[ab-advisor] Experiment campaign for architecture-guardian: A/B test sub_agent_strategy

### 🧪 Experiment Campaign: architecture-guardian

**Workflow file**: `.github/workflows/architecture-guardian.md`
**Selected dimension**: `sub_agent_strategy`
**Triggered by**: `ab-testing-advisor` on 2026-06-13

---

### Background

`architecture-guardian` runs every weekday to detect structural violations (oversized files, large functions, excessive exports, import cycles) in recently changed Go and JavaScript files. Metrics are pre-computed in a shell pre-step and written to `/tmp/gh-aw/agent/arch-metrics.json`; the main agent then delegates threshold-comparison to a dedicated `violation-classifier` sub-agent (model: `small`) which returns a typed JSON categorization. Since the classification task is purely mechanical — comparing numeric fields against fixed thresholds already present in the pre-computed JSON — inlining this logic in the main prompt is a strong candidate for reducing AI credit spend without sacrificing detection quality.

### Hypothesis

**H0 (null)**: The `single_agent` variant does not change `ai_credits_spent` compared to the `sub_agents` baseline.

**H1 (alternative)**: The `single_agent` variant reduces `ai_credits_spent` by ≥15% by eliminating the sub-agent model-call overhead, with no degradation in violation detection accuracy or run reliability.

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter (after `tracker-id:`):

```yaml
experiments:
  sub_agent_strategy:
    variants: [sub_agents, single_agent]
    description: "Test whether inlining violation classification in the main agent reduces AI credit spend without sacrificing detection accuracy"
    hypothesis: "H0: single_agent does not change ai_credits_spent vs sub_agents. H1: single_agent reduces ai_credits_spent by ≥15% by eliminating sub-agent model-call overhead"
    metric: ai_credits_spent
    secondary_metrics: [run_duration_ms, violation_count_delta]
    guardrail_metrics:
      - name: run_failure_rate
        direction: min
        threshold: 0.10
      - name: empty_output_rate
        direction: min
        threshold: 0.05
    min_samples: 30
    weight: [50, 50]
    start_date: "2026-06-13"
    issue: <fill-in-after-creation>
```

**Variant descriptions**:
- `sub_agents` *(baseline / current behavior)*: Main agent calls the `violation-classifier` sub-agent (model: `small`), which reads the metrics JSON, applies thresholds, and returns a typed JSON categorization.
- `single_agent`: Main agent applies threshold rules directly, classifying violations inline without spawning a sub-agent. The `violation-classifier` agent definition remains in the file but is not invoked.

### Workflow Changes Required

Only **Step 2** of the main prompt body needs to change. Wrap the existing sub-agent call in a Handlebars value-comparison conditional of the form `{{#if experiments.sub_agent_strategy == "<variant>" }}`. Always compare against a specific variant value; never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax.

**Before** (current Step 2):

```diff
 ## Step 2: Classify Violations by Severity

 Use the `violation-classifier` agent to read `/tmp/gh-aw/agent/arch-metrics.json` and return the categorized violation list. If it returns `{"noop": true}`, skip to the noop call in Step 3.
```

**After** (Step 2 with experiment conditional):

```diff
 ## Step 2: Classify Violations by Severity

+{{#if experiments.sub_agent_strategy == "single_agent" }}
+Read the metrics JSON already loaded in Step 1. Apply the following rules using the `thresholds` values directly:
+
+- **BLOCKER**: `import_cycles` non-empty → import cycle; `files[].lines > thresholds.file_lines_blocker` → oversized file
+- **WARNING**: `files[].lines > thresholds.file_lines_warning` → near-limit file; Go `func_data` entries with line count > `thresholds.function_lines` → oversized function
+- **INFO**: `files[].export_count > thresholds.max_exports` → excessive exports
+
+Build `blockers`, `warnings`, and `infos` arrays from this analysis and proceed to Step 3.
+{{else}}
 Use the `violation-classifier` agent to read `/tmp/gh-aw/agent/arch-metrics.json` and return the categorized violation list. If it returns `{"noop": true}`, skip to the noop call in Step 3.
+{{/if}}
```
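
The inline rules in the `single_agent` branch are mechanical enough to express as ordinary code, which can help when spot-checking the agent's output against the same metrics file. The sketch below is only an illustration: the threshold names come from the rules above, but the exact JSON layout (a `path` field per file, `func_data` nested under each file with `name` and `lines`) is an assumption, as is the choice to report a file that exceeds the blocker limit only once rather than also as a warning.

```python
def classify(metrics: dict) -> dict:
    """Apply the Step 2 threshold rules to an arch-metrics document.

    Assumed (hypothetical) layout: metrics["thresholds"], metrics["import_cycles"],
    and metrics["files"][i] with "path", "lines", "export_count", "func_data".
    """
    t = metrics["thresholds"]
    blockers, warnings, infos = [], [], []

    # BLOCKER: any import cycle.
    for cycle in metrics.get("import_cycles", []):
        blockers.append({"kind": "import_cycle", "cycle": cycle})

    for f in metrics.get("files", []):
        path, lines = f["path"], f.get("lines", 0)

        # BLOCKER takes precedence over WARNING for the same file (assumption).
        if lines > t["file_lines_blocker"]:
            blockers.append({"kind": "oversized_file", "path": path, "lines": lines})
        elif lines > t["file_lines_warning"]:
            warnings.append({"kind": "near_limit_file", "path": path, "lines": lines})

        # WARNING: oversized functions (only present for Go files in the source rules).
        for fn in f.get("func_data", []):
            if fn["lines"] > t["function_lines"]:
                warnings.append({"kind": "oversized_function", "path": path,
                                 "name": fn["name"], "lines": fn["lines"]})

        # INFO: excessive exports.
        if f.get("export_count", 0) > t["max_exports"]:
            infos.append({"kind": "excessive_exports", "path": path,
                          "export_count": f["export_count"]})

    if not (blockers or warnings or infos):
        return {"noop": True}  # mirrors the sub-agent's noop signal
    return {"blockers": blockers, "warnings": warnings, "infos": infos}
```

Running both variants' classifications through a reference like this on the same `arch-metrics.json` is one way to ground the `violation_count_delta` comparison in something deterministic.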

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| `ai_credits_spent` | Primary | ≥15% reduction in `single_agent` vs `sub_agents` |
| `run_duration_ms` | Secondary | Expected decrease (no sub-agent round trip) |
| `violation_count_delta` | Secondary | ≤5% difference between variants (accuracy parity) |
| `run_failure_rate` | Guardrail | Must not exceed 10% |
| `empty_output_rate` | Guardrail | Must not exceed 5% |
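
To make the targets in this table concrete, the following sketch turns them into a single pass/fail check over collected runs. It is an illustration under assumptions, not the gh-aw analysis code: the `Run` record and its field names are hypothetical, `violation_count_delta` is interpreted as the relative difference in mean violation counts, and the guardrails are evaluated on the `single_agent` runs only. It checks the point estimates and does not replace the significance test.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    # Hypothetical per-run record; field names are illustrative only.
    credits: float
    violations: int
    failed: bool = False
    empty_output: bool = False


def evaluate(baseline: list[Run], treatment: list[Run]) -> dict:
    """Compare `single_agent` (treatment) against `sub_agents` (baseline)."""
    credit_reduction = 1 - mean(r.credits for r in treatment) / mean(r.credits for r in baseline)

    base_violations = mean(r.violations for r in baseline)
    treat_violations = mean(r.violations for r in treatment)
    # Relative difference in mean violation count; zero baseline handled explicitly.
    if base_violations == 0:
        violation_delta = 0.0 if treat_violations == 0 else float("inf")
    else:
        violation_delta = abs(treat_violations - base_violations) / base_violations

    failure_rate = sum(r.failed for r in treatment) / len(treatment)
    empty_rate = sum(r.empty_output for r in treatment) / len(treatment)

    checks = {
        "credits_reduced_15pct": credit_reduction >= 0.15,
        "violation_parity_5pct": violation_delta <= 0.05,
        "failure_rate_ok": failure_rate <= 0.10,
        "empty_output_rate_ok": empty_rate <= 0.05,
    }
    return {"credit_reduction": credit_reduction, "checks": checks,
            "success": all(checks.values())}
```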

### Statistical Design

- **Variants**: `sub_agents` (baseline) vs `single_agent`
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
- **Minimum runs per variant**: 30
- **Expected run frequency**: ~5 per week (weekdays, 14:00 UTC schedule)
- **Expected experiment duration**: ~12 weeks (30 samples per variant ÷ ~2.5 non-noop runs/variant/week)
- **Analysis approach**: Mann-Whitney U test (non-parametric; robust to skewed AI-credit distributions)
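
As an illustration of the analysis approach, here is a minimal standard-library sketch of a one-sided Mann-Whitney U test using the normal approximation with tie and continuity corrections. The function name and the choice of a one-sided alternative (credits in `single_agent` tend to be lower) are assumptions for this example; in practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would likely be preferable, especially for small samples where an exact p-value matters.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_less(treatment: list[float], baseline: list[float]) -> tuple[float, float]:
    """Return (U, p) for the alternative 'treatment tends to be smaller than baseline'.

    Uses average ranks for ties and a normal approximation for the p-value.
    """
    n1, n2 = len(treatment), len(baseline)
    pooled = sorted([(v, 0) for v in treatment] + [(v, 1) for v in baseline])

    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum - n1 * (n1 + 1) / 2  # U statistic for the treatment group

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all observations tied: no evidence either way

    z = (u - mean_u + 0.5) / sqrt(var_u)  # continuity correction, lower tail
    return u, NormalDist().cdf(z)
```

A small U with a p-value below α = 0.05 would support H1 for the primary metric; the practical-significance threshold (≥15% reduction) still has to be checked separately on the observed credit values.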

<details>
<summary>Power analysis details</summary>

Assumptions:
- Baseline AI credits ≈ 2,000–4,000 per run (main agent + small sub-agent call)
- Coefficient of variation ≈ 30%
- Minimum detectable effect: 15% reduction in mean credits
- α = 0.05, power = 80%

A two-sample Mann-Whitney U test requires approximately 28–32 observations per group at these parameters → **30 runs per variant** is the conservative target. Given the weekday-only schedule and the possibility of noop runs (days with no Go/JS changes), instrument the experiment to count only non-noop runs toward the sample size.

</details>
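
The sample-size figure can be sanity-checked with the textbook normal-approximation formula for comparing two means, optionally inflated by the Mann-Whitney asymptotic relative efficiency (3/π under normality). This is a rough sketch, not the calculation the advisor ran: the function name and parameters are illustrative, and the formula assumes roughly normal, equal-variance groups, which AI-credit data may not satisfy.

```python
from math import ceil, pi
from statistics import NormalDist


def runs_per_variant(cv: float, mde: float, alpha: float = 0.05, power: float = 0.80,
                     two_sided: bool = True, mann_whitney: bool = False) -> int:
    """Approximate runs needed per variant.

    cv  : coefficient of variation of credits per run (sd / mean)
    mde : minimum detectable relative reduction in mean credits
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    d = mde / cv  # standardized effect size
    n = 2 * (z_alpha + z_beta) ** 2 / d**2
    if mann_whitney:
        n /= 3 / pi  # asymptotic relative efficiency vs. the t-test
    return ceil(n)


# Inputs from the assumptions above: CV 30%, MDE 15%, alpha 0.05, power 80%.
print(runs_per_variant(cv=0.30, mde=0.15))                     # 63
print(runs_per_variant(cv=0.30, mde=0.15, mann_whitney=True))  # 66
# A lower-variance scenario for comparison.
print(runs_per_variant(cv=0.20, mde=0.15, mann_whitney=True))  # 30
```

Under this approximation the stated inputs give more than 30 runs per variant, while a coefficient of variation near 20% gives roughly 28 to 30. The 30-run target may therefore rest on a lower variance than the listed 30%, so it is worth re-running the calculation once real per-run credit data is available.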

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter (after `tracker-id:`)
- [ ] Add conditional block to Step 2 using `{{#if experiments.sub_agent_strategy == "single_agent" }}` (value-comparison form — never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax)
- [ ] Fill in the `issue:` field with this issue's number
- [ ] Run `gh aw compile architecture-guardian` to regenerate lock file
- [ ] Monitor experiment artifact uploaded per run to `/tmp/gh-aw/agent/experiments/state.json`
- [ ] After 30 non-noop runs per variant (~12 weeks), analyze variant distribution via workflow run artifacts
- [ ] Document findings and promote winning variant; remove losing conditional branch

### Infrastructure Status

> ✅ **Experiment infrastructure is complete.** A `field-presence-checker` run on 2026-06-13 confirmed `analysis_type`, `tags`, and `notify` are all present in both `pkg/workflow/compiler_experiments.go` and `actions/setup/js/pick_experiment.cjs` and surfaced in run step summaries. **No infrastructure sub-issue is needed.**
>
> Note: the `notify` field is schema-complete and rendered in step summaries, but actual notification dispatch (posting to discussions/issues when significance is reached) is not yet wired — this is a known gap that can be addressed separately if needed.

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/architecture-guardian.md`







> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/27464997809) · 262.4 AIC · ⌖ 29.8 AIC · ⊞ 22.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires  on Jun 27, 2026, 3:16 AM UTC-08:00

[ab-advisor] Experiment campaign for architecture-guardian: A/B test sub_agent_strategy #39062
