Benchmark comparison gates#804
Conversation
Implement repeated benchmark runs and the fedify bench compare command so CI can compare a head ref against a base ref on the same runner. The comparison path checks out temporary worktrees, starts each target sequentially, and fails when performance regressions exceed the configured tolerance plus the measured noise band. Publish benchmark report schema version 3 and comparison report schema version 1, including fixtures and schema drift guards for both formats. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Explain repeated benchmark runs, the fedify bench compare workflow, and the CI versus controlled-runner profile split. Record the CLI changelog entry for the comparison command and the report schema updates. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Keep benchmark target output away from stdout, enforce readiness timeouts, terminate Windows target process trees, and avoid infinite noise allowances in comparison reports. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Treat missing server-side measurements in any repeated benchmark run as missing from the aggregate so signature and queue expectations cannot pass on a partial subset of runs. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Show repeated run counts, total duration, and open-loop request volume in benchmark dry-run plans. Compare benchmark reports by scenario position plus name and type so duplicate scenario names cannot be matched to the wrong base result. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a bench "compare" mode, enables repeated scenario runs (default 3), aggregates per-run metrics into report-v3, computes noise-adjusted regressions, implements worktree/target orchestration and rendering, and updates schemas, fixtures, tests, docs, and changelog. ChangesBenchmark Compare and Repeated Runs
Sequence Diagram(s)sequenceDiagram
participant CLI
participant GitWorktree
participant TargetProcess
participant BenchRunner
participant ReportWriter
CLI->>GitWorktree: create worktree(base)
CLI->>GitWorktree: create worktree(head)
GitWorktree-->>CLI: worktrees ready
CLI->>TargetProcess: startBenchmarkTarget (base worktree)
TargetProcess-->>CLI: started (pid)
CLI->>CLI: waitReadyUrl(base.readyUrl)
CLI->>BenchRunner: defaultRunBenchInWorktree(base)
BenchRunner-->>CLI: BenchReport(base)
CLI->>TargetProcess: stopTargetProcess(base)
CLI->>TargetProcess: startBenchmarkTarget (head worktree)
CLI->>CLI: waitReadyUrl(head.readyUrl)
CLI->>BenchRunner: defaultRunBenchInWorktree(head)
BenchRunner-->>CLI: BenchReport(head)
CLI->>ReportWriter: buildCompareReport(base, head) -> render/write (json/markdown/text)
ReportWriter-->>CLI: written output
CLI->>GitWorktree: remove worktrees
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression testing by comparing base and head git revisions. It also adds support for repeated benchmark runs, aggregates metrics (such as median latency and pessimistic success rates), and updates the benchmark report schema to version 3 alongside a new comparison report schema. The review feedback highlights opportunities to improve robustness, specifically by adding defensive checks in metricValue to prevent null pointer exceptions, implementing a secondary timeout in stopTargetProcess to avoid indefinite hangs, and prioritizing abort signal reasons over transport-level errors in waitReadyUrl.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. What shall we delve into next? Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/cli/src/bench/command.test.ts`:
- Around line 108-120: Add a regression test that asserts a bare "bench compare"
subcommand without any flags or refs fails parsing to guard against parser
disambiguation; specifically, in the existing test file add a new test that
calls parse(benchCommand, [COMMAND, "compare"]) and asserts result.success is
false (using the same assert.ok(!result.success) style), ensuring you reference
the existing benchCommand, parse, and COMMAND symbols so this covers the bare
subcommand case.
In `@packages/cli/src/bench/compare/schema.ts`:
- Around line 83-106: The benchmarkReport schema currently marks "target" as
required but doesn't define its shape, allowing invalid values; update the
benchmarkReport.properties to include a "target" entry (e.g., "target": {
"$ref": "`#/`$defs/target" } or an appropriate explicit object/type schema) and
add a corresponding $defs.target definition (or refine the existing $defs) that
constrains allowed fields/types for the target object so null/number/other
invalid types are rejected; ensure the new target schema matches the project's
expected target fields and validation style used elsewhere (e.g., other $defs
like tool/environment) so the required "target" is properly typed and validated.
In `@packages/cli/src/bench/mod.ts`:
- Around line 7-11: Add a JSDoc comment above the exported runBench function
describing its purpose (dispatching to runBenchCompare or runBenchSuite based on
the command.mode), document the parameter (BenchCommand) including expected
properties used (e.g., mode values "compare" vs others), and specify the return
type Promise<void>; reference the exported symbol runBench so the comment is
placed immediately above that function and include any relevant tags like `@param`
and `@returns`.
In `@packages/cli/src/bench/result/schema.ts`:
- Around line 353-367: The schema for scenarioResult in reportSchemaV2 currently
adds "runCount" but leaves "runs" optional; update the scenarioResult required
list to include "runs" as well so that whenever runCount is present the runs
array is required — specifically modify the spread of
reportSchemaV2.$defs.scenarioResult.required inside the scenarioResult
definition to add "runs" (in addition to "runCount") so the properties runCount
and runs are both required in the scenarioResult schema.
In `@schema/bench/report-v3.json`:
- Around line 392-478: The schema allows runCount>1 without requiring per-run
data; update the "scenarioResult" schema so that when "runCount" is >= 2 the
"runs" property is required. Add an if/then at the same level as "properties" in
the "scenarioResult" definition: if {"properties":{"runCount":{"minimum":2}}}
then {"required":["runs"]}. Target the "scenarioResult" object and its
"runCount" and "runs" properties to enforce this invariant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 74e2c8ec-7ff9-44f4-8596-5b8b851aff26
📒 Files selected for processing (23)
CHANGES.mddocs/manual/benchmarking.mdpackages/cli/src/bench/__fixtures__/compare-reports/basic.jsonpackages/cli/src/bench/__fixtures__/reports/inbox-report.jsonpackages/cli/src/bench/action.test.tspackages/cli/src/bench/action.tspackages/cli/src/bench/command.test.tspackages/cli/src/bench/command.tspackages/cli/src/bench/compare.test.tspackages/cli/src/bench/compare.tspackages/cli/src/bench/compare/schema.tspackages/cli/src/bench/mod.tspackages/cli/src/bench/render/render.test.tspackages/cli/src/bench/result/build.test.tspackages/cli/src/bench/result/build.tspackages/cli/src/bench/result/model.tspackages/cli/src/bench/result/schema.tspackages/cli/src/bench/scenario/normalize.test.tspackages/cli/src/bench/scenario/normalize.tspackages/cli/src/bench/schema.test.tspackages/cli/src/bench/schemas.tsschema/bench/compare-report-v1.jsonschema/bench/report-v3.json
Bound target shutdown after SIGKILL, keep readiness timeout errors tied back to the abort reason, and make malformed comparison metrics fail instead of throwing. Tighten benchmark report schemas so embedded targets are typed and repeated reports must include per-run data. Add parser and public-export coverage requested in review. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
✅ Deploy Preview for fedify-json-schema ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new fedify bench compare command for CI-friendly performance regression gates, allowing sequential benchmark runs of base and head git revisions inside temporary worktrees. It also adds support for repeated benchmark runs (defaulting to 3) and aggregates their results (median latency/throughput, pessimistic success rate, merged histograms) into a new benchmark report schema version 3. The review feedback focuses on improving the robustness of the comparison and process management logic in packages/cli/src/bench/compare.ts. Specifically, the reviewer suggests matching scenarios by name rather than relying on exact ordering, cleaning up partial git worktrees on checkout failures to prevent leaks, and resolving target process termination immediately if the process ID is null to avoid hanging.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
packages/cli/src/bench/mod.ts (1)
7-7:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winExpand JSDoc to document parameters, return type, and dispatch behavior.
The JSDoc is too brief for a public API. It should document the
commandparameter (including the mode discriminator), thePromise<void>return, and clarify the dispatch behavior (compare vs. run modes).📝 Proposed expanded JSDoc
-/** Runs a parsed benchmark command. */ +/** + * Runs a parsed benchmark command. + * + * Dispatches to {`@link` runBenchCompare} when `command.mode === "compare"`, + * otherwise dispatches to the suite runner for `mode === "run"`. + * + * `@param` command The parsed benchmark command with mode discriminator. + * `@returns` A promise that resolves when the benchmark completes. + */ export function runBench(command: BenchCommand): Promise<void> {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/bench/mod.ts` at line 7, Replace the terse JSDoc comment above the benchmark entry function with a full docblock that documents the parameter `command` (describe its shape and the mode discriminator values, e.g., "run" vs "compare"), the return type (Promise<void>), and the dispatch behavior (explain that when command.mode === "compare" it dispatches comparison logic and when command.mode === "run" it executes a single benchmark run, including any side effects such as logging or process exit). Mention any thrown errors or rejection conditions and note that the function performs asynchronous work. Keep the docblock adjacent to the same function so the symbols (the `command` parameter and the function itself) are clearly described.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@packages/cli/src/bench/mod.ts`:
- Line 7: Replace the terse JSDoc comment above the benchmark entry function
with a full docblock that documents the parameter `command` (describe its shape
and the mode discriminator values, e.g., "run" vs "compare"), the return type
(Promise<void>), and the dispatch behavior (explain that when command.mode ===
"compare" it dispatches comparison logic and when command.mode === "run" it
executes a single benchmark run, including any side effects such as logging or
process exit). Mention any thrown errors or rejection conditions and note that
the function performs asynchronous work. Keep the docblock adjacent to the same
function so the symbols (the `command` parameter and the function itself) are
clearly described.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c2b9d19a-c0a4-4840-9225-3826a851781c
📒 Files selected for processing (9)
packages/cli/src/bench/command.test.tspackages/cli/src/bench/compare.test.tspackages/cli/src/bench/compare.tspackages/cli/src/bench/compare/schema.tspackages/cli/src/bench/mod.tspackages/cli/src/bench/result/schema.tspackages/cli/src/bench/schema.test.tsschema/bench/compare-report-v1.jsonschema/bench/report-v3.json
Document every published benchmark schema on the schema landing page, including the current scenario and report versions plus the comparison report schema. Keep the schema directory README aligned with the same published files and source locations. Assisted-by: Codex:gpt-5.5
Match scenarios by name, type, and occurrence so reordered suites do not produce false missing-scenario comparisons. Clean partial worktree registrations after failed checkout and treat child processes without a pid as already stopped. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression checks by comparing base and head git revisions. It also updates the benchmark report schema to version 3 and scenario schema to version 2 to support and aggregate repeated runs. Review feedback suggests gracefully handling cases where the base metric is zero in regressionRatio for 'higher-is-better' metrics, and ensuring that the Windows process tree is only force-killed with /F when the signal is SIGKILL to allow graceful shutdowns on SIGTERM.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b880cc7c4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Keep compare's derived target separate from the explicit CLI target signal, so unsafe overrides still require --target. Also handle zero throughput baselines as improvements and let Windows targets receive a graceful taskkill before forced cleanup. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
/gemini review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d2edea8be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Race target exits against readiness, make interrupted comparisons stop the active target and clean temporary worktrees, and reject cleanly if forced target termination throws. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression gates by comparing base and head git revisions. It also implements support for repeated benchmark runs (defaulting to three runs), aggregating their results (using median values for latency/throughput and pessimistic correctness results), and updates the benchmark report schema to version 3. Feedback on the changes suggests using a type assertion to cast fieldValues to readonly number[] in packages/cli/src/bench/result/build.ts to resolve a TypeScript compilation error under strict null checks.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41d38f42bc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Carry interrupt and target-exit cancellation through compare readiness, in-process benchmark runs, runner fetches, and load-generator sleeps so raced benchmark work does not continue after cleanup starts. fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command, which enables CI-friendly performance regression testing by comparing a base and head git revision on the same runner. It automates checking out revisions into temporary worktrees, starting target processes, waiting for readiness, and running the benchmark suite. Additionally, the PR implements support for repeated benchmark runs (defaulting to three) and aggregates their metrics (using median values for latency/throughput and pessimistic success rates) to ensure stable results. To support these features, the benchmark report schema has been updated to version 3, a new comparison report schema version 1 has been introduced, and abort signals have been integrated to cancel in-flight benchmark tasks. Since no review comments were provided, there is no further feedback to address.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. Nice work! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
2chanhaeng
left a comment
There was a problem hiding this comment.
Great work! It would be even better if you fixed these parts.
Reject ambiguous unitless max-regression values greater than one, so users who mean 15% do not accidentally configure a 1500% tolerance. Make fanout and remote-failure queue polling use the scenario clock and abort signal. Interrupted comparison runs can now stop pending drain polls instead of waiting until the queue drain timeout expires. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command, which enables CI-friendly performance regression testing by comparing a base and head Git ref sequentially in temporary worktrees. It also updates the benchmark report schema to version 3 to support repeated runs and introduces comparison report schema version 1. Additionally, abort signal handling has been integrated across benchmark runs to support cancellation. The review feedback highlights two critical issues: a potential unhandled promise rejection in the target process exit handler in compare.ts when stopping normally, and a bug in generator.ts where an undefined semaphore limit causes the benchmark to block indefinitely.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. Chef's kiss. Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Normal target shutdown can make the target exit promise reject after the comparison has already completed a side. Keep that exit from propagating as a late target failure while still aborting active work for unexpected process exits. fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression gates by comparing base and head git revisions. It also implements repeated benchmark runs (defaulting to three) and aggregates their results using median metrics and pessimistic correctness. Additionally, support for aborting in-flight benchmark tasks via AbortSignal has been added. Feedback on the changes highlights a critical bug in the semaphore implementation within packages/cli/src/bench/load/generator.ts, where an undefined max value causes open-loop benchmarks to hang indefinitely due to an incorrect comparison (count < undefined).
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. Another round soon, please! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
This adds
fedify bench compare, a CI-oriented command for checking benchmark regressions between two Git refs. The command checks out the base/head refs in temporary worktrees, starts each target with the same command, waits for a ready URL, runs the same benchmark suite, and reports whether the head result stays within the configured regression allowance.The comparison uses the explicit
--max-regressionvalue plus a measured noise band from repeated runs, so the gate can be stricter than a broad absolute threshold without failing on small local variance. When a scenario does not name performance metrics inexpect, the comparison falls back tolatency.p95andthroughputPerSec.This also makes repeated runs real in
fedify bench. Scenarios now run three times by default, aggregate latency and throughput with median values, keep correctness gates pessimistic, and preserve per-run data in the JSON report. Dry-run output now shows the run count, total scheduled duration, and open-loop request volume so users can review benchmark impact before sending load.The report format is versioned as benchmark report schema v3 and comparison report schema v1. The published schemas live in schema/bench/report-v3.json and schema/bench/compare-report-v1.json, with user documentation in docs/manual/benchmarking.md and a release note in CHANGES.md.
Refs #744. Fixes #786.