
[XPU] fix_same_req_id #8040

Open
cmcamdy wants to merge 1 commit into PaddlePaddle:develop from cmcamdy:fix_same_req_id

Conversation

@cmcamdy

@cmcamdy cmcamdy commented Jun 11, 2026

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 11, 2026


Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fab344e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/common_engine.py | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8040   +/-   ##
==========================================
  Coverage           ?   67.72%           
==========================================
  Files              ?      471           
  Lines              ?    66361           
  Branches           ?    10217           
==========================================
  Hits               ?    44946           
  Misses             ?    18546           
  Partials           ?     2869           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.79% <80.00%> (?) |
| XPU | 6.99% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-11 17:13:21

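📋 Review Summary

PR overview: Adds logic that rejects a duplicate request_id when preallocating resources on the PD decode side, and preserves the error reason returned by the D (decode) side.
Change scope: fastdeploy/engine/common_engine.py, fastdeploy/engine/sched/resource_manager_v1.py
Impact tags: [Engine] [Scheduler] [PD Disaggregation]

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/sched/resource_manager_v1.py:1596 | In cache-task mode, a duplicate request_id is treated as insufficient resources and retried, so P/D waits forever |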

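📝 PR Convention Check

The title uses the [XPU] tag, but this diff modifies the PD decode resource preallocation logic in Engine/Scheduler and does not touch the XPU-specific worker/model_runner/ops. The PR description is still the template placeholder and lacks concrete Motivation/Modifications/Usage/Accuracy Tests content. Consider replacing it with the complete content below.

Suggested title (can be copied directly):

  • [PD Disaggregation] Fix duplicate request id handling in decode preallocation
Suggested PR description (can be copied directly):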
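## Motivation
Fix an issue in P/D disaggregation where the Decode side, on receiving a duplicate request_id, could reuse or corrupt an existing KV cache.

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`: During Decode-side resource preallocation, check whether `request_id` already exists in `self.requests`; on a duplicate, set an error message and refuse the allocation.
- `fastdeploy/engine/common_engine.py`: When a resource preallocation failure is reported back to Prefill, keep the error reason already set by the Decode side instead of always overwriting it with `Not enough resources`.

## Usage or Command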
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

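Overall Assessment

The fix direction prevents the D side from reusing existing blocks for the same request_id, but permanent failures and temporary resource shortages currently share the same False return value, which leaves duplicate requests stuck in cache-task mode. The failure semantics should be split first, or, when error_msg is already set, the error should be sent back and the request removed from the queue.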
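One way to picture the suggested split of failure semantics is the toy sketch below. It is not FastDeploy code: `PreallocResult`, `ToyDecodeManager`, `Request`, and `drain` are hypothetical names invented for illustration, and only the error string comes from this review. The point is that a three-valued result lets the caller retry transient shortages while dropping permanent rejections from the queue.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class PreallocResult(Enum):
    """Hypothetical three-way result replacing a bare True/False."""
    OK = "ok"
    RETRY = "retry"    # transient: not enough free blocks right now
    REJECT = "reject"  # permanent: retrying can never succeed


@dataclass
class Request:
    request_id: str
    need_blocks: int
    error_msg: Optional[str] = None


@dataclass
class ToyDecodeManager:
    """Toy stand-in for a decode-side resource manager (assumption)."""
    free_blocks: int
    requests: Dict[str, Request] = field(default_factory=dict)

    def preallocate(self, req: Request) -> PreallocResult:
        # A duplicate id is a permanent failure: record why and reject.
        if req.request_id in self.requests:
            req.error_msg = "Duplicate request id in decode"
            return PreallocResult.REJECT
        # A block shortage is transient: the caller may retry later.
        if req.need_blocks > self.free_blocks:
            return PreallocResult.RETRY
        self.free_blocks -= req.need_blocks
        self.requests[req.request_id] = req
        return PreallocResult.OK


def drain(manager: ToyDecodeManager,
          queue: List[Request]) -> Tuple[List[Request], List[Request]]:
    """One pass over the waiting queue; returns (still_waiting, rejected)."""
    still_waiting: List[Request] = []
    rejected: List[Request] = []
    for req in queue:
        result = manager.preallocate(req)
        if result is PreallocResult.RETRY:
            still_waiting.append(req)  # keep for the next pass
        elif result is PreallocResult.REJECT:
            rejected.append(req)       # remove from queue, report error_msg
    return still_waiting, rejected
```

With this shape, only `RETRY` requests stay queued, so a duplicate id cannot loop forever; the rejected requests carry their `error_msg` and can be reported back to the prefill side.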

Comment thread fastdeploy/engine/sched/resource_manager_v1.py
@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 13, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-21 20:01:09

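The CI report is generated from the following code (updated every 30 minutes):
PR commit: cb98744 | Merge base: fab344e (branch: develop)


1 Required tasks: 9/10 passed

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 37 | 4 | 0 | 0 | 0 |

| Task | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |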

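2 Failure Details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: ci_analyze_unittest_fastdeploy
Failed cases: no failing pytest cases; the failure occurred at the diff coverage threshold check.

| Case | Error summary |
|---|---|
| diff-cover | PR diff coverage is below 80%; the workflow sets COVERAGE_EXIT_CODE to 9 |

Key logs:

[FAILURE]: Process completed with exit code 9.
.github/workflows/_unit_test_coverage.yml:254 diff-cover ... --fail-under=80 ... || COVERAGE_EXIT_CODE=9
.github/workflows/_unit_test_coverage.yml:387-404 exits with 9 when COVERAGE_EXIT_CODE=9
  • Root cause summary: the branches added by the PR do not meet the diff coverage threshold

TEST_EXIT_CODE=8 is what indicates a unit test failure; this failure summary shows exit code 9, which, per the workflow, corresponds to PR diff coverage below 80%. This PR adds/modifies the duplicate request_id rejection branch in preallocate_resource_in_d and the branch in common_engine.py that keeps an existing task.error_msg; the existing tests do not cover these new paths.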
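The exit-code reasoning above can be summarized in a small lookup. This is only an illustrative sketch: the function name `classify_exit_code` is hypothetical, the meanings of 8 and 9 are taken from this report, and treating 0 as success is the usual process convention rather than something quoted from the workflow.

```python
# Exit codes as described in this CI report (not read from the workflow file).
TEST_FAILURE_EXIT_CODE = 8      # unit tests failed
COVERAGE_FAILURE_EXIT_CODE = 9  # diff coverage below --fail-under=80


def classify_exit_code(code: int) -> str:
    """Map a job exit code to a coarse failure category (hypothetical helper)."""
    if code == 0:
        return "passed"  # assumption: conventional success code
    if code == TEST_FAILURE_EXIT_CODE:
        return "unit test failure"
    if code == COVERAGE_FAILURE_EXIT_CODE:
        return "diff coverage below threshold"
    return "unknown"
```

For this run, `classify_exit_code(9)` falls in the coverage category, which matches the conclusion that no pytest case failed.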

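Fix suggestions:

  1. In tests/v1/test_resource_manager_v1.py, near test_preallocate_resource_in_p_and_d, add a duplicate request_id scenario: first put a request with the same id into manager_d.requests, then call preallocate_resource_in_d and assert that it returns False and that request.error_msg == "Duplicate request id in decode".
  2. Add a unit test for the case where decode preallocation fails and task.error_msg is already set, asserting that common_engine.py no longer overwrites the original error with "Not enough resources" and instead passes the original error to send_cache_info_to_prefill.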
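The two suggested tests might look roughly like the sketch below. Because the real classes are not shown in this conversation, it uses stand-ins: `StubDecodeManager` and `report_prealloc_failure` are hypothetical and only mimic the behavior described in this review, and the assumption that `send_cache_info_to_prefill` receives a list of tasks is mine. In the actual repository the tests would exercise the real resource manager and engine code instead.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


class StubDecodeManager:
    """Hypothetical stand-in for the decode-side resource manager."""

    def __init__(self):
        self.requests = {}

    def preallocate_resource_in_d(self, request):
        # Mirrors the behavior described in the review: reject duplicates.
        if request.request_id in self.requests:
            request.error_msg = "Duplicate request id in decode"
            return False
        self.requests[request.request_id] = request
        return True


def report_prealloc_failure(task, send_cache_info_to_prefill):
    """Hypothetical stand-in for the engine-side failure path."""
    # Only fall back to the generic message when no reason was set.
    if not getattr(task, "error_msg", None):
        task.error_msg = "Not enough resources"
    send_cache_info_to_prefill([task])  # assumed call shape


def test_duplicate_request_id_is_rejected():
    manager_d = StubDecodeManager()
    first = SimpleNamespace(request_id="req-1", error_msg=None)
    manager_d.requests[first.request_id] = first

    dup = SimpleNamespace(request_id="req-1", error_msg=None)
    assert manager_d.preallocate_resource_in_d(dup) is False
    assert dup.error_msg == "Duplicate request id in decode"


def test_existing_error_msg_is_preserved():
    sender = MagicMock()
    task = SimpleNamespace(request_id="req-1",
                           error_msg="Duplicate request id in decode")
    report_prealloc_failure(task, sender)

    sent_task = sender.call_args.args[0][0]
    assert sent_task.error_msg == "Duplicate request id in decode"
```

Both functions follow pytest naming, so they would be collected automatically if adapted into the existing test module.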

Related changes: fastdeploy/engine/sched/resource_manager_v1.py, fastdeploy/engine/common_engine.py



3 participants