feat(metax): SigLIP vision encoder optimizations (+24% throughput)#8068
valorix25 wants to merge 1 commit into
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-23 00:18:01
📋 Review Summary
PR overview: Adds fused RoPE/GELU optimizations for PaddleOCR-VL/SigLIP on Metax, plus benchmark/server scripts
Scope of changes: PaddleOCR-VL model hot path, Metax custom op, input preprocessing thresholds, Metax runner, benchmark/build/run scripts
Impact tags: Models OP Metax DataProcessor Benchmark
Issues
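| Level | File | Summary |
|---|---|---|
| 🔴 Bug | custom_ops/metax_ops/apply_rope_qkv.cu:87 | The rotate-half Vec4 path only requires head_dim to be even; when head_dim is not a multiple of 8, a Vec4 crosses the half boundary and is computed incorrectly |
| 🔴 Bug | scripts/envSetup.sh:11 | When the new run scripts source this file under set -u, an unset LD_LIBRARY_PATH makes the shell exit immediately |
| 🟡 Suggestion | fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py:36 | The fused RoPE .so has no regular build/packaging entry point, so a default run silently falls back to the native path |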
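📝 PR Convention Check
The PR title does not use FastDeploy's official Tag/Cherry-Pick format, and the description lacks the Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist structure required by §D2.
Suggested title (can be copied directly):
[Cherry-Pick][Metax] Optimize SigLIP vision encoder for PaddleOCR-VL(#8068)
Suggested PR description (click to expand; can be copied directly)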
## Motivation
Optimize PaddleOCR-VL SigLIP vision encoder performance on Metax C500 and provide scripts for building/running the benchmark.
## Modifications
- Add a Metax fused RoPE dispatch path for PaddleOCR-VL SigLIP and enable fused `gelu_tanh` on Metax.
- Update the Metax `apply_rope_qkv` kernel to use rotate-half RoPE convention.
- Adjust PaddleOCR-VL image resize aspect-ratio limit from 200 to 300.
- Add PaddleOCR-VL FastDeploy benchmark configuration and multi-process benchmark script.
- Add Metax environment, server, benchmark, and standalone RoPE kernel build scripts.
## Usage or Command
bash scripts/build_rope_kernel.sh
bash scripts/run_server.sh
bash scripts/run_benchmark.sh
## Accuracy Tests
N/A (the current PR description/diff provides no accuracy-alignment results; the title only mentions +24% throughput)
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
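- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The direction of the implementation is clear, but the fused RoPE kernel does not sufficiently validate head_dim, and the new scripts fail to start in a common clean environment. In addition, fused RoPE is not part of the regular build path, so the optimization can silently stop taking effect in real runs. Please fix the issues above before merging.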
VecT rot_half_vec = {-qk_vec[1], qk_vec[0], -qk_vec[3], qk_vec[2]};
// Load the paired Vec4 from the other half of head_dim for rotate_half.
// pair_load = load_idx + half_dim if first half, load_idx - half_dim if second.
const int pair_load =
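🔴 Bug RotateQKVec4HalfStyle selects the paired offset and sign only once for the whole Vec4, based on head_dim_idx < half_dim, but the op entry point only checks head_dim % 2 == 0.
When a valid even head_dim is not a multiple of 8, a Vec4 can straddle the half_dim boundary. For example, with head_dim=10, half_dim=5, the chunk at head_dim_idx=4 contains elements from both the first and second halves, yet all of them take the first-half formula and read the paired chunk at load_idx + half_dim. The result disagrees with Python rotate_half() and may even misread boundary elements.
Suggested fix:
Before launch, require head_dim % 8 == 0 and otherwise fall back to the native/scalar path, or compute the paired index and sign per element in the kernel for chunks that cross the half boundary or fall in the tail.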
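The boundary condition in this comment can be checked with a small pure-Python sketch. It assumes the common rotate-half convention (`concat(-x2, x1)`) for the Python reference, which this PR page does not show, and every function name below is hypothetical rather than taken from the PR. It models only the index arithmetic, not the CUDA kernel itself.

```python
def rotate_half(x):
    """Reference rotate-half (assumed convention): concat(-x[half:], x[:half])."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def rotate_half_elementwise(x):
    """Same result, but choosing the paired index and sign per element."""
    n = len(x)
    half = n // 2
    out = []
    for i in range(n):
        if i < half:
            out.append(-x[i + half])  # first half pairs with i + half, negated
        else:
            out.append(x[i - half])   # second half pairs with i - half
    return out


def chunk_straddles_half(head_dim, vec_width=4):
    """True if some vec_width-wide chunk has half_dim strictly inside it."""
    half = head_dim // 2
    return any(
        start < half < start + vec_width
        for start in range(0, head_dim, vec_width)
    )


def can_use_vec4_path(head_dim):
    """Hypothetical launch guard matching the suggested head_dim % 8 == 0 check."""
    return head_dim % 8 == 0
```

With `head_dim=10` the chunk starting at index 4 covers indices 4 to 7 while `half_dim` is 5, so `chunk_straddles_half(10)` is true and a single per-chunk choice of offset and sign cannot be right for all four lanes. For even `head_dim`, the guard `head_dim % 8 == 0` is exactly the case in which no chunk straddles the boundary.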
export CUCC_CMAKE_ENTRY=2
export CUDA_PATH=${HOME}/cu-bridge/CUDA_DIR
export PATH=${CUDA_PATH}/bin:${MACA_PATH}/mxgpu_llvm/bin:${MACA_PATH}/bin:${CUCC_PATH}/tools:${CUCC_PATH}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_PATH}/lib64:${MACA_PATH}/lib:${MACA_PATH}/mxgpu_llvm/lib:$LD_LIBRARY_PATH
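🔴 Bug Both run_server.sh and run_benchmark.sh source this file under set -u, so expanding the unset LD_LIBRARY_PATH directly makes a clean shell exit here with an unbound variable error.
In the same initialization branch, line 4 also wraps pre_make in backticks, so on a fresh host the stdout of pre_make is executed as a command. Either problem blocks the first run of the new scripts.
Suggested fix: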
${MACA_PATH}/tools/cu-bridge/tools/pre_make
export LD_LIBRARY_PATH="${CUDA_PATH}/lib64:${MACA_PATH}/lib:${MACA_PATH}/mxgpu_llvm/lib:${LD_LIBRARY_PATH:-}"
# Fused RoPE kernel (rotate-half convention) compiled from apply_rope_qkv.cu.
# Replaces the Python fallback in SigLIP vision encoder (27 layers × 1 RoPE each,
# measured at 27.5% of vision encoder time — see optimization_log.md §0.1.1).
_ROPE_SO = os.path.join(os.path.dirname(__file__), "apply_rope_qkv_pd_.so")
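🟡 Suggestion This code assumes fastdeploy/model_executor/models/paddleocr_vl/apply_rope_qkv_pd_.so already exists, but the PR does not add metax_ops/apply_rope_qkv.cu to the Metax sources in custom_ops/setup_ops.py, and run_server.sh/run_benchmark.sh do not call scripts/build_rope_kernel.sh either.
As a result, after a normal install or when running the new scripts directly, _MACA_FUSED_ROPE_OK stays False and SigLIP RoPE still falls back to the Python/native path. The fused RoPE optimization described in the PR does not take effect, and because the exception is swallowed the regression is hard to notice.
Suggested fix:
Include this op in the regular Metax custom ops build/packaging flow, or have the new launch/benchmark scripts explicitly build and verify the .so. On load failure, at least log a warning so the performance fallback does not happen silently.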
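Overview
Optimizes PaddleOCR-VL throughput on the FastDeploy engine: profiling identified the prefill and visual-token processing bottlenecks, and the PR implements a custom MACA RoPE operator, gelu_tanh adaptation, batch size decoupling, and an eager dispatch cache, improving throughput by 24%.
Main optimizations
Performance comparison
Profiling screenshots
baseline (numbers may fluctuate across machines)

Optimized version
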