feat(metax): SigLIP vision encoder optimizations (+24% throughput)#8068

Open
valorix25 wants to merge 1 commit into PaddlePaddle:release/2.5 from valorix25:metax-siglip-optimizations

valorix25 commented Jun 22, 2026


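Overview

Optimizes PaddleOCR-VL throughput on the FastDeploy engine. Profiling identified Prefill and vision-token processing as the bottlenecks; this PR adds a custom MACA RoPE operator, gelu_tanh support, batch-size decoupling, and an eager dispatch cache, for a 24% throughput gain.

Main optimizations

  • Fused RoPE Kernel: a custom MACA RoPE operator (rotate-half convention) replaces the Python fallback, making the RoPE operation 1.45x faster
  • gelu_tanh support: adds native Metax support in siglip_ops.py, avoiding the Python fallback in the 27 MLP layers
  • Batch-size decoupling: a smaller batch_size creates more tasks, so the server is fed data more smoothly (+7.5%)
  • Eager Dispatch Cache: caches the platform dispatch result for RoPE and the activation function, removing repeated checks on the hot path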
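
The eager dispatch cache idea can be sketched as follows. This is a minimal illustration, not the PR's code: `is_metax_platform`, `PROBE_LOG`, `resolve_activation`, and the two `gelu_tanh_*` functions are hypothetical names, and the "fused" variant is a stand-in that computes the same tanh-approximated GELU in Python. The point is only that the platform check runs once and the chosen callable is reused on every later call.

```python
import math
from functools import lru_cache
from typing import Callable, List

# Records every platform probe so the caching effect is observable.
PROBE_LOG: List[str] = []


def is_metax_platform() -> bool:
    """Hypothetical platform probe (assumption: not the PR's real check)."""
    PROBE_LOG.append("probe")
    return False  # this sketch pretends it is not running on Metax


def gelu_tanh_native(xs: List[float]) -> List[float]:
    """Tanh approximation of GELU, computed element by element."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3))) for x in xs]


def gelu_tanh_fused(xs: List[float]) -> List[float]:
    """Stand-in for a fused device kernel; same math as the native path."""
    return gelu_tanh_native(xs)


@lru_cache(maxsize=None)
def resolve_activation(name: str) -> Callable[[List[float]], List[float]]:
    """Pick the implementation once per activation name and cache the result."""
    if name != "gelu_tanh":
        raise ValueError(f"unsupported activation: {name}")
    return gelu_tanh_fused if is_metax_platform() else gelu_tanh_native


def mlp_activation(xs: List[float]) -> List[float]:
    """Hot-path call: after the first lookup, no platform check is repeated."""
    return resolve_activation("gelu_tanh")(xs)
```

Calling `mlp_activation` once per layer across 27 layers triggers a single probe; the remaining calls hit the cache.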

Performance comparison

| Metric | Before | After | Gain |
|---|---|---|---|
| Throughput | 0.59 pages/sec | 0.715 pages/sec | +24% |
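
As a sanity check, the relative gain can be recomputed from the two throughput values in the table. The sketch below uses only those two figures; the function name is illustrative. With the rounded values as listed, the result is about +21.2%, which is lower than the +24% stated, so the table values may need to be refreshed to match the headline figure.

```python
def relative_gain_percent(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


baseline_pages_per_sec = 0.59    # "before" value from the table
optimized_pages_per_sec = 0.715  # "after" value from the table

gain = relative_gain_percent(baseline_pages_per_sec, optimized_pages_per_sec)
print(f"{gain:.1f}%")  # prints 21.2%
```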

Profiling screenshots

Baseline (numbers may fluctuate between machines)
[Image: Snipaste_2026-06-18_19-52-11]

Optimized version
[Image: Snipaste_2026-06-21_15-55-38]

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


valorix25 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

valorix25 force-pushed the metax-siglip-optimizations branch from 6b1d7de to b2f8fe0 June 22, 2026 15:46
valorix25 changed the title from "feat(metax): SigLIP vision encoder optimizations (+21% throughput)" to "feat(metax): SigLIP vision encoder optimizations (+24% throughput)" Jun 22, 2026

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-23 00:18:01

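📋 Review summary

PR overview: adds fused RoPE/GELU optimizations for PaddleOCR-VL/SigLIP on Metax, plus benchmark and server scripts

Scope of changes: PaddleOCR-VL model hot path, Metax custom op, input preprocessing threshold, Metax runner, benchmark/build/run scripts

Impact tags: Models, OP, Metax, DataProcessor, Benchmark

Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | custom_ops/metax_ops/apply_rope_qkv.cu:87 | The rotate-half Vec4 path accepts any even head_dim; when head_dim is not a multiple of 8, a chunk crosses the half boundary and is computed incorrectly |
| 🔴 Bug | scripts/envSetup.sh:11 | When the new run scripts source this file under set -u, an unset LD_LIBRARY_PATH makes the shell exit immediately |
| 🟡 Suggestion | fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py:36 | The fused RoPE .so has no regular build/packaging entry point, so a default run silently falls back to the native path |

📝 PR convention check

The PR title does not use the official FastDeploy Tag/Cherry-Pick format, and the description lacks the Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist structure required by §D2.

Suggested title (copy-ready):

  • [Cherry-Pick][Metax] Optimize SigLIP vision encoder for PaddleOCR-VL(#8068)
Suggested PR description (copy-ready):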
## Motivation
Optimize PaddleOCR-VL SigLIP vision encoder performance on Metax C500 and provide scripts for building/running the benchmark.

## Modifications
- Add a Metax fused RoPE dispatch path for PaddleOCR-VL SigLIP and enable fused `gelu_tanh` on Metax.
- Update the Metax `apply_rope_qkv` kernel to use rotate-half RoPE convention.
- Adjust PaddleOCR-VL image resize aspect-ratio limit from 200 to 300.
- Add PaddleOCR-VL FastDeploy benchmark configuration and multi-process benchmark script.
- Add Metax environment, server, benchmark, and standalone RoPE kernel build scripts.

## Usage or Command
    bash scripts/build_rope_kernel.sh
    bash scripts/run_server.sh
    bash scripts/run_benchmark.sh

## Accuracy Tests
N/A (the current PR description and diff provide no accuracy-alignment results; the title mentions only +24% throughput)

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

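Overall assessment

The direction of the implementation is clear, but the fused RoPE kernel does not sufficiently validate the allowed values of head_dim, and the new scripts fail to start in a typical clean environment. In addition, the fused RoPE op is not part of the regular build path, so the optimization can silently fail to take effect at runtime. Please fix these issues before merging.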

VecT rot_half_vec = {-qk_vec[1], qk_vec[0], -qk_vec[3], qk_vec[2]};
// Load the paired Vec4 from the other half of head_dim for rotate_half.
// pair_load = load_idx + half_dim if first half, load_idx - half_dim if second.
const int pair_load =

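🔴 Bug: RotateQKVec4HalfStyle chooses the paired offset and sign once for the whole Vec4, based only on head_dim_idx < half_dim, yet the op entry point validates only head_dim % 2 == 0.

When a valid even head_dim is not a multiple of 8, a Vec4 can straddle the half_dim boundary. For example, with head_dim=10 and half_dim=5, the chunk at head_dim_idx=4 contains elements from both the first and second halves, yet all of them take the first-half formula and read the paired chunk at load_idx + half_dim. The result disagrees with Python rotate_half() and can even misread elements at the boundary.

Suggested fix:
Before launch, require head_dim % 8 == 0 and otherwise fall back to the native/scalar path; alternatively, compute the paired index and sign per element in the kernel for chunks that straddle the half boundary or fall in the tail.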
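
The boundary problem and the proposed guard can be illustrated in plain Python. This is a sketch under assumptions: `VEC_WIDTH`, `straddling_chunks`, and `can_use_vec4_kernel` are hypothetical names, and `rotate_half` follows the common convention of concatenating the negated second half with the first half, which is assumed to match the Python reference mentioned above.

```python
from typing import List

VEC_WIDTH = 4  # assumption: the kernel processes head_dim in chunks of 4 elements


def rotate_half(x: List[float]) -> List[float]:
    """Reference rotate-half: [-second_half, first_half]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def straddling_chunks(head_dim: int, vec: int = VEC_WIDTH) -> List[int]:
    """Start indices of chunks that contain elements from both halves."""
    half = head_dim // 2
    return [s for s in range(0, head_dim, vec) if s < half < s + vec]


def can_use_vec4_kernel(head_dim: int, vec: int = VEC_WIDTH) -> bool:
    """Guard from the suggested fix: each half must be a whole number of chunks."""
    return head_dim % (2 * vec) == 0


if __name__ == "__main__":
    print(rotate_half([1.0, 2.0, 3.0, 4.0]))  # [-3.0, -4.0, 1.0, 2.0]
    print(straddling_chunks(10))              # [4]
    print(can_use_vec4_kernel(10))            # False
```

With `head_dim=10` the chunk starting at index 4 covers indices 4 to 7 and so mixes both halves, matching the example in the comment; whenever `head_dim % 8 == 0`, no chunk straddles the boundary, so a single per-chunk offset and sign is safe.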

Comment thread scripts/envSetup.sh
export CUCC_CMAKE_ENTRY=2
export CUDA_PATH=${HOME}/cu-bridge/CUDA_DIR
export PATH=${CUDA_PATH}/bin:${MACA_PATH}/mxgpu_llvm/bin:${MACA_PATH}/bin:${CUCC_PATH}/tools:${CUCC_PATH}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_PATH}/lib64:${MACA_PATH}/lib:${MACA_PATH}/mxgpu_llvm/lib:$LD_LIBRARY_PATH

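🔴 Bug: run_server.sh and run_benchmark.sh both source this file under set -u, so expanding the unset LD_LIBRARY_PATH directly makes a clean shell exit here with an unbound variable error.

In the same initialization branch, line 4 also wraps pre_make in backticks, so on a fresh host the stdout of pre_make is executed as a command. Both issues block the first run of the new scripts.

Suggested fix: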

${MACA_PATH}/tools/cu-bridge/tools/pre_make
export LD_LIBRARY_PATH="${CUDA_PATH}/lib64:${MACA_PATH}/lib:${MACA_PATH}/mxgpu_llvm/lib:${LD_LIBRARY_PATH:-}"

# Fused RoPE kernel (rotate-half convention) compiled from apply_rope_qkv.cu.
# Replaces the Python fallback in SigLIP vision encoder (27 layers × 1 RoPE each,
# measured at 27.5% of vision encoder time — see optimization_log.md §0.1.1).
_ROPE_SO = os.path.join(os.path.dirname(__file__), "apply_rope_qkv_pd_.so")

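🟡 Suggestion: this code relies on fastdeploy/model_executor/models/paddleocr_vl/apply_rope_qkv_pd_.so already existing, but the PR does not add metax_ops/apply_rope_qkv.cu to the Metax sources in custom_ops/setup_ops.py, and run_server.sh/run_benchmark.sh do not call scripts/build_rope_kernel.sh.

As a result, with a normal install or when running the new scripts directly, _MACA_FUSED_ROPE_OK stays False and SigLIP RoPE still falls back to the Python/native path. The fused RoPE optimization described in the PR does not take effect, and because the exception is swallowed, this is hard to notice.

Suggested fix:
Include the op in the regular Metax custom ops build/packaging flow, or have the new launch/benchmark scripts explicitly build and verify the .so. At a minimum, log a warning when loading fails so that the performance regression is not silent.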
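
The "warn instead of failing silently" part of this suggestion could look roughly like the sketch below. It is not the PR's loader: `try_load_fused_rope` and the logger name are hypothetical, and the actual library-loading call is passed in as `load_fn` because the mechanism used in siglip_ops.py is not shown in this thread.

```python
import logging
import os
from typing import Callable

logger = logging.getLogger("siglip_ops_sketch")  # hypothetical logger name


def try_load_fused_rope(so_path: str, load_fn: Callable[[str], object]) -> bool:
    """Return True if the fused RoPE library loaded; otherwise warn and return False.

    `load_fn` is whatever call actually registers the custom op (an assumption:
    the real loader is not part of this sketch).
    """
    if not os.path.isfile(so_path):
        logger.warning(
            "fused RoPE library not found at %s; falling back to the native RoPE path",
            so_path,
        )
        return False
    try:
        load_fn(so_path)
    except Exception as exc:  # keep the fallback, but make the failure visible
        logger.warning(
            "failed to load %s (%s); falling back to the native RoPE path",
            so_path,
            exc,
        )
        return False
    return True
```

A module-level flag such as `_MACA_FUSED_ROPE_OK` could then be set from the return value, so a missing or broken .so shows up in the logs on the first import rather than only as lower throughput.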
