[OP] Support qk rmsnorm fused && do_rope && write_cache for head_dim 192 by zhoutianzi666 · Pull Request #8016 · PaddlePaddle/FastDeploy

zhoutianzi666 · 2026-06-08T03:48:00Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

codecov-commenter · 2026-06-08T04:22:51Z

Codecov Report

❌ Patch coverage is 48.51485% with 52 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8ddd970). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...eploy/model_executor/ops/triton_ops/write_cache.py | 45.45% | 24 Missing ⚠️ |
| ...astdeploy/model_executor/ops/triton_ops/do_rope.py | 45.00% | 22 Missing ⚠️ |
| ...executor/ops/triton_ops/qk_rmsnorm_fused_kernel.py | 25.00% | 5 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8016   +/-   ##
==========================================
  Coverage           ?   66.40%           
==========================================
  Files              ?      470           
  Lines              ?    66108           
  Branches           ?    10186           
==========================================
  Hits               ?    43896           
  Misses             ?    19420           
  Partials           ?     2792

| Flag | Coverage Δ |
|---|---|
| GPU | `76.30% <48.51%> (?)` |
| XPU | `7.01% <0.00%> (?)` |


PaddlePaddle-bot · 2026-06-09T00:18:48Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-09 08:17:21 UTC+08:00

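The CI report is generated from the following code (updated every 30 minutes):
PR commit: dfcf69f | Merge base: 8ddd970 (branch: develop)

1 Required jobs: 9/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 38 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| `Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage` | PR issue: incremental coverage 23% < 80% | High | Job |

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)
Failed cases: none. The unit-test stage passed; the failure occurred in the coverage threshold check.

| Case | Error summary |
|---|---|
| `Verify Code Coverage Threshold (80%)` | Incremental coverage is 23%, below the 80% threshold; 77 of the 101 counted changed lines are uncovered. |

Key log: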

GPU Patch Coverage Details:
total_percent_covered: 23
total_num_lines: 101
total_num_violations: 77
num_changed_lines: 252
Process completed with exit code 9.
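
The percentage in the log can be reproduced from its two line counts. The following is a minimal sketch, assuming the reported value is the covered fraction truncated to an integer; the variable names mirror the log keys, and the truncation is an inference from the numbers rather than something the log states.

```python
total_num_lines = 101       # changed lines that count toward coverage
total_num_violations = 77   # counted lines not executed by any test

covered = total_num_lines - total_num_violations
percent = 100 * covered / total_num_lines

print(covered)            # 24
print(round(percent, 2))  # 23.76
print(int(percent))       # 23, below the 80% threshold
```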

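Root cause summary: insufficient coverage of the new Triton/attention code
The PR adds do_rope.py, write_cache.py, and the only_do_attn branch, but the current tests do not execute these paths. The coverage JSON shows write_cache.py at 15.91%, do_rope.py at 17.5%, and qk_rmsnorm_fused_kernel.py at 37.5%, which brings the overall diff coverage to only 23%.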

Suggested fixes:

- Add operator-level tests for fastdeploy/model_executor/ops/triton_ops/do_rope.py:60 and fastdeploy/model_executor/ops/triton_ops/write_cache.py:61: construct inputs such as qkv_out, rotary emb, and cache/page table, then verify the RoPE and KV cache write results.
- Extend tests/operators/test_qk_rmsnorm_fused.py to cover head_dim=192 and the head_dim_v != head_dim case, exercising new logic such as qk_rmsnorm_fused_kernel.py:114 and :120.
- Add a test for the only_do_attn=True path at fastdeploy/model_executor/layers/attention/append_attn_backend.py:316 and confirm that it calls do_rope and write_cache.
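
An operator-level test of this kind usually needs a slow but obviously correct reference to compare against. The sketch below is one possible NumPy reference, not the PR's code: the rotate-half RoPE form, the `[blocks, heads, block_size, dim]` cache layout, and the names `rope_ref`, `write_cache_ref`, and `block_table` are all assumptions made for illustration and would have to be matched to the real kernels.

```python
import numpy as np

def rope_ref(x, cos, sin):
    """Rotate-half RoPE reference. x: [tokens, heads, dim]; cos, sin: [tokens, dim // 2]."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    c, s = cos[:, None, :], sin[:, None, :]  # broadcast over heads
    return np.concatenate([x1 * c - x2 * s, x2 * c + x1 * s], axis=-1)

def write_cache_ref(cache, k, block_table, block_size):
    """Write token i of one sequence into its paged slot.

    cache: [blocks, heads, block_size, dim] (assumed layout).
    """
    for i in range(k.shape[0]):
        block = block_table[i // block_size]   # logical block -> physical block
        cache[block, :, i % block_size, :] = k[i]
    return cache

rng = np.random.default_rng(0)
tokens, heads, dim, block_size = 5, 2, 192, 4
k = rng.standard_normal((tokens, heads, dim)).astype(np.float32)

# One angle per (position, frequency) pair.
pos = np.arange(tokens, dtype=np.float32)[:, None]
inv_freq = 1.0 / (10000.0 ** (np.arange(dim // 2, dtype=np.float32) / (dim // 2)))
angles = pos * inv_freq[None, :]
k_rot = rope_ref(k, np.cos(angles), np.sin(angles))

block_table = [3, 1]  # logical block 0 -> physical 3, logical block 1 -> physical 1
cache = np.zeros((4, heads, block_size, dim), dtype=np.float32)
write_cache_ref(cache, k_rot, block_table, block_size)
```

A real test would run the Triton `do_rope` and `write_cache` on the same inputs and compare their outputs with these references under a tolerance.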

Related changes: fastdeploy/model_executor/ops/triton_ops/do_rope.py, fastdeploy/model_executor/ops/triton_ops/write_cache.py, fastdeploy/model_executor/ops/triton_ops/qk_rmsnorm_fused_kernel.py, fastdeploy/model_executor/layers/attention/append_attn_backend.py, tests/operators/test_qk_rmsnorm_fused.py

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-09 09:17:28

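📋 Review summary

PR overview: supports head_dim 192 in qk_rmsnorm_fused and adds a Triton RoPE/cache-write path under only_do_attn.
Scope of changes: Triton OP, append attention, RoPE/cache write, and the related operator/attention tests.
Affected tags: [OP] [KVCache]

Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | `fastdeploy/model_executor/layers/attention/append_attn_backend.py:328` | After `only_do_attn` pre-writes the cache, the default output branch does not pass `only_do_attn` through, so the C++ RoPE/write-cache still runs a second time |
| 🟡 Suggestion | `fastdeploy/model_executor/layers/attention/append_attn_backend.py:639` | The baseline check for 192 / `head_dim_v=128` is commented out, so the test no longer verifies the result |
| 🟡 Suggestion | `tests/layers/test_attention_layer.py:429` | A new unconditional `return` lets the Flash Attention test pass silently |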

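Status of earlier findings

| Finding | Issue | Status |
|---|---|---|
| F1 | The weight load in `qk_rmsnorm_fused_kernel.py` does not mask the padded `offs_d` | ⚠️ Still present |
| F2 | `test_qk_rmsnorm_fused.py` does not cover this PR's `head_dim=192` padding branch | ⚠️ Still present |
| F3 | `do_rope.py` still rotates the contiguous first and second halves of the dimension and uses `cos_emb.shape[-1]` directly as the RoPE dimension | ⚠️ Still present |
| F4 | `do_rope.py` still derives `head_dim_v` from `cache_k.shape[-1]` | ⚠️ Still present |
| F5 | The `only_do_attn` branch still fetches the cache before the cache-quant check, and the Triton `write_cache` has no scale/zero-point support | ⚠️ Still present |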
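
Finding F1 concerns what happens when a 192-wide head is processed in a block padded to the next power of two. The following NumPy emulation is a sketch of the idea only, not the kernel's code: `masked_load` stands in for a masked block load such as Triton's `tl.load(ptr, mask=..., other=...)`, and all function names here are hypothetical.

```python
import numpy as np

def next_power_of_2(n):
    return 1 << (n - 1).bit_length()

def masked_load(buf, offs, mask, other=0.0):
    """Emulate a masked block load: lanes with mask=False are never read and get `other`."""
    out = np.full(offs.shape, other, dtype=buf.dtype)
    out[mask] = buf[offs[mask]]
    return out

def rmsnorm_padded(x, weight, head_dim, eps=1e-6):
    block_d = next_power_of_2(head_dim)       # 192 -> 256
    offs_d = np.arange(block_d)
    mask = offs_d < head_dim                  # only the first head_dim lanes are valid
    xv = masked_load(x, offs_d, mask)
    wv = masked_load(weight, offs_d, mask)    # the weight load needs the mask too
    var = np.sum(xv * xv) / head_dim          # divide by head_dim, not block_d
    return (xv / np.sqrt(var + eps) * wv)[mask]

head_dim = 192
rng = np.random.default_rng(0)
x = rng.standard_normal(head_dim).astype(np.float32)
w = rng.standard_normal(head_dim).astype(np.float32)

ref = x / np.sqrt(np.mean(x * x) + 1e-6) * w   # unpadded reference
out = rmsnorm_padded(x, w, head_dim)
```

In this emulation an unmasked `weight[offs_d]` would raise an `IndexError` for lanes 192 to 255; on a GPU the equivalent unmasked load would read past the end of the weight buffer instead.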

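📝 PR convention check

The title already includes the official [OP] tag. The PR description still contains the template placeholders, and Motivation / Modifications / Usage or Command / Accuracy Tests are not meaningfully filled in.

Suggested title (can be copied directly):

[OP] Support qk rmsnorm fused head_dim 192

Suggested PR description (can be copied directly):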

## Motivation

Support `qk_rmsnorm_fused` for non-power-of-two head dimensions such as 192.

## Modifications

- Update the Triton `qk_rmsnorm_fused` kernel to use `triton.next_power_of_2(head_dim)` with dimension masks for Q/K loads and stores.
- Add `head_dim_v` handling and qkv layout shape validation in `qk_rmsnorm_fused`.
- Treat `qk_rmsnorm_fused` as an in-place operation in `QKRMSNorm.forward` and `tests/operators/test_qk_rmsnorm_fused.py`.

## Usage or Command

N/A

## Accuracy Tests

`tests/operators/test_qk_rmsnorm_fused.py::TestQKNorm::test_qk_norm_result` compares the Triton fused output with the Paddle reference path. The current diff does not provide a `head_dim=192` accuracy result.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

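Overall assessment

Several earlier blocking issues remain unfixed. In addition, the new only_do_attn preprocessing path duplicates the original C++ RoPE/cache write in the default output branch, which needs to be fixed before merging. On the test side, the numerical checks should be restored so that silent passes do not mask regressions.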

PaddlePaddle-bot · 2026-06-09T01:19:01Z

+                cache_v,
+            )
+
+            write_cache(


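🔴 Bug: The only_do_attn branch calls write_cache first, but with the default self.use_output=True the code below goes through append_attention_with_output. That C++ op has no only_do_attn attribute, so internally it still runs the original RoPE/write-cache path with the default value false.

As a result, in the output branch the same qkv is first processed by the Triton do_rope/write_cache and then processed again by EncoderWriteCacheWithRopeKernel inside append_attention_with_output, so both the prefill cache and the attention input are rotated and rewritten a second time.

Suggested fix: pass only_do_attn all the way through the Python call to append_attention_with_output, the C++ function signature, the PD_BUILD_STATIC_OP attrs, and InferShape/InferDtype, and pass the value when AppendAttentionWithOutput calls AppendAttentionKernel. Alternatively, explicitly skip the original C++ RoPE/write-cache path in the output branch.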
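
The reason a second pass is harmful is that RoPE is a rotation and therefore not idempotent: applying it twice rotates by twice the angle. A small NumPy sketch, assuming the rotate-half form of RoPE (the function `rope` is illustrative and not the PR's kernel):

```python
import numpy as np

def rope(x, theta):
    """Rotate-half RoPE on one vector; theta holds one angle per (x1, x2) pair."""
    half = x.shape[-1] // 2
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x2 * np.cos(theta) + x1 * np.sin(theta)])

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
theta = np.linspace(0.1, 0.4, 4)

once = rope(x, theta)
twice = rope(once, theta)   # what a second RoPE pass would produce

# Two passes equal a single pass with doubled angles, which differs from one pass.
same_as_double_angle = np.allclose(twice, rope(x, 2 * theta))
changed = not np.allclose(twice, once)
```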

PaddlePaddle-bot · 2026-06-09T01:19:01Z


        # print((res - res_baseline).abs().max())
-        assert (res - res_baseline).abs().max() <= 0.1
+        # assert (res - res_baseline).abs().max() <= 0.1


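🟡 Suggestion: This comments out the only result check for the head_dim_q=192, head_dim_v=128 case. forward_unitest still builds res_baseline, but res is no longer compared against it.

The test would therefore pass even if the new RoPE/cache/write path produced wrong output. Consider restoring the assertion, or switching to np.testing.assert_allclose/paddle.allclose, and keep the numerical check for 192 and for a differing V head dim.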
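
One way to restore the check with a more informative failure message is `np.testing.assert_allclose`. The sketch below uses small stand-in arrays; the helper `check_close` is hypothetical, and in the real test `res` and `res_baseline` are presumably Paddle tensors that would first be converted to NumPy arrays.

```python
import numpy as np

def check_close(res, res_baseline, atol=0.1):
    """Fail with a readable report if any element differs from the baseline by more than atol."""
    np.testing.assert_allclose(res, res_baseline, rtol=0.0, atol=atol)

baseline = np.array([0.50, -1.25, 2.00], dtype=np.float32)
good = baseline + np.float32(0.05)   # within the 0.1 tolerance
bad = baseline + np.float32(0.50)    # outside the tolerance

check_close(good, baseline)          # passes silently

try:
    check_close(bad, baseline)
    caught = False
except AssertionError:
    caught = True                    # the mismatch is reported instead of ignored
```

Setting `rtol=0.0` keeps the check equivalent in spirit to the original `abs().max() <= 0.1` assertion, which used an absolute tolerance only.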

PaddlePaddle-bot · 2026-06-09T01:19:01Z

        # p.stop()

    def test_flash_attn_v3(self):
+        return


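🟡 Suggestion: The new unconditional return makes this test pass silently in every environment; the existing SM-conditional skipTest and the actual forward path below it never run. The same pattern was added to test_flash_attn_v3_with_mask and test_flash_attn_v4.

If these cases cannot run for now, use self.skipTest(...) with a reason or ignore them explicitly in the pytest configuration. Do not let the test function return success directly, otherwise Flash Attention regressions will be hidden.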
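
The difference between the two approaches shows up in the test report. The sketch below uses only the standard `unittest` module; `flash_attn_supported` and the test class are hypothetical stand-ins, not the project's code.

```python
import unittest

def flash_attn_supported():
    """Hypothetical capability probe; a real test would query the GPU SM version."""
    return False

class TestFlashAttn(unittest.TestCase):
    def test_silent_return(self):
        return  # reported as a pass although nothing ran

    def test_explicit_skip(self):
        if not flash_attn_supported():
            self.skipTest("Flash Attention v3 is not supported on this device")
        self.fail("forward path would run here")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFlashAttn)
result = unittest.TestResult()
suite.run(result)

# Tests run, tests skipped, tests failed.
print(result.testsRun, len(result.skipped), len(result.failures))  # 2 1 0
```

The skipped test is counted and carries its reason in `result.skipped`, while the early `return` is indistinguishable from a genuine pass.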

chang-wenbin

LGTM

zhoutianzi666 added 2 commits June 8, 2026 11:34

commit

5f4f7ef

commit

d2c065c

zhoutianzi666 temporarily deployed to Metax_ci June 8, 2026 03:48 — with GitHub Actions Inactive

zhoutianzi666 changed the title ~~Support qk rmsnorm fused 192~~ [OP] Support qk rmsnorm fused 192 Jun 8, 2026


commit

63f848c

zhoutianzi666 had a problem deploying to Metax_ci June 8, 2026 13:45 — with GitHub Actions Error

commit

dfcf69f

zhoutianzi666 had a problem deploying to Metax_ci June 8, 2026 13:48 — with GitHub Actions Failure


commit

1583d27

zhoutianzi666 had a problem deploying to Metax_ci June 9, 2026 01:08 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 9, 2026


zhoutianzi666 changed the title ~~[OP] Support qk rmsnorm fused 192~~ [OP] Support qk rmsnorm fused && do_rope && write_cache for head_dim 192 Jun 9, 2026

chang-wenbin approved these changes Jun 9, 2026


Jiang-Jia-Jun merged commit 03d837a into PaddlePaddle:develop Jun 9, 2026
39 of 43 checks passed

Jiang-Jia-Jun merged 5 commits into PaddlePaddle:develop from zhoutianzi666:support_qk_rmsnorm_fused_192


5 participants
