[OP] Support qk rmsnorm fused && do_rope && write_cache for head_dim 192 #8016
Codecov Report ❌ Patch coverage
Coverage diff:
| | develop | #8016 |
|---|---|---|
| Coverage | ? | 66.40% |
| Files | ? | 470 |
| Lines | ? | 66108 |
| Branches | ? | 10186 |
| Hits | ? | 43896 |
| Misses | ? | 19420 |
| Partials | ? | 2792 |
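CI report, generated from the following code (refreshed every 30 minutes):
1. Required tasks: 9/10 passed
2. Failure details: 🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high). Analyzer: generic analysis (fallback)
Key logs:
Suggested fix:
Related changes: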
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-09 09:17:28
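📋 Review summary
PR overview: adds head_dim 192 support to qk_rmsnorm_fused and introduces a Triton RoPE/cache-write path under only_do_attn.
Scope of changes: Triton OP, append attention, RoPE/cache write, and the related operator and attention tests.
Impact tags: [OP] [KVCache]
Issues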
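| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/attention/append_attn_backend.py:328 | After only_do_attn pre-writes the cache, the default output branch does not pass only_do_attn through, so the C++ RoPE/write-cache path still runs a second time |
| 🟡 Suggestion | fastdeploy/model_executor/layers/attention/append_attn_backend.py:639 | The baseline check for head_dim 192 / head_dim_v=128 is commented out, so the test no longer verifies results |
| 🟡 Suggestion | tests/layers/test_attention_layer.py:429 | A newly added unconditional return makes the Flash Attention tests pass silently |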
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | The weight load in qk_rmsnorm_fused_kernel.py is not masked for the padded offs_d | |
| F2 | test_qk_rmsnorm_fused.py does not cover this PR's head_dim=192 padding branch | |
| F3 | do_rope.py still rotates the contiguous first and second halves and uses cos_emb.shape[-1] directly as the RoPE dimension | |
| F4 | do_rope.py still derives head_dim_v from cache_k.shape[-1] | |
| F5 | The only_do_attn branch still fetches the cache before the cache-quant check, and the Triton write_cache has no scale/zero-point support | |
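Findings F1 and F2 both concern the padded block used for head_dim=192. The following is a minimal pure-Python sketch of that idea, not the kernel from this PR: `masked_load` and `rmsnorm_padded` are hypothetical names that emulate a Triton-style masked load over a block of size `next_power_of_2(head_dim)`, under the assumption that the mean square is taken over the real `head_dim` rather than the padded block.

```python
import math


def next_power_of_2(n):
    # Smallest power of two >= n (mirrors what triton.next_power_of_2 computes).
    return 1 << (n - 1).bit_length()


def masked_load(buf, offsets, mask, other=0.0):
    # Emulates a masked load: lanes outside the mask never touch `buf`.
    return [buf[o] if m else other for o, m in zip(offsets, mask)]


def rmsnorm_padded(x, weight, eps=1e-6):
    head_dim = len(x)
    block = next_power_of_2(head_dim)      # 256 when head_dim is 192
    offs_d = list(range(block))
    mask = [o < head_dim for o in offs_d]  # only the first head_dim lanes are real

    xv = masked_load(x, offs_d, mask)
    wv = masked_load(weight, offs_d, mask)  # the weight needs the same mask (F1)

    # Assumption: normalize by the real head_dim; padded lanes contribute zero.
    mean_sq = sum(v * v for v in xv) / head_dim
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    out = [v * inv_rms * w for v, w in zip(xv, wv)]

    # Masked store: only real lanes are written back.
    return [o for o, m in zip(out, mask) if m]
```

A test for the padding branch (F2) would compare this against an unpadded reference at head_dim=192, which exercises the 64 masked lanes.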
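📝 PR convention check
The title already includes the official [OP] tag; the PR description still contains the template placeholders, and Motivation / Modifications / Usage or Command / Accuracy Tests are not meaningfully filled in.
Suggested title (can be copied directly):
[OP] Support qk rmsnorm fused head_dim 192
Suggested PR description (can be copied directly)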
## Motivation
Support `qk_rmsnorm_fused` for non-power-of-two head dimensions such as 192.
## Modifications
- Update the Triton `qk_rmsnorm_fused` kernel to use `triton.next_power_of_2(head_dim)` with dimension masks for Q/K loads and stores.
- Add `head_dim_v` handling and qkv layout shape validation in `qk_rmsnorm_fused`.
- Treat `qk_rmsnorm_fused` as an in-place operation in `QKRMSNorm.forward` and `tests/operators/test_qk_rmsnorm_fused.py`.
## Usage or Command
N/A
## Accuracy Tests
`tests/operators/test_qk_rmsnorm_fused.py::TestQKNorm::test_qk_norm_result` compares the Triton fused output with the Paddle reference path. The current diff does not provide a `head_dim=192` accuracy result.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
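- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
Several blocking issues from previous reviews remain unfixed. In addition, the new only_do_attn preprocessing path duplicates the original C++ RoPE/cache write on the default output branch; this must be fixed before merging. On the test side, the numerical checks need to be restored so that silent passes do not mask regressions.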
cache_v,
)

write_cache(
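🔴 Bug: The only_do_attn branch calls write_cache first, but with the default self.use_output=True the code below goes through append_attention_with_output. That C++ op has no only_do_attn attribute, so internally it still runs the original RoPE/write-cache path with the default value false.
As a result, on the output branch the same qkv is first processed by the Triton do_rope/write_cache and then once more by EncoderWriteCacheWithRopeKernel inside append_attention_with_output, so both the prefill cache and the attention input are rotated and rewritten a second time.
Suggested fix: pass only_do_attn all the way through append_attention_with_output (the Python call, the C++ function signature, the PD_BUILD_STATIC_OP attrs, and InferShape/InferDtype), and pass the value in when AppendAttentionWithOutput calls AppendAttentionKernel; alternatively, explicitly skip the original C++ RoPE/write-cache path on the output branch.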
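To illustrate why the double application matters, here is a small self-contained sketch, not the FastDeploy code. `rope` and `fused_attention_input` are hypothetical stand-ins, and the adjacent-pair rotation layout and base of 10000 are assumptions made only for the demonstration. Because the rotation angle is linear in the position, applying RoPE twice at position p equals applying it once at position 2p, so a vector that is rotated by both paths ends up encoded at the wrong position.

```python
import math


def rope(vec, pos, base=10000.0):
    # Rotate each adjacent pair (2i, 2i+1) by an angle proportional to `pos`.
    # The pair layout is an assumption for this sketch only.
    dim = len(vec)
    out = list(vec)
    for i in range(dim // 2):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def fused_attention_input(vec, pos, only_do_attn):
    # Hypothetical stand-in for a fused op: it skips its internal RoPE
    # only when the caller says the input was already pre-processed.
    return list(vec) if only_do_attn else rope(vec, pos)


v = [1.0, 0.0, 0.5, -0.5]
pre_rotated = rope(v, 3)

# Flag passed through: the pre-rotated vector reaches attention unchanged.
ok = fused_attention_input(pre_rotated, 3, only_do_attn=True)

# Flag dropped: the vector is rotated again and now matches position 6, not 3.
wrong = fused_attention_input(pre_rotated, 3, only_do_attn=False)
```

The same reasoning applies to the cache write: a second pass overwrites the prefill cache with the doubly rotated keys.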
# print((res - res_baseline).abs().max())
- assert (res - res_baseline).abs().max() <= 0.1
+ # assert (res - res_baseline).abs().max() <= 0.1
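🟡 Suggestion: This comments out the only result check for the head_dim_q=192, head_dim_v=128 case. forward_unitest still builds res_baseline, but res is no longer compared against it.
As a result, the test passes even if the new RoPE/cache/write path produces wrong output. Consider restoring the assertion, or switching to np.testing.assert_allclose / paddle.allclose, and keep the numerical checks for 192 and for a different V head dim.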
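One way to keep the check while making failures easier to diagnose is a helper that reports the worst deviation. The sketch below is a dependency-free stand-in written for illustration; `assert_max_abs_diff` is a hypothetical name, and in the real test np.testing.assert_allclose or paddle.allclose on the tensors would play this role. The 0.1 tolerance is taken from the original assertion.

```python
def assert_max_abs_diff(res, baseline, atol=0.1):
    # Compare two flat sequences and fail loudly when the largest
    # element-wise deviation exceeds the tolerance.
    if len(res) != len(baseline):
        raise AssertionError(
            f"length mismatch: {len(res)} vs {len(baseline)}"
        )
    worst = max(abs(a - b) for a, b in zip(res, baseline))
    if worst > atol:
        raise AssertionError(
            f"max abs diff {worst:.6f} exceeds tolerance {atol}"
        )
    return worst


# Within tolerance: returns the observed deviation.
observed = assert_max_abs_diff([1.0, 2.0], [1.05, 2.0])

# Out of tolerance: raises instead of passing silently.
try:
    assert_max_abs_diff([1.0, 2.0], [1.5, 2.0])
    raised = False
except AssertionError:
    raised = True
```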
# p.stop()

def test_flash_attn_v3(self):
    return
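🟡 Suggestion: The newly added unconditional return makes this test pass silently in every environment; neither the original SM-conditional skipTest below nor the actual forward path is executed. The same pattern was added to test_flash_attn_v3_with_mask and test_flash_attn_v4.
If these cases cannot run for now, use self.skipTest(...) with a reason or ignore them explicitly in the pytest configuration. Do not let the test function return success directly, otherwise Flash Attention regressions will be hidden.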
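A minimal sketch of the difference, using only the standard `unittest` module. The `flash_attn_supported` probe and the test class are hypothetical and stand in for whatever capability check the real test uses; the point is that a skip is recorded in the result, whereas a bare `return` is indistinguishable from a pass.

```python
import unittest


def flash_attn_supported():
    # Hypothetical capability probe; the real test checks the SM version.
    return False


class TestFlashAttnSketch(unittest.TestCase):
    def test_flash_attn_v3(self):
        if not flash_attn_supported():
            # Recorded as "skipped" with a reason, not as a silent pass.
            self.skipTest("Flash Attention v3 is not available in this environment")
        self.fail("forward path would run here")


def run_sketch():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFlashAttnSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_sketch()
skip_reasons = [reason for _, reason in result.skipped]
```

With this shape, CI summaries show the test as skipped together with the reason, so a disabled Flash Attention case stays visible.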