
[OP] Support qk rmsnorm fused && do_rope && write_cache for head_dim 192#8016

Merged
Jiang-Jia-Jun merged 5 commits into PaddlePaddle:develop from zhoutianzi666:support_qk_rmsnorm_fused_192
Jun 9, 2026

Conversation

@zhoutianzi666

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 48.51485% with 52 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8ddd970). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...eploy/model_executor/ops/triton_ops/write_cache.py | 45.45% | 24 Missing ⚠️ |
| ...astdeploy/model_executor/ops/triton_ops/do_rope.py | 45.00% | 22 Missing ⚠️ |
| ...executor/ops/triton_ops/qk_rmsnorm_fused_kernel.py | 25.00% | 5 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8016   +/-   ##
==========================================
  Coverage           ?   66.40%           
==========================================
  Files              ?      470           
  Lines              ?    66108           
  Branches           ?    10186           
==========================================
  Hits               ?    43896           
  Misses             ?    19420           
  Partials           ?     2792           
Flag Coverage Δ
GPU 76.30% <48.51%> (?)
XPU 7.01% <0.00%> (?)

@zhoutianzi666 zhoutianzi666 changed the title Support qk rmsnorm fused 192 [OP] Support qk rmsnorm fused 192 Jun 8, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-09 08:17:21 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: dfcf69f | Merge base: 8ddd970 (branch: develop)


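1 Required jobs: 9/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42(0) | 42 | 38 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue: incremental coverage 23% < 80% | | Job |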

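2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)
Failed cases: none. The unit-test stage passed; the failure occurred in the coverage threshold check.

| Case | Error summary |
| --- | --- |
| Verify Code Coverage Threshold (80%) | Incremental coverage is 23%, below the 80% threshold; 77 of the 101 measured changed lines are uncovered. |

Key log: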

GPU Patch Coverage Details:
total_percent_covered: 23
total_num_lines: 101
total_num_violations: 77
num_changed_lines: 252
Process completed with exit code 9.
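  • Root cause summary: insufficient coverage of the new Triton/attention code
    The PR adds do_rope.py, write_cache.py, and the only_do_attn branch, but the current tests do not execute these paths. The coverage JSON shows write_cache.py at 15.91%, do_rope.py at 17.5%, and qk_rmsnorm_fused_kernel.py at 37.5%, which leaves overall diff coverage at only 23%.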
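
The 23% figure can be reproduced from the log values above. The following is a minimal sketch of that arithmetic, assuming the checker divides covered lines by measured lines and truncates to an integer; the truncation and the function name `diff_coverage_percent` are assumptions, since the log only shows the final value.

```python
def diff_coverage_percent(total_lines: int, violations: int) -> int:
    """Percent of measured changed lines that are covered, truncated to an int."""
    if total_lines == 0:
        return 100  # nothing measured, so nothing is uncovered
    covered = total_lines - violations
    return int(100 * covered / total_lines)


# Values taken from the CI log: 101 measured lines, 77 uncovered.
percent = diff_coverage_percent(total_lines=101, violations=77)
threshold = 80  # threshold named by the "Verify Code Coverage Threshold (80%)" step
passes = percent >= threshold
print(percent, passes)
```

With 24 of 101 lines covered, the check stays well below the 80% threshold, which is why the job fails even though every unit test passed.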

Suggested fixes:

  1. Add operator-level tests for fastdeploy/model_executor/ops/triton_ops/do_rope.py:60 and fastdeploy/model_executor/ops/triton_ops/write_cache.py:61. Construct inputs such as qkv_out, rotary emb, and cache/page table, and verify the RoPE and KV cache write results.
  2. Extend tests/operators/test_qk_rmsnorm_fused.py to cover head_dim=192 and the head_dim_v != head_dim case, so that new logic such as qk_rmsnorm_fused_kernel.py:114:120 is exercised.
  3. Add a test for the only_do_attn=True path at fastdeploy/model_executor/layers/attention/append_attn_backend.py:316 to ensure that do_rope and write_cache are called.

Related changes: fastdeploy/model_executor/ops/triton_ops/do_rope.py, fastdeploy/model_executor/ops/triton_ops/write_cache.py, fastdeploy/model_executor/ops/triton_ops/qk_rmsnorm_fused_kernel.py, fastdeploy/model_executor/layers/attention/append_attn_backend.py, tests/operators/test_qk_rmsnorm_fused.py
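
An operator-level RoPE test needs a reference implementation to compare against. The sketch below is one possible NumPy reference, not FastDeploy code: the function names, shapes, and the base of 10000 are all assumptions. It implements both the contiguous-halves convention and the interleaved convention, because a test should pin down which one the model expects.

```python
import numpy as np


def rope_half(x, cos, sin):
    """Rotate-half RoPE: pairs element i with element i + d/2."""
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)


def rope_interleaved(x, cos, sin):
    """Interleaved RoPE: pairs element 2i with element 2i + 1."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out


rng = np.random.default_rng(0)
seq_len, rope_dim = 4, 64  # illustrative sizes only
x = rng.standard_normal((seq_len, rope_dim))

# One angle per (position, frequency) pair; shape [seq_len, rope_dim / 2].
inv_freq = 1.0 / (10000.0 ** (np.arange(0, rope_dim, 2) / rope_dim))
angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
cos, sin = np.cos(angles), np.sin(angles)

out_half = rope_half(x, cos, sin)
out_inter = rope_interleaved(x, cos, sin)
```

Both variants leave position 0 unchanged and preserve vector norms, yet they produce different outputs for later positions, so a test that checks only norms or only position 0 cannot tell them apart.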

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-09 09:17:28

📋 Review summary

PR overview: adds head_dim 192 support to qk_rmsnorm_fused and a new Triton RoPE/cache-write path under only_do_attn.
Scope of changes: Triton OP, append attention, RoPE/cache write, and the related operator/attention tests.
Impact tags: [OP] [KVCache]

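Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/model_executor/layers/attention/append_attn_backend.py:328 | After only_do_attn pre-writes the cache, the default output branch does not pass only_do_attn through, so the C++ RoPE/write-cache runs a second time |
| 🟡 Suggestion | fastdeploy/model_executor/layers/attention/append_attn_backend.py:639 | The baseline check for the 192 / head_dim_v=128 case is commented out, so the test no longer verifies the result |
| 🟡 Suggestion | tests/layers/test_attention_layer.py:429 | A newly added unconditional return makes the Flash Attention tests pass silently |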

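Status of previous findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | The weight load in qk_rmsnorm_fused_kernel.py does not mask the padded offs_d | ⚠️ Still present |
| F2 | test_qk_rmsnorm_fused.py does not cover this PR's head_dim=192 padding branch | ⚠️ Still present |
| F3 | do_rope.py still rotates the contiguous first and second halves of the dimension and uses cos_emb.shape[-1] directly as the RoPE dimension | ⚠️ Still present |
| F4 | do_rope.py still derives head_dim_v from cache_k.shape[-1] | ⚠️ Still present |
| F5 | The only_do_attn branch still fetches the cache before the cache-quant check, and the Triton write_cache has no scale/zero-point support | ⚠️ Still present |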
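
Findings F1 and F2 both concern the padding that a non-power-of-two head_dim introduces. The sketch below emulates that padding in NumPy rather than Triton, so it is an illustration of the masking idea and not the kernel itself; the helper names and the eps value are assumptions. A block of size 256 covers head_dim=192, lanes beyond 192 are masked on load for both the activations and the weights, and the mean of squares is divided by the true head_dim.

```python
import numpy as np


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def rmsnorm_reference(x, weight, eps):
    """Plain RMSNorm over the last (and only) dimension."""
    return x / np.sqrt(np.mean(x * x) + eps) * weight


def rmsnorm_padded(x, weight, eps):
    """Emulate a block of size next_power_of_2(head_dim) with masked loads."""
    head_dim = x.shape[-1]
    block = next_power_of_2(head_dim)
    offs_d = np.arange(block)
    mask = offs_d < head_dim
    # Masked loads: lanes outside head_dim read as 0.0 for x AND weight.
    x_blk = np.zeros(block)
    x_blk[mask] = x
    w_blk = np.zeros(block)
    w_blk[mask] = weight
    # Divide by the true head_dim, not by the padded block size.
    var = np.sum(x_blk * x_blk) / head_dim
    y_blk = x_blk / np.sqrt(var + eps) * w_blk
    return y_blk[mask]  # masked store


rng = np.random.default_rng(0)
head_dim, eps = 192, 1e-6  # eps is an illustrative value
x = rng.standard_normal(head_dim)
weight = rng.standard_normal(head_dim)

out = rmsnorm_padded(x, weight, eps)
ref = rmsnorm_reference(x, weight, eps)
print(next_power_of_2(head_dim), np.allclose(out, ref))
```

A test along these lines for head_dim=192 would exercise the padding branch that F2 reports as uncovered. In a real kernel an unmasked weight load reads past the end of the weight buffer, which is the risk F1 describes.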

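📝 PR convention check

The title already includes the official [OP] tag. The PR description still contains the template placeholders, and Motivation / Modifications / Usage or Command / Accuracy Tests are not meaningfully filled in.

Suggested title (can be copied as-is):

  • [OP] Support qk rmsnorm fused head_dim 192
Suggested PR description (can be copied as-is)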
## Motivation

Support `qk_rmsnorm_fused` for non-power-of-two head dimensions such as 192.

## Modifications

- Update the Triton `qk_rmsnorm_fused` kernel to use `triton.next_power_of_2(head_dim)` with dimension masks for Q/K loads and stores.
- Add `head_dim_v` handling and qkv layout shape validation in `qk_rmsnorm_fused`.
- Treat `qk_rmsnorm_fused` as an in-place operation in `QKRMSNorm.forward` and `tests/operators/test_qk_rmsnorm_fused.py`.

## Usage or Command

N/A

## Accuracy Tests

`tests/operators/test_qk_rmsnorm_fused.py::TestQKNorm::test_qk_norm_result` compares the Triton fused output with the Paddle reference path. The current diff does not provide a `head_dim=192` accuracy result.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

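Overall assessment

Several previously reported blocking issues remain unfixed. In addition, in the default output branch the new only_do_attn preprocessing path duplicates the original C++ RoPE/cache write; this needs to be fixed before merging. On the test side, the numerical checks need to be restored so that silent passes do not mask regressions.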

cache_v,
)

write_cache(

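🔴 Bug: the only_do_attn branch calls write_cache first, but with the default self.use_output=True the code below goes through append_attention_with_output. That C++ op has no only_do_attn attribute, so internally it still runs the original RoPE/write-cache path with the default value of false.

As a result, in the output branch the same qkv is first processed by the Triton do_rope/write_cache and then processed again by EncoderWriteCacheWithRopeKernel inside append_attention_with_output, so both the prefill cache and the attention input are rotated and rewritten a second time.

Suggested fix: pass only_do_attn all the way through append_attention_with_output, covering the Python call, the C++ function signature, the PD_BUILD_STATIC_OP attrs, and InferShape/InferDtype, and pass the value in when AppendAttentionWithOutput calls AppendAttentionKernel. Alternatively, explicitly skip the original C++ RoPE/write-cache path in the output branch.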
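
The reason a second pass is harmful is that RoPE is a rotation, and rotations compose by adding their angles, so the operation is not idempotent. A minimal sketch on a single two-element pair, using a hypothetical `rotate_pair` helper and an arbitrary angle:

```python
import math


def rotate_pair(x1: float, x2: float, theta: float):
    """Apply a 2-D rotation by theta, the per-pair operation RoPE performs."""
    c, s = math.cos(theta), math.sin(theta)
    return x1 * c - x2 * s, x2 * c + x1 * s


theta = 0.3       # arbitrary angle for one (position, frequency) pair
k = (1.0, 0.0)    # one pair of key elements

once = rotate_pair(*k, theta)           # the intended result
twice = rotate_pair(*once, theta)       # what a second RoPE pass produces
as_double = rotate_pair(*k, 2 * theta)  # same as rotating by 2 * theta
```

Here `twice` matches `as_double` and differs from `once`, so a key rotated twice at position p carries the phase of position 2p, which changes the relative-position signal that attention relies on.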


# print((res - res_baseline).abs().max())
assert (res - res_baseline).abs().max() <= 0.1
# assert (res - res_baseline).abs().max() <= 0.1

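🟡 Suggestion: this comments out the only result check for the head_dim_q=192, head_dim_v=128 case. forward_unitest still builds res_baseline, but res is no longer compared against the baseline afterwards.

As a result, the test passes even if the new RoPE/cache/write path produces wrong output. Please restore the assertion, or switch to np.testing.assert_allclose/paddle.allclose, and keep the numerical checks for 192 and for a different V head dim.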
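
As an illustration of the `np.testing.assert_allclose` option, the sketch below keeps the original absolute tolerance of 0.1. The arrays are synthetic stand-ins for `res` and `res_baseline`, not outputs of the real kernels, and the shape is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the tensors in the test, converted to NumPy arrays.
res_baseline = rng.standard_normal((8, 192)).astype(np.float32)
res = res_baseline + np.float32(1e-3)  # small, acceptable deviation

# Equivalent to `(res - res_baseline).abs().max() <= 0.1`, but on failure
# it reports how many elements mismatch and the largest difference.
np.testing.assert_allclose(res, res_baseline, rtol=0.0, atol=0.1)

# A single badly wrong element is enough to fail the check.
bad = res_baseline.copy()
bad[0, 0] += 1.0
try:
    np.testing.assert_allclose(bad, res_baseline, rtol=0.0, atol=0.1)
    caught = False
except AssertionError:
    caught = True
```

Setting `rtol=0.0` keeps the check purely absolute, matching the commented-out assertion; a tighter tolerance may be appropriate once the expected error of the fused path is known.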

# p.stop()

def test_flash_attn_v3(self):
return

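🟡 Suggestion: the newly added unconditional return makes this test pass silently in every environment, so neither the existing SM-conditional skipTest below it nor the actual forward path is ever executed. The same pattern was also added to test_flash_attn_v3_with_mask and test_flash_attn_v4.

If these cases cannot run for now, use self.skipTest(...) with a reason, or ignore them explicitly in the pytest configuration. Do not let the test function simply return success, otherwise Flash Attention regressions will be hidden.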
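
A skip with a reason shows up in the test report, whereas a bare `return` is counted as a pass. The sketch below shows the difference using only the standard library; `sm_version` is a hypothetical stand-in for the real compute-capability query, and the SM90 requirement is an assumption made for illustration.

```python
import unittest


def sm_version() -> int:
    """Hypothetical stand-in for a GPU compute-capability query."""
    return 80


class TestFlashAttn(unittest.TestCase):
    def test_flash_attn_v3(self):
        if sm_version() < 90:  # assumed requirement, for illustration only
            self.skipTest("Flash Attention v3 is assumed to need SM90 or newer")
        self.assertTrue(True)  # placeholder for the real forward check


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFlashAttn)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, len(result.skipped), result.wasSuccessful())
```

The run still succeeds, but the test is recorded under `result.skipped` together with its reason, so the missing coverage stays visible in CI output.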

@zhoutianzi666 zhoutianzi666 changed the title [OP] Support qk rmsnorm fused 192 [OP] Support qk rmsnorm fused && do_rope && write_cache for head_dim 192 Jun 9, 2026

@chang-wenbin chang-wenbin left a comment

Collaborator

LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 03d837a into PaddlePaddle:develop Jun 9, 2026
39 of 43 checks passed

5 participants