[ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output by JulianCloudNTH · Pull Request #20264 · pytorch/executorch

JulianCloudNTH · 2026-06-13T00:08:51Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] et_vk.prepack test suite (export + native golden) #20292
[ExecuTorch][WebGPU] Add et_vk.prepack (constant-tensor packing) for E2E weight loading #20265
[ExecuTorch][WebGPU] et_vk.apply_rotary_emb test suite (export + native golden) #20290
-> [ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output #20264
[ExecuTorch][WebGPU] et_vk.embedding_q4gsw test suite (export + native golden) #20289
[ExecuTorch][WebGPU] Add et_vk.embedding_q4gsw (4-bit groupwise-symmetric quantized embedding) #20263

Adds the WebGPU backend handler for et_vk.apply_rotary_emb.default (interleaved Llama rotary positional embedding) plus the ValueList graph-value support its multi-output signature requires.

The op rotates the query and key tensors by a shared freqs_cos/freqs_sin pair. It is composed of two dispatches of one WGSL kernel, one writing xq_out and one writing xk_out. Each thread handles one (even, odd) element pair of a head row: out[2i] = x[2i]*cos - x[2i+1]*sin and out[2i+1] = x[2i]*sin + x[2i+1]*cos. This mirrors the Vulkan apply_rotary_emb reference (buffer-only, fp32, the interleaved .default variant).

Each dispatch owns a distinct compute pipeline, because the graph destructor releases a pipeline per dispatch and a shared handle would be double-freed. The workgroup size is a wg_size pipeline-override constant clamped to the device limit. Both 1D dispatch counts go through WebGPUUtils::compute_1d_workgroup_count and are validated before any GPU-object allocation. The embedded WGSL header is generated by gen_wgsl_headers.py.
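For readers who want to check the rotation by hand, here is a minimal pure-Python sketch of the interleaved formula quoted above, applied to a single head row. The function name `rope_interleaved` and the sample values are illustrative assumptions; this is not the WGSL kernel or the Vulkan reference.

```python
import math


def rope_interleaved(row, cos, sin):
    """Rotate one head row of length 2*half_dim, pair by pair.

    Hypothetical reference helper: row[2i] and row[2i+1] form one pair that
    is rotated by the angle whose cosine/sine are cos[i]/sin[i].
    """
    if len(row) % 2 != 0:
        raise ValueError("head_dim must be even")
    half_dim = len(row) // 2
    if len(cos) != half_dim or len(sin) != half_dim:
        raise ValueError("cos/sin must each have head_dim/2 entries")

    out = [0.0] * len(row)
    for i in range(half_dim):
        x_even, x_odd = row[2 * i], row[2 * i + 1]
        out[2 * i] = x_even * cos[i] - x_odd * sin[i]
        out[2 * i + 1] = x_even * sin[i] + x_odd * cos[i]
    return out


# Pair 0 is rotated by 0 (unchanged); pair 1 is rotated by 90 degrees.
angles = [0.0, math.pi / 2]
rotated = rope_interleaved(
    [1.0, 2.0, 3.0, 4.0],
    [math.cos(a) for a in angles],
    [math.sin(a) for a in angles],
)
```

With these inputs the first pair stays (1, 2) and the second pair (3, 4) becomes approximately (-4, 3), which is the expected quarter-turn.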

The two outputs (xq_out, xk_out) are serialized by the Vulkan exporter as a single ValueList graph value, which the runtime did not previously model. This adds the ValueType::ValueList value kind, a value_lists_ table populated during build(), and a get_value_list accessor the handler uses to resolve the output ids. While in that code path it also closes a latent gap: a constant tensor whose constant_id is set but whose constants table is missing or out of range now throws (fail-loud) rather than silently leaving the buffer uninitialized.
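As a rough illustration of the ValueList idea described above, the following Python model keeps a per-value list table that is filled during build and read back through an accessor. The class layout, the serialized tuple format, and `resolve_rope_outputs` are assumptions made for this sketch; only the names ValueList, value_lists_, build() and get_value_list come from the description.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ValueType(Enum):
    NULL = auto()
    TENSOR = auto()
    VALUE_LIST = auto()


@dataclass
class Graph:
    value_types: list = field(default_factory=list)
    value_lists: list = field(default_factory=list)

    def build(self, serialized_values):
        """serialized_values: list of (ValueType, payload) tuples (assumed format)."""
        n = len(serialized_values)
        self.value_types = [ValueType.NULL] * n
        # One slot per value id, so indexing by any valid id is always in range.
        self.value_lists = [[] for _ in range(n)]
        for value_id, (kind, payload) in enumerate(serialized_values):
            self.value_types[value_id] = kind
            if kind is ValueType.VALUE_LIST:
                self.value_lists[value_id] = list(payload)

    def get_value_list(self, value_id):
        # A non-ValueList id yields an empty list rather than an error.
        return self.value_lists[value_id]


def resolve_rope_outputs(graph, out_list_id):
    """Return (xq_out_id, xk_out_id); fail loudly on the wrong arity."""
    out_ids = graph.get_value_list(out_list_id)
    if len(out_ids) != 2:
        raise ValueError(f"expected 2 outputs, got {len(out_ids)}")
    return out_ids[0], out_ids[1]


graph = Graph()
graph.build([
    (ValueType.TENSOR, None),        # id 0: xq_out
    (ValueType.TENSOR, None),        # id 1: xk_out
    (ValueType.VALUE_LIST, [0, 1]),  # id 2: the op's single output value
])
xq_out_id, xk_out_id = resolve_rope_outputs(graph, 2)
```

Passing a tensor id instead of the list id returns an empty list from the accessor, so the arity check is what rejects it.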
@exported-using-ghexport

Differential Revision: D108428756

[ghstack-poisoned]

pytorch-bot · 2026-06-13T00:08:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20264

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 9b09f89 with merge base 0378fc4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-13T00:10:07Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


JulianCloudNTH · 2026-06-15T19:03:14Z

@claude review

claude · 2026-06-15T19:04:09Z

Claude finished @JulianCloudNTH's task in 2m 34s

Code Review: `et_vk.apply_rotary_emb` + ValueList

Reviewed against origin/gh/JulianCloudNTH/26/base. Overall this is a clean, well-guarded addition — extensive shape/dtype validation, validate-before-allocate ordering, a faithful port of the interleaved Vulkan RoPE, and a thoughtfully designed test (NH≠NKV so an output swap is caught by numel). A few notes, none blocking.

Correctness — looks solid

Kernel indexing (rotary_embedding.wgsl): the pair → (b, s, head, pair_i) decomposition and base/freqs_idx arithmetic match the interleaved out[2i]=x[2i]·cos − x[2i+1]·sin contract, and freqs_idx = s·half_dim + pair_i correctly omits batch/head since freqs is shared. The pair >= num_pairs guard handles the workgroup-count rounding. Per-dispatch uniform buffers carry the distinct n_heads_q/n_heads_k, so the shared shader is correct for GQA. 👍
Validate-before-allocate: both compute_1d_workgroup_count calls run before any shader/pipeline/buffer creation, so a throw on an over-limit dispatch can't leak GPU objects. The clamped wg_size is used consistently for both the override constant and the workgroup-count math.
The WGPUBindGroupLayout is released (line 271) only after both bind groups are created — fine, since the bind groups retain it via refcount.
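The index arithmetic discussed in the correctness notes can be sketched in Python as follows. It assumes a contiguous [batch, seq, n_heads, head_dim] layout and uses hypothetical helper names; the intermediate name t2 follows the review's wording, but the actual WGSL may order the steps differently.

```python
def decompose_pair(pair, seq, n_heads, half_dim):
    """Split a flat pair index into (b, s, head, pair_i). Layout is an assumption."""
    pair_i = pair % half_dim
    t1 = pair // half_dim
    head = t1 % n_heads
    t2 = t1 // n_heads
    s = t2 % seq
    b = t2 // seq
    return b, s, head, pair_i


def indices_for_pair(pair, seq, n_heads, half_dim):
    """Return (base, freqs_idx): the even element's offset and the cos/sin offset."""
    b, s, head, pair_i = decompose_pair(pair, seq, n_heads, half_dim)
    head_dim = 2 * half_dim
    base = ((b * seq + s) * n_heads + head) * head_dim + 2 * pair_i
    # freqs is shared across batch and head, so only s and pair_i appear.
    freqs_idx = s * half_dim + pair_i
    return base, freqs_idx


# S=5 and NH=8 match the "multi" test shape; half_dim=4 is an assumed value.
example = decompose_pair(37, seq=5, n_heads=8, half_dim=4)
example_indices = indices_for_pair(37, seq=5, n_heads=8, half_dim=4)
```

For pair 37 this gives (b, s, head, pair_i) = (0, 1, 1, 1), base 74 and freqs_idx 5. Under the assumed contiguous layout, base always equals 2 * pair, which is a cheap sanity check on the decomposition.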

Notes / suggestions

1. Double pipeline creation (RotaryEmbedding.cpp:237-241). The comment "a shared handle would double-free" is accurate given the destructor (WebGPUGraph.cpp:188-195) calls wgpuComputePipelineRelease per dispatch. But creating two identical pipelines (same module, layout, constants) just to satisfy ownership is a bit wasteful: every other op goes through get_or_create_pipeline/pipeline_cache_, which the destructor frees separately (lines 201-204) and which dedupes. Consider routing through the cache instead of two raw wgpuDeviceCreateComputePipeline calls; then a single cached pipeline can back both dispatches without a double-free. Not blocking, but it would match the surrounding pattern and avoid the redundant compile.
2. Redundant num_pairs derivation (RotaryEmbedding.cpp:60 vs 184-193). add_rope_dispatch recomputes num_pairs = numel_of(out.dims)/2 while the caller independently computes workgroup_count from xq_numel/2. They're consistent only because output shape == input shape (which is asserted earlier), so it's safe. It is still worth a one-liner noting the invariant, since a future shape change would silently desync the guard from the dispatch size.
3. Fail-loud constant gap (WebGPUGraph.cpp:317-321). Good catch closing the out-of-range/missing-table case. One residual: the new else lives inside if (constant_id >= 0 && constant_data && tensor.nbytes > 0), so a tensor with constant_id >= 0 but a null constant_data pointer still silently leaves the buffer uninitialized, the same class of latent bug this PR is fixing. If that combination is meant to be unreachable, a brief comment would help; otherwise it's worth the same throw.
4. Native-test coverage (test_webgpu_native.cpp). The C++ kernel is only executed at the multi shape (S=5, NH=8, NKV=2). The decode shape (S=1, GQA 32:8) is exercised only in Python (export + golden-vs-eager), not against the actual GPU kernel. The S=1 path is exactly where the s = t2 % seq / batch decomposition is most likely to break, so running export_rope_model with shape_name="decode" and adding a second native case would close that gap. export_rope_model already takes shape_name, so this is mostly CI-script plumbing.
5. get_value_list bounds. value_lists_[id] is safe given resize(num_vals), and a non-ValueList id returns an empty vector, which is caught by the size() != 2 check. Fine. There is no validation that out_list[0]/[1] are themselves tensor ids before get_tensor, but a malformed id would fault there deterministically; acceptable.
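To make the pipeline-cache suggestion concrete, here is a small Python model of the ownership idea: the cache owns each pipeline, dispatches only borrow it, and teardown releases each cached object once. `PipelineCache`, its methods, and the key format are hypothetical; they are not the signatures of get_or_create_pipeline or pipeline_cache_.

```python
class PipelineCache:
    """Toy cache that owns pipelines; dispatches hold borrowed references."""

    def __init__(self, create_fn):
        self._create = create_fn
        self._pipelines = {}
        self.compile_count = 0

    def get_or_create(self, shader_name, constants):
        # Identical shader + override constants map to the same cached pipeline.
        key = (shader_name, tuple(sorted(constants.items())))
        if key not in self._pipelines:
            self._pipelines[key] = self._create(shader_name, constants)
            self.compile_count += 1
        return self._pipelines[key]

    def release_all(self, release_fn):
        # Each cached pipeline is released exactly once, no matter how many
        # dispatches borrowed it.
        for pipeline in self._pipelines.values():
            release_fn(pipeline)
        self._pipelines.clear()


released = []
cache = PipelineCache(lambda name, consts: object())  # stand-in for a GPU compile
q_pipeline = cache.get_or_create("rotary_embedding", {"wg_size": 64})
k_pipeline = cache.get_or_create("rotary_embedding", {"wg_size": 64})
cache.release_all(released.append)
```

In this model both dispatches share one object, one compile happens, and one release happens, which is the outcome the suggestion is after.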

Minor

RotaryParams carries three _pad u32s to reach 32 bytes; the static_assert is a nice guard. Matches the WGSL Params (8×u32).
WGSL .wgsl and the generated _wgsl.h are byte-identical and the wgsl-sha256 is present — assuming gen_wgsl_headers.py was rerun, good.
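The 32-byte size noted above can be checked with a short Python sketch using the standard struct module. The five payload field names and their order are assumptions; the source only states that the struct holds eight u32 values, three of which are padding.

```python
import struct

# Eight little-endian u32 values, no implicit padding: 8 * 4 = 32 bytes.
PARAMS_FORMAT = "<8I"


def pack_rotary_params(num_pairs, n_heads, seq, half_dim, batch):
    """Pack hypothetical payload fields followed by three zeroed _pad words."""
    return struct.pack(
        PARAMS_FORMAT, num_pairs, n_heads, seq, half_dim, batch, 0, 0, 0
    )


# batch=1, S=5, NH=8 (the "multi" shape) with an assumed half_dim of 4.
params_bytes = pack_rotary_params(
    num_pairs=1 * 5 * 8 * 4, n_heads=8, seq=5, half_dim=4, batch=1
)
params_size = struct.calcsize(PARAMS_FORMAT)
```

A host-side check like this plays the same role as the static_assert: if a field is added without adjusting the padding, the size no longer matches the shader-side struct.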

Nothing here blocks merge; items 1 and 4 are the most worth addressing.
Branch: gh/JulianCloudNTH/26/head


JulianCloudNTH · 2026-06-15T21:59:27Z

@claude review

claude · 2026-06-15T22:00:15Z

Claude finished @JulianCloudNTH's task in 1m 55s

Code Review: `et_vk.apply_rotary_emb` + ValueList (re-review)

Reviewed against origin/gh/JulianCloudNTH/26/base. This is a clean, well-guarded addition. Since the prior review, one of the flagged items has been resolved in code; the rest still stand as optional polish. Nothing blocks merge.

Resolved since last review

Null constant_data gap (was item 3). WebGPUGraph.cpp:322-326 now throws when constant_id >= 0 && tensor.nbytes > 0 but constant_data is null, closing the silent-uninitialized-buffer case. Combined with the out-of-range/missing-table throw at 317-321, the constant path is now fail-loud across all branches. 👍

Correctness — solid

Kernel indexing (rotary_embedding.wgsl): the pair → (b, s, head, pair_i) decomposition and base/freqs_idx = s*half_dim + pair_i arithmetic faithfully implement interleaved out[2i]=x[2i]·cos − x[2i+1]·sin, with freqs correctly shared across batch/head. The pair >= num_pairs guard handles workgroup-count rounding.
Validate-before-allocate: both compute_1d_workgroup_count calls (RotaryEmbedding.cpp:185-194) run before any shader/pipeline/buffer creation, so a throw can't leak GPU objects. wg_size is clamped to the device limit and used consistently for both the override constant and the WGC math.
GQA: per-dispatch uniform buffers carry distinct n_heads_q/n_heads_k, so the shared shader is correct for grouped-query attention.
WGSL hash verified: sha256(rotary_embedding.wgsl) = c60f1ce1… matches the wgsl-sha256 in the generated header — gen_wgsl_headers.py output is current and byte-identical.
Cleanup ordering: bgl is released (:272) only after both bind groups are built; bind groups retain it by refcount. Pipelines are owned by their dispatches and freed in the graph dtor (WebGPUGraph.cpp:188-195).
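The hash verification mentioned above can be reproduced with a few lines of Python. The header comment format matched by the regular expression and the sample shader text are assumptions for illustration; the real output of gen_wgsl_headers.py may embed the digest differently.

```python
import hashlib
import re


def wgsl_sha256(source: str) -> str:
    """Hex SHA-256 of the shader source, encoded as UTF-8."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def header_is_current(wgsl_source: str, header_text: str) -> bool:
    """True when the digest recorded in the header matches the shader source."""
    match = re.search(r"wgsl-sha256:\s*([0-9a-f]{64})", header_text)
    return match is not None and match.group(1) == wgsl_sha256(wgsl_source)


# Made-up shader and header text, used only to exercise the check.
shader = "@compute @workgroup_size(64) fn main() {}\n"
fresh_header = f"// wgsl-sha256: {wgsl_sha256(shader)}\n"
stale_header = f"// wgsl-sha256: {wgsl_sha256(shader + '// edited')}\n"

fresh_ok = header_is_current(shader, fresh_header)
stale_ok = header_is_current(shader, stale_header)
```

A check like this flags a stale generated header when the .wgsl file is edited without rerunning the generator.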

Remaining optional notes (carried over, non-blocking)

1. Double pipeline creation (RotaryEmbedding.cpp:239-242). Two identical pipelines (same module, layout, constants) are compiled solely to satisfy per-dispatch ownership in the destructor. Every other op routes through get_or_create_pipeline/pipeline_cache_, which dedupes and is freed separately (WebGPUGraph.cpp:201-204). Routing through the cache would let one cached pipeline back both dispatches without the redundant compile.
2. num_pairs derivation desync risk (RotaryEmbedding.cpp:61 vs 185-194). add_rope_dispatch recomputes num_pairs = numel_of(out.dims)/2, while the caller derives workgroup_count from xq_numel/2. These agree only because output shape == input shape (asserted at :170-180). Safe today; the existing comment at :60 documents the invariant.
3. Native-test coverage. The C++ kernel is exercised only at the multi shape (S=5). The decode shape (S=1, GQA 32:8), which is exactly where the s = t2 % seq / batch decomposition is most fragile, is covered only in Python (export + golden). Adding a native case with shape_name="decode" would close that gap; export_rope_model already takes shape_name.
4. get_value_list member-id types. out_list[0]/[1] aren't validated as tensor ids before get_tensor; a malformed id faults deterministically there, so acceptable. The size() != 2 guard handles the wrong-arity case (a non-ValueList id returns an empty vector).
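The invariant behind the num_pairs note can be illustrated with a short Python sketch. The name compute_1d_workgroup_count is borrowed from the PR, but the signature here is a stand-in; `plan_rope_dispatch`, the head_dim of 8, the workgroup size of 64, and the workgroup limit are assumptions.

```python
from math import prod


def compute_1d_workgroup_count(num_threads, wg_size, max_workgroups):
    """Ceil-divide and validate before anything is allocated (stand-in signature)."""
    count = (num_threads + wg_size - 1) // wg_size
    if count > max_workgroups:
        raise ValueError(f"{count} workgroups exceeds the limit {max_workgroups}")
    return count


def plan_rope_dispatch(in_dims, out_dims, wg_size, max_workgroups):
    """Return (workgroup_count, num_pairs) for one dispatch."""
    # The dispatch size comes from the input and the kernel guard from the
    # output; they only agree because the shapes are required to match.
    if list(in_dims) != list(out_dims):
        raise ValueError("output shape must equal input shape")
    caller_pairs = prod(in_dims) // 2
    guard_pairs = prod(out_dims) // 2
    count = compute_1d_workgroup_count(caller_pairs, wg_size, max_workgroups)
    return count, guard_pairs


# [batch, S, NH, head_dim] with S=5, NH=8 from the multi shape; head_dim=8 assumed.
plan = plan_rope_dispatch([1, 5, 8, 8], [1, 5, 8, 8], wg_size=64, max_workgroups=65535)
```

Here 160 pairs need 3 workgroups of 64 threads, so 32 threads in the last workgroup are idle and rely on the pair >= num_pairs guard. If the output shape ever differed from the input shape, the two derivations would diverge, which is why the shape check comes first.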

Summary: the resolved null-pointer throw is the meaningful change since last review. Item 1 (pipeline cache) is the only remaining suggestion with real upside; items 2-4 are documentation/coverage polish.
Branch: gh/JulianCloudNTH/26/head


Update

f2d1ae0

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 13, 2026 00:08

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 13, 2026

Update

5d188d6

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 15, 2026 17:23 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 15, 2026

Update

0b7f637

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.embedding_q4gsw test suite (export + native golden) #20289

Open

JulianCloudNTH temporarily deployed to cadence June 15, 2026 21:53 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.apply_rotary_emb test suite (export + native golden) #20290

Open

Update

9b09f89

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 15, 2026 22:25 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.prepack test suite (export + native golden) #20292

Open

JulianCloudNTH wants to merge 4 commits into gh/JulianCloudNTH/26/base from gh/JulianCloudNTH/26/head


2 participants
