Bug fixes for missing kv_cache_buffer by quic-rishinr · Pull Request #1082 · quic/efficient-transformers

quic-rishinr · 2026-06-15T13:56:34Z

Background

PR #1046 introduced the optional kv_cache_prefix parameter for
QEFFAutoModelForCausalLM, single-QPC VLMs, and dual-QPC VLMs. When set,
it injects an infix token into LLM KV-cache retained-state buffer names:

past_key.N_RetainedState  →  past_key.N_<prefix>_RetainedState

This gives vLLM a stable regex handle to select only LLM KV buffers for
disaggregated device-to-device transfer, leaving vision buffers
(vision_embeds_RetainedState, etc.) untouched.
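
The naming scheme and the regex selection it enables can be illustrated with a small sketch. The helper names `add_kv_prefix` and `select_llm_kv` are hypothetical and are not part of QEfficient or vLLM; only the buffer-name pattern is taken from the description above.

```python
import re

# Matches plain LLM KV retained-state names such as past_key.3_RetainedState.
_PLAIN_KV = re.compile(r"(past_(?:key|value)\.\d+)_RetainedState")


def add_kv_prefix(name: str, prefix: str) -> str:
    """Hypothetical helper: past_key.N_RetainedState -> past_key.N_<prefix>_RetainedState."""
    m = _PLAIN_KV.fullmatch(name)
    if m is None or not prefix:
        return name  # non-KV buffers (e.g. vision) and the no-prefix case are untouched
    return f"{m.group(1)}_{prefix}_RetainedState"


def select_llm_kv(names, prefix):
    """Hypothetical helper: pick only the prefixed LLM KV buffers via a regex."""
    pattern = re.compile(
        rf"past_(?:key|value)\.\d+_{re.escape(prefix)}_RetainedState"
    )
    return [n for n in names if pattern.fullmatch(n)]


buffers = [
    "past_key.0_RetainedState",
    "past_value.0_RetainedState",
    "vision_embeds_RetainedState",
]
prefixed = [add_kv_prefix(n, "vllmKvCache") for n in buffers]
selected = select_llm_kv(prefixed, "vllmKvCache")
```

Here `selected` contains only the two `past_key`/`past_value` buffers; `vision_embeds_RetainedState` keeps its original name and is not matched.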

The feature previously worked only on non-layerwise paths. This PR extends it to the
layerwise E-P-D path used by large models (e.g. Qwen3.5-122B-A10B).

Bugs Fixed

Bug 1 — `kv_cache_prefix` dropped in layerwise `compile()` and `export()`

File: QEfficient/transformers/models/modeling_auto.py

_QEffAutoModelForImageTextToTextDualQPC.compile() and .export() both
accept kv_cache_prefix as a named parameter, but neither forwarded it
when taking the layerwise=True early-return branch:

compile() → _run_layerwise_compile(...): kv_cache_prefix missing
compile() → vision_wrapper.compile(...) (vision-only branch): kv_cache_prefix missing
export() → _run_layerwise_export(...): kv_cache_prefix missing (it is an explicit named parameter, so it is not captured by **kwargs)

Fix: Added kv_cache_prefix=kv_cache_prefix to all three call sites.
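
The failure mode is a general Python pitfall: a named parameter is bound by the signature and never appears in `**kwargs`, so an early-return branch that forwards only `**kwargs` drops it. A minimal sketch follows, using a hypothetical stand-in class rather than the real `_QEffAutoModelForImageTextToTextDualQPC`; the method bodies are placeholders that only report what they received.

```python
class DualQPCSketch:
    """Hypothetical stand-in used only to illustrate the forwarding bug."""

    def _run_layerwise_compile(self, *, kv_cache_prefix=None, **kwargs):
        # Placeholder: report which prefix reached the layerwise path.
        return {"path": "layerwise", "kv_cache_prefix": kv_cache_prefix}

    def _run_regular_compile(self, *, kv_cache_prefix=None, **kwargs):
        return {"path": "regular", "kv_cache_prefix": kv_cache_prefix}

    def compile_buggy(self, layerwise=False, kv_cache_prefix=None, **kwargs):
        if layerwise:
            # Bug: kv_cache_prefix is a named parameter, so it is NOT in kwargs.
            return self._run_layerwise_compile(**kwargs)
        return self._run_regular_compile(kv_cache_prefix=kv_cache_prefix, **kwargs)

    def compile_fixed(self, layerwise=False, kv_cache_prefix=None, **kwargs):
        if layerwise:
            # Fix: forward the named parameter explicitly.
            return self._run_layerwise_compile(
                kv_cache_prefix=kv_cache_prefix, **kwargs
            )
        return self._run_regular_compile(kv_cache_prefix=kv_cache_prefix, **kwargs)


model = DualQPCSketch()
buggy = model.compile_buggy(layerwise=True, kv_cache_prefix="vllmKvCache")
fixed = model.compile_fixed(layerwise=True, kv_cache_prefix="vllmKvCache")
```

In the buggy variant the layerwise path sees `kv_cache_prefix=None`; in the fixed variant it sees `"vllmKvCache"`.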

Bug 2 — `_export_layerwise` rebuilt output names from scratch

File: QEfficient/base/modeling_qeff.py

_export_layerwise constructed its own output_name list using hardcoded
templates:

output_name.append(f"past_key.{layer_idx}_RetainedState")
output_name.append(f"past_value.{layer_idx}_RetainedState")

It never consulted the output_names parameter passed in by the caller,
which already carried the prefixed names from apply_kv_cache_prefix.
The per-window ONNXes were therefore exported with plain names, and the
stitched merged ONNX had no prefix, so kv_cache_prefix was silently a
no-op for all layerwise exports.

Fix: _export_layerwise now builds a _caller_retained_map from the incoming
output_names. For each KV retained-state name it extracts the plain stem
(e.g. past_key.3_vllmKvCache_RetainedState → plain stem past_key.3) and
registers the prefixed name, with the matching suffix, under both the
_RetainedState and _InternalRetainedState keys. When each KV buffer is
appended to output_name, the map is consulted first and the plain default
is the fallback, which preserves existing behaviour when no prefix is set.
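
One way to read the map described above is sketched below. This is an interpretation under stated assumptions, not the code from the PR: the function names `build_caller_retained_map` and `kv_output_name`, the regex, and the choice of the plain default name as the lookup key are all assumptions made for illustration.

```python
import re

# stem = past_key.N / past_value.N, optional infix, then either retained-state suffix.
_KV_RETAINED = re.compile(
    r"^(past_(?:key|value)\.\d+)(?:_(.+?))?_(?:Internal)?RetainedState$"
)
_SUFFIXES = ("_RetainedState", "_InternalRetainedState")


def build_caller_retained_map(output_names):
    """Map plain default names to the caller's prefixed names (assumed layout)."""
    mapping = {}
    for name in output_names:
        m = _KV_RETAINED.match(name)
        if m is None or m.group(2) is None:
            continue  # not a KV buffer, or no prefix: the plain default applies
        stem, infix = m.group(1), m.group(2)
        for suffix in _SUFFIXES:
            mapping[f"{stem}{suffix}"] = f"{stem}_{infix}{suffix}"
    return mapping


def kv_output_name(mapping, kind, layer_idx, suffix="_RetainedState"):
    """Consult the map first; fall back to the plain hardcoded template."""
    default = f"{kind}.{layer_idx}{suffix}"
    return mapping.get(default, default)


caller_map = build_caller_retained_map(
    ["logits", "past_key.3_vllmKvCache_RetainedState"]
)
resolved = kv_output_name(caller_map, "past_key", 3)
fallback = kv_output_name({}, "past_value", 5)
```

With a prefix, `resolved` is the prefixed name; with an empty map, `fallback` is the plain `past_value.5_RetainedState`, matching the no-prefix behaviour.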

Bug 3 — KV input names not aligned in `_export_layerwise`

File: QEfficient/base/modeling_qeff.py

The regular _export path calls align_kv_input_names_to_retained_outputs
to rename KV input buffers to match the prefixed outputs:

past_key.3  →  past_key.3_vllmKvCache

The AIC compiler pairs an output X_RetainedState to the input named X;
without this rename the compiler cannot establish KV retention for prefixed
buffers. _export_layerwise skipped this alignment entirely.

Fix: Added the same align_kv_input_names_to_retained_outputs call
(with dynamic-axes propagation) immediately before torch.onnx.export in
_export_layerwise.
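
The alignment step can be sketched as follows. The real signature of align_kv_input_names_to_retained_outputs is not shown in this PR description, so `align_kv_inputs_sketch` below is a hypothetical stand-in that assumes plain lists of names and a `dynamic_axes` dict keyed by input name.

```python
import re

_STEM = re.compile(r"^(past_(?:key|value)\.\d+)")
_SUFFIX = "_RetainedState"


def align_kv_inputs_sketch(input_names, output_names, dynamic_axes):
    """Rename KV inputs so that output X_RetainedState pairs with input X."""
    rename = {}
    for out in output_names:
        m = _STEM.match(out)
        if m and out.endswith(_SUFFIX):
            # past_key.3_vllmKvCache_RetainedState -> past_key.3_vllmKvCache
            rename[m.group(1)] = out[: -len(_SUFFIX)]
    new_inputs = [rename.get(name, name) for name in input_names]
    # Propagate the rename to dynamic axes so their keys still match the inputs.
    new_axes = {rename.get(k, k): v for k, v in dynamic_axes.items()}
    return new_inputs, new_axes


inputs, axes = align_kv_inputs_sketch(
    ["input_ids", "past_key.3", "past_value.3"],
    [
        "logits",
        "past_key.3_vllmKvCache_RetainedState",
        "past_value.3_vllmKvCache_RetainedState",
    ],
    {"input_ids": {0: "batch_size"}, "past_key.3": {0: "batch_size"}},
)
```

When the outputs carry no prefix, each stem maps to itself and both the input names and the dynamic axes come back unchanged.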

Files Changed

| File | Change |
| --- | --- |
| `QEfficient/transformers/models/modeling_auto.py` | Forward `kv_cache_prefix` in all layerwise `compile()` and `export()` branches |
| `QEfficient/base/modeling_qeff.py` | Respect caller's prefixed `output_names` and align input names in `_export_layerwise` |
| `examples/image_text_to_text/models/qwen3_5_moe/qwen3_5_122b_epd_disagg.py` | New end-to-end E-P-D disaggregated inference example for Qwen3.5-122B-A10B |

Example Script Notes (`qwen3_5_122b_epd_disagg.py`)

The script makes three separate compile calls: vision encoder, language prefill, and language decode.
kv_infix is computed once from KV_CACHE_PREFIX; _update_retained_states uses it to feed the correct buffer names back as inputs for the next step.
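
The feedback loop described in these notes might look roughly like the sketch below. It is not the code from the example script: the function name `update_retained_states_sketch`, the `"_<prefix>"` form of `kv_infix`, and the dict-of-buffers interface are assumptions, and the string values stand in for real device buffers.

```python
KV_CACHE_PREFIX = "vllmKvCache"  # set to None or "" to disable the prefix

# Assumption: the infix is "_<prefix>" when a prefix is set, otherwise empty.
kv_infix = f"_{KV_CACHE_PREFIX}" if KV_CACHE_PREFIX else ""


def update_retained_states_sketch(outputs, kv_infix, num_layers):
    """Feed each X_RetainedState output back as input X for the next step."""
    next_inputs = {}
    for layer_idx in range(num_layers):
        for kind in ("past_key", "past_value"):
            input_name = f"{kind}.{layer_idx}{kv_infix}"
            next_inputs[input_name] = outputs[f"{input_name}_RetainedState"]
    return next_inputs


step_outputs = {
    "logits": "logits-buffer",
    "past_key.0_vllmKvCache_RetainedState": "k0",
    "past_value.0_vllmKvCache_RetainedState": "v0",
}
next_inputs = update_retained_states_sketch(step_outputs, kv_infix, num_layers=1)
```

Computing `kv_infix` once keeps the prefixed and unprefixed cases on the same code path: an empty infix yields the plain `past_key.N` names.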

Runtime Behaviour

No prefix set: all paths are no-ops; buffer names unchanged
Prefix set, non-layerwise: unchanged from PR #1046 ("feat(0506): Add optional KV-cache buffer-name prefix for vLLM disaggr…")
Prefix set, layerwise: per-window ONNXes are now exported with prefixed output names and aligned input names; the stitched merged ONNX carries them through to the compiled QPC

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>

quic-sanising · 2026-06-15T20:05:28Z

LGTM!

Fix for prefix caching using layerwise

0b0cad6

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>

quic-rishinr requested a review from vbaddi June 15, 2026 14:21

vbaddi approved these changes Jun 15, 2026

extended fix conv_state, recurrent_state and QEFFAutoModelForCausalLM

a42495e

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>

quic-rishinr mentioned this pull request Jun 15, 2026

Bug fixes for missing kv_cache_buffer #1077

Closed

quic-rishinr merged commit b1856f7 into main Jun 16, 2026
6 of 8 checks passed

quic-rishinr deleted the bug_fix_kv_cache_buffer-mainline branch June 16, 2026 04:59
