Bug fixes for missing kv_cache_buffer #1082

Merged
quic-rishinr merged 2 commits into main from bug_fix_kv_cache_buffer-mainline on Jun 16, 2026


@quic-rishinr


Background

PR #1046 introduced the optional kv_cache_prefix parameter for
QEFFAutoModelForCausalLM, single-QPC VLMs, and dual-QPC VLMs. When set,
it injects an infix token into LLM KV-cache retained-state buffer names:

past_key.N_RetainedState  →  past_key.N_<prefix>_RetainedState

This gives vLLM a stable regex handle to select only LLM KV buffers for
disaggregated device-to-device transfer, leaving vision buffers
(vision_embeds_RetainedState, etc.) untouched.
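
As a rough illustration of the naming scheme above, the sketch below applies an infix to retained-state names and then selects only the LLM KV buffers with a regex. The helper names `add_kv_prefix` and `kv_buffer_pattern` are hypothetical and not part of QEfficient or vLLM, and the prefix value `vllmKvCache` is taken from the examples later in this description.

```python
import re

RETAINED_SUFFIX = "_RetainedState"


def add_kv_prefix(name: str, prefix: str) -> str:
    """Hypothetical helper: insert `_<prefix>` before the _RetainedState suffix
    of an LLM KV buffer name; any other name is returned unchanged."""
    is_kv = name.startswith(("past_key.", "past_value."))
    if not (is_kv and name.endswith(RETAINED_SUFFIX)):
        return name
    stem = name[: -len(RETAINED_SUFFIX)]
    return f"{stem}_{prefix}{RETAINED_SUFFIX}"


def kv_buffer_pattern(prefix: str) -> re.Pattern:
    """Hypothetical helper: regex that matches only prefixed LLM KV buffers."""
    return re.compile(
        rf"^past_(?:key|value)\.\d+_{re.escape(prefix)}_RetainedState$"
    )


names = [
    "past_key.0_RetainedState",
    "past_value.0_RetainedState",
    "vision_embeds_RetainedState",
]
prefixed = [add_kv_prefix(n, "vllmKvCache") for n in names]
pattern = kv_buffer_pattern("vllmKvCache")

# Only the two LLM KV buffers match; the vision buffer is left out.
llm_kv = [n for n in prefixed if pattern.match(n)]
```

The vision buffer keeps its original name, so a consumer filtering with the pattern never sees it.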

The feature worked for non-layerwise paths. This work extends it to the
layerwise E-P-D path used by large models (e.g. Qwen3.5-122B-A10B).


Bugs Fixed

Bug 1 — kv_cache_prefix dropped in layerwise compile() and export()

File: QEfficient/transformers/models/modeling_auto.py

_QEffAutoModelForImageTextToTextDualQPC.compile() and .export() both
accept kv_cache_prefix as a named parameter, but neither forwarded it
when taking the layerwise=True early-return branch:

  • compile() → _run_layerwise_compile(...): missing
  • compile() → vision_wrapper.compile(...) (vision-only branch): missing
  • export() → _run_layerwise_export(...): missing (explicit named parameter,
    not captured by **kwargs)

Fix: Added kv_cache_prefix=kv_cache_prefix to all three call sites.
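
The failure mode is easy to reproduce in isolation. The toy class below is a hypothetical stand-in, not the real _QEffAutoModelForImageTextToTextDualQPC; it only shows why a named parameter is lost when an early-return branch forwards **kwargs alone.

```python
class ToyDualQPC:
    """Hypothetical stand-in used only to illustrate the forwarding bug."""

    def _run_layerwise_export(self, kv_cache_prefix=None, **kwargs):
        # Report what the layerwise helper actually received.
        return {"path": "layerwise", "kv_cache_prefix": kv_cache_prefix, **kwargs}

    def export_buggy(self, layerwise=False, kv_cache_prefix=None, **kwargs):
        if layerwise:
            # kv_cache_prefix is bound to the named parameter, so it is NOT
            # inside **kwargs and is silently dropped here.
            return self._run_layerwise_export(**kwargs)
        return {"path": "regular", "kv_cache_prefix": kv_cache_prefix, **kwargs}

    def export_fixed(self, layerwise=False, kv_cache_prefix=None, **kwargs):
        if layerwise:
            # Forward the named parameter explicitly.
            return self._run_layerwise_export(
                kv_cache_prefix=kv_cache_prefix, **kwargs
            )
        return {"path": "regular", "kv_cache_prefix": kv_cache_prefix, **kwargs}


model = ToyDualQPC()
buggy = model.export_buggy(layerwise=True, kv_cache_prefix="vllmKvCache")
fixed = model.export_fixed(layerwise=True, kv_cache_prefix="vllmKvCache")
```

In the buggy variant the helper sees kv_cache_prefix=None without any error being raised, which matches the silent behaviour described above.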


Bug 2 — _export_layerwise rebuilt output names from scratch

File: QEfficient/base/modeling_qeff.py

_export_layerwise constructed its own output_name list using hardcoded
templates:

output_name.append(f"past_key.{layer_idx}_RetainedState")
output_name.append(f"past_value.{layer_idx}_RetainedState")

It never consulted the output_names parameter passed in by the caller,
which already carried the prefixed names from apply_kv_cache_prefix.
The per-window ONNXes were therefore exported with plain names, and the
stitched merged ONNX had no prefix — making kv_cache_prefix silently a
no-op for all layerwise exports.

Fix: Built a _caller_retained_map from the incoming output_names:
for each KV retained-state name it finds the plain stem (e.g.
past_key.3_vllmKvCache_RetainedState → plain stem past_key.3),
then registers the prefixed+suffix-corrected form under both
_RetainedState and _InternalRetainedState keys. When appending each
KV buffer to output_name, the map is consulted first; the plain default
is used as fallback (preserving existing behaviour when no prefix is set).
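
The lookup-with-fallback idea can be sketched as follows. This is one reading of the description, not the code in modeling_qeff.py: the function names `build_caller_retained_map` and `layer_output_names` are hypothetical, and the map is assumed to be keyed by the plain (unprefixed) retained-state name.

```python
import re

# Matches past_key.N / past_value.N, an optional infix, and either suffix.
_KV_RETAINED = re.compile(
    r"^(past_(?:key|value)\.\d+)(_.+)?_(RetainedState|InternalRetainedState)$"
)


def build_caller_retained_map(output_names):
    """Hypothetical sketch: map plain retained-state names to the caller's
    (possibly prefixed) names, for both suffix variants."""
    mapping = {}
    for name in output_names:
        match = _KV_RETAINED.match(name)
        if match is None:
            continue  # not a KV retained-state buffer (e.g. logits)
        stem, infix = match.group(1), match.group(2) or ""
        for suffix in ("_RetainedState", "_InternalRetainedState"):
            mapping[f"{stem}{suffix}"] = f"{stem}{infix}{suffix}"
    return mapping


def layer_output_names(num_layers, caller_map):
    """Consult the map first; fall back to the plain default name."""
    names = []
    for layer_idx in range(num_layers):
        for kind in ("past_key", "past_value"):
            default = f"{kind}.{layer_idx}_RetainedState"
            names.append(caller_map.get(default, default))
    return names


caller_names = [
    "logits",
    "past_key.0_vllmKvCache_RetainedState",
    "past_value.0_vllmKvCache_RetainedState",
]
caller_map = build_caller_retained_map(caller_names)
with_prefix = layer_output_names(1, caller_map)
without_prefix = layer_output_names(1, {})
```

With an empty map the function produces the same plain names as before the fix, which is the backward-compatible path.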


Bug 3 — KV input names not aligned in _export_layerwise

File: QEfficient/base/modeling_qeff.py

The regular _export path calls align_kv_input_names_to_retained_outputs
to rename KV input buffers to match the prefixed outputs:

past_key.3  →  past_key.3_vllmKvCache

The AIC compiler pairs an output X_RetainedState to the input named X;
without this rename the compiler cannot establish KV retention for prefixed
buffers. _export_layerwise skipped this alignment entirely.

Fix: Added the same align_kv_input_names_to_retained_outputs call
(with dynamic-axes propagation) immediately before torch.onnx.export in
_export_layerwise.
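
A minimal sketch of the alignment step, assuming the input names, output names, and dynamic axes are plain Python lists and dicts as they would be passed to torch.onnx.export. The function `align_kv_inputs` is a hypothetical illustration and not the real align_kv_input_names_to_retained_outputs.

```python
SUFFIX = "_RetainedState"


def align_kv_inputs(input_names, output_names, dynamic_axes):
    """Hypothetical sketch: rename each KV input so that it equals its
    retained output name minus the _RetainedState suffix."""
    targets = {}
    for out in output_names:
        if out.startswith(("past_key.", "past_value.")) and out.endswith(SUFFIX):
            aligned = out[: -len(SUFFIX)]          # e.g. past_key.3_vllmKvCache
            kind, _, rest = aligned.partition(".")
            layer_idx = rest.partition("_")[0]     # e.g. "3"
            targets[f"{kind}.{layer_idx}"] = aligned
    new_inputs = [targets.get(name, name) for name in input_names]
    # Carry the dynamic-axes entries over to the renamed inputs.
    new_axes = {targets.get(key, key): axes for key, axes in dynamic_axes.items()}
    return new_inputs, new_axes


inputs = ["input_ids", "past_key.3", "past_value.3"]
outputs = [
    "logits",
    "past_key.3_vllmKvCache_RetainedState",
    "past_value.3_vllmKvCache_RetainedState",
]
axes = {"input_ids": {0: "batch"}, "past_key.3": {0: "batch", 2: "ctx_len"}}

aligned_inputs, aligned_axes = align_kv_inputs(inputs, outputs, axes)
```

After the rename, every output X_RetainedState has an input named X, which is the pairing the compiler relies on.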


Files Changed

| File | Change |
| --- | --- |
| QEfficient/transformers/models/modeling_auto.py | Forward kv_cache_prefix in all layerwise compile() and export() branches |
| QEfficient/base/modeling_qeff.py | Respect caller's prefixed output_names and align input names in _export_layerwise |
| examples/image_text_to_text/models/qwen3_5_moe/qwen3_5_122b_epd_disagg.py | New end-to-end E-P-D disaggregated inference example for Qwen3.5-122B-A10B |

Example Script Notes (qwen3_5_122b_epd_disagg.py)

  • Three separate compile calls: vision encoder, lang prefill, lang decode
  • kv_infix computed once from KV_CACHE_PREFIX; _update_retained_states
    uses it to feed the correct buffer names back as inputs for the next step
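
The feedback loop in the second bullet might look roughly like the sketch below. It is not the code from the example script: the function bodies, the dict-based outputs, and the assumption that the infix is `_<prefix>` (empty when no prefix is set) are all illustrative.

```python
def kv_infix_from_prefix(prefix):
    """Assumed convention: the infix is `_<prefix>`, or empty with no prefix."""
    return f"_{prefix}" if prefix else ""


def update_retained_states(outputs, kv_infix, num_layers):
    """Hypothetical sketch: turn this step's retained-state outputs into the
    KV inputs for the next step by dropping the _RetainedState suffix."""
    next_inputs = {}
    for layer_idx in range(num_layers):
        for kind in ("past_key", "past_value"):
            base = f"{kind}.{layer_idx}{kv_infix}"
            next_inputs[base] = outputs[f"{base}_RetainedState"]
    return next_inputs


KV_CACHE_PREFIX = "vllmKvCache"
kv_infix = kv_infix_from_prefix(KV_CACHE_PREFIX)  # computed once

# Placeholder strings stand in for real device buffers.
step_outputs = {
    "logits": "logits-buffer",
    "past_key.0_vllmKvCache_RetainedState": "k0",
    "past_value.0_vllmKvCache_RetainedState": "v0",
}
next_step_inputs = update_retained_states(step_outputs, kv_infix, num_layers=1)
```

Computing the infix once keeps the input and output names consistent across the prefill and decode steps.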

Runtime Behaviour

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
@quic-rishinr quic-rishinr requested a review from vbaddi June 15, 2026 14:21
@quic-sanising


LGTM!

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
@quic-rishinr quic-rishinr merged commit b1856f7 into main Jun 16, 2026
6 of 8 checks passed
@quic-rishinr quic-rishinr deleted the bug_fix_kv_cache_buffer-mainline branch June 16, 2026 04:59