Bug fixes for missing kv_cache_buffer #1082
Merged
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
vbaddi approved these changes on Jun 15, 2026: LGTM!
Background
PR #1046 introduced the optional `kv_cache_prefix` parameter for `QEFFAutoModelForCausalLM`, single-QPC VLMs, and dual-QPC VLMs. When set, it injects an infix token into LLM KV-cache retained-state buffer names, so that a buffer with the plain stem `past_key.3` is exported as, for example, `past_key.3_vllmKvCache_RetainedState`.

This gives vLLM a stable regex handle to select only LLM KV buffers for disaggregated device-to-device transfer, leaving vision buffers (`vision_embeds_RetainedState`, etc.) untouched.

The feature worked for non-layerwise paths. This work extends it to the layerwise E-P-D path used by large models (e.g. Qwen3.5-122B-A10B).
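
To illustrate the "regex handle" idea, here is a minimal sketch of how a consumer might pick out only the prefixed LLM KV buffers. The infix `_vllmKvCache` is taken from the example buffer name quoted in the Bug 2 fix; the `past_value` stem, the pattern itself, and the buffer list are assumptions made for illustration and are not vLLM's actual selection logic.

```python
import re

# Assumption: infix "_vllmKvCache" comes from the example name in this PR;
# "past_value" stems and this exact pattern are illustrative, not vLLM's code.
KV_PATTERN = re.compile(r"^past_(?:key|value)\.\d+_vllmKvCache_RetainedState$")

# Hypothetical set of retained-state buffers exposed by a compiled model.
buffer_names = [
    "past_key.0_vllmKvCache_RetainedState",
    "past_value.0_vllmKvCache_RetainedState",
    "vision_embeds_RetainedState",
]

# Only the LLM KV buffers match; the vision buffer is left out.
llm_kv_buffers = [name for name in buffer_names if KV_PATTERN.match(name)]
```

Because the infix sits inside the buffer name, the vision buffer fails the match without needing a separate exclusion list.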
Bugs Fixed
Bug 1: `kv_cache_prefix` dropped in layerwise `compile()` and `export()`

File: `QEfficient/transformers/models/modeling_auto.py`

`_QEffAutoModelForImageTextToTextDualQPC.compile()` and `.export()` both accept `kv_cache_prefix` as a named parameter, but neither forwarded it when taking the `layerwise=True` early-return branch:

- `compile()` → `_run_layerwise_compile(...)`: missing
- `compile()` → `vision_wrapper.compile(...)` (vision-only branch): missing
- `export()` → `_run_layerwise_export(...)`: missing (explicit named param, not captured by `**kwargs`)

Fix: Added `kv_cache_prefix=kv_cache_prefix` to all three call sites.

Bug 2: `_export_layerwise` rebuilt output names from scratch

File:
`QEfficient/base/modeling_qeff.py`

`_export_layerwise` constructed its own `output_name` list using hardcoded templates. It never consulted the `output_names` parameter passed in by the caller, which already carried the prefixed names from `apply_kv_cache_prefix`.

The per-window ONNXes were therefore exported with plain names, and the stitched merged ONNX had no prefix, making `kv_cache_prefix` silently a no-op for all layerwise exports.

Fix: Built a `_caller_retained_map` from the incoming `output_names`: for each KV retained-state name it finds the plain stem (e.g. `past_key.3_vllmKvCache_RetainedState` → plain stem `past_key.3`), then registers the prefixed, suffix-corrected form under both the `_RetainedState` and `_InternalRetainedState` keys. When appending each KV buffer to `output_name`, the map is consulted first; the plain default is used as fallback (preserving existing behaviour when no prefix is set).
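
The lookup-with-fallback behaviour described in the Bug 2 fix can be sketched roughly as follows. This is not the code from `modeling_qeff.py`: the helper names `build_caller_retained_map` and `resolve_output_name`, the stem regex, and the `past_value` stem are assumptions made for illustration.

```python
import re

# Check the longer suffix first so it is never mistaken for the shorter one.
RETAINED_SUFFIXES = ("_InternalRetainedState", "_RetainedState")

# Assumption: KV stems look like "past_key.<layer>" or "past_value.<layer>";
# whatever follows the stem is treated as the caller's infix.
KV_STEM = re.compile(r"^(past_(?:key|value)\.\d+)(.*)$")


def build_caller_retained_map(output_names):
    """Map plain retained-state names to the caller's (possibly prefixed) names."""
    mapping = {}
    for name in output_names:
        suffix = next((s for s in RETAINED_SUFFIXES if name.endswith(s)), None)
        if suffix is None:
            continue  # not a retained-state output
        match = KV_STEM.match(name[: -len(suffix)])
        if match is None:
            continue  # retained state, but not a KV buffer (e.g. vision_embeds)
        stem, infix = match.group(1), match.group(2)
        # Register the prefixed form under both plain suffix variants.
        for s in RETAINED_SUFFIXES:
            mapping[stem + s] = stem + infix + s
    return mapping


def resolve_output_name(plain_name, mapping):
    """Consult the map first; fall back to the plain default name."""
    return mapping.get(plain_name, plain_name)


caller_map = build_caller_retained_map(
    ["past_key.3_vllmKvCache_RetainedState", "vision_embeds_RetainedState"]
)
```

With no prefix set, the infix is empty and every name maps to itself, which matches the "preserving existing behaviour" note above.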
Bug 3: KV input names not aligned in `_export_layerwise`

File: `QEfficient/base/modeling_qeff.py`

The regular `_export` path calls `align_kv_input_names_to_retained_outputs` to rename KV input buffers to match the prefixed outputs.

The AIC compiler pairs an output `X_RetainedState` to the input named `X`; without this rename the compiler cannot establish KV retention for prefixed buffers. `_export_layerwise` skipped this alignment entirely.

Fix: Added the same `align_kv_input_names_to_retained_outputs` call (with dynamic-axes propagation) immediately before `torch.onnx.export` in `_export_layerwise`.

Files Changed
- `QEfficient/transformers/models/modeling_auto.py`: `kv_cache_prefix` in all layerwise `compile()` and `export()` branches
- `QEfficient/base/modeling_qeff.py`: `output_names` and align input names in `_export_layerwise`
- `examples/image_text_to_text/models/qwen3_5_moe/qwen3_5_122b_epd_disagg.py`

Example Script Notes (`qwen3_5_122b_epd_disagg.py`)

- `kv_infix` computed once from `KV_CACHE_PREFIX`; `_update_retained_states` uses it to feed the correct buffer names back as inputs for the next step
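
As a rough illustration of that feedback step, the sketch below applies the pairing rule from Bug 3 (output `X_RetainedState` pairs with input `X`) to derive the next step's input names. It is not the example script's `_update_retained_states`: the standalone `update_retained_states` function, the `"vllmKvCache"` value for `KV_CACHE_PREFIX`, the `"_" + prefix` form of `kv_infix`, and the `past_` name check are all assumptions.

```python
# Assumed value; the example script defines its own KV_CACHE_PREFIX.
KV_CACHE_PREFIX = "vllmKvCache"

# Computed once: an empty infix reproduces the plain, unprefixed names.
kv_infix = f"_{KV_CACHE_PREFIX}" if KV_CACHE_PREFIX else ""

RETAINED_SUFFIX = "_RetainedState"


def update_retained_states(inputs, outputs):
    """Return a copy of `inputs` with KV retained-state outputs fed back by name."""
    next_inputs = dict(inputs)
    kv_tail = kv_infix + RETAINED_SUFFIX
    for out_name, value in outputs.items():
        # Assumption: LLM KV buffers start with "past_"; vision buffers do not.
        if out_name.startswith("past_") and out_name.endswith(kv_tail):
            # Output "X_RetainedState" pairs with the input named "X".
            in_name = out_name[: -len(RETAINED_SUFFIX)]
            next_inputs[in_name] = value
    return next_inputs


step_outputs = {
    "past_key.0_vllmKvCache_RetainedState": "k0",
    "vision_embeds_RetainedState": "vision",
    "logits": "logits",
}
next_step_inputs = update_retained_states({"input_ids": 1}, step_outputs)
```

Only the prefixed KV output is carried over, under the input name `past_key.0_vllmKvCache`; the vision buffer and `logits` are left out.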
Runtime Behaviour