[Feature]Add output fallback support for OpenAI serving#7942
Pull request overview
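This PR adds an output fallback framework to the OpenAI-compatible serving layer. It post-processes model output on both the streaming and non-streaming paths (fixing the colon in Markdown bold text, repairing Markdown tables, and detecting repeated output and truncating it), and supports custom extensions through strategy registration plus a plugin mechanism.
Changes:
- Adds the fastdeploy/output/fallback/ subpackage: defines the OutputFallbackStrategy base class, OutputFallbackContext, StreamFallbackDecision, and OutputFallbackManager, with three built-in strategies: markdown-bold-colon, markdown-table, and repeat-truncate.
- Wires three startup arguments, --output-fallback, --output-fallback-plugin, and --output-fallback-config, into EngineArgs / api_server, and injects the manager into the v0 / v1 chat and completion serving classes.
- Calls the manager's apply / on_delta / on_finish / cleanup in the streaming and non-streaming flows; when repeat-truncate fires, sets finish_reason to repeat_truncate and aborts the corresponding choice.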
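
To make the streaming hooks more concrete, here is a minimal sketch of how a hold/flush strategy could sit behind `on_delta` / `on_finish`. The `Decision` dataclass, the `HoldTrailingStars` strategy, and the `run_stream` driver are hypothetical stand-ins written for illustration; the real `StreamFallbackDecision` and `OutputFallbackManager` in this PR may use different fields and signatures (the real hooks also take a context argument, omitted here).

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical stand-in for StreamFallbackDecision."""
    action: str  # one of "send", "hold", "flush", "truncate"
    text: str = ""


class HoldTrailingStars:
    """Illustrative strategy: hold back a trailing run of '*' until more text
    arrives, so a Markdown marker split across deltas can be seen whole."""

    def on_delta(self, delta_text: str, state: dict) -> Decision:
        text = state.pop("held", "") + delta_text
        stripped = text.rstrip("*")
        held = text[len(stripped):]
        if held:
            state["held"] = held  # keep the trailing stars for the next round
        if not stripped:
            return Decision("hold")
        return Decision("send", stripped)

    def on_finish(self, state: dict) -> Decision:
        # Emit whatever is still buffered when the stream ends.
        return Decision("flush", state.pop("held", ""))


def run_stream(strategy, deltas):
    """Drive one strategy over a list of deltas; returns (chunks, truncated)."""
    state, out = {}, []
    for delta in deltas:
        decision = strategy.on_delta(delta, state)
        if decision.action in ("send", "truncate"):
            out.append(decision.text)
        if decision.action == "truncate":
            return out, True  # stop consuming further deltas
    final = strategy.on_finish(state)
    if final.text:
        out.append(final.text)
    return out, False
```

In this sketch a `hold` emits nothing for the current round, and the buffered text is either merged into a later `send` or released by the final `flush`, so the concatenated output equals the concatenated input.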
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/output/fallback/__init__.py | Exposes the public classes and imports the three built-in strategies to trigger registration |
| fastdeploy/output/fallback/base.py | Defines the fallback context / decision / abstract base class |
| fastdeploy/output/fallback/manager.py | Registry / plugin loading / apply / on_delta / on_finish / cleanup |
| fastdeploy/output/fallback/markdown_bold_colon.py | Fixes the colon position in **xxx:**, with buffering across deltas |
| fastdeploy/output/fallback/markdown_table.py | Repairs Markdown table separator rows / inconsistent column counts |
| fastdeploy/output/fallback/repeat_truncate.py | Detects repeated output over a token window and triggers truncate |
| fastdeploy/engine/args_utils.py | Adds 3 new CLI arguments |
| fastdeploy/entrypoints/openai/api_server.py | Parses the arguments, builds the manager, and injects it into each handler; /config-info exposes the corresponding fields |
| fastdeploy/entrypoints/openai/serving_chat.py | Wires fallback into the v0 chat streaming / non-streaming paths, including the repeat_truncate finish_reason |
| fastdeploy/entrypoints/openai/serving_completion.py | Wires fallback into the v0 completion streaming / non-streaming paths |
| fastdeploy/entrypoints/openai/v1/serving_base.py | Base-class constructor accepts the manager and cleans up state in finally |
| fastdeploy/entrypoints/openai/v1/serving_chat.py | Wires fallback into v1 chat (non-multimodal path) |
| fastdeploy/entrypoints/openai/v1/serving_completion.py | Wires fallback into v1 completion |
| tests/output/test_fallback.py | Covers the manager, built-in strategies, streaming hold/flush/truncate, cleanup, and plugin import |
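
The table describes repeat_truncate.py as detecting repetition over a token window. One simple way such a check could work is to look for a short block of token ids repeated back to back at the end of the sequence. The function name `has_repeated_tail` and the `max_period` / `min_repeats` parameters are assumptions made for illustration; this is not the PR's actual algorithm or its configuration keys.

```python
from typing import Sequence


def has_repeated_tail(tokens: Sequence[int], max_period: int = 8, min_repeats: int = 4) -> bool:
    """Return True if the sequence ends with some block of 1..max_period tokens
    repeated at least min_repeats times in a row."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # longer periods need even more tokens
        tail = tokens[n - needed:]
        block = tail[:period]
        # Every position in the tail must match the block at the same offset.
        if all(tail[i] == block[i % period] for i in range(needed)):
            return True
    return False
```

A strategy built on a check like this would call it on each new delta and return a truncate decision once it fires, which is the point at which the PR sets finish_reason to repeat_truncate.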

    choice_completion_tokens = response_ctx.choice_completion_tokens_dict[output.index]
    choice.finish_reason = self._calc_finish_reason(request_output, max_tokens, choice_completion_tokens)
    if fallback_truncated:
        choice.finish_reason = "repeat_truncate"

    if res.get("error_msg") is not None and "Aborted" in res["error_msg"]:
        choices[-1].finish_reason = "abort"
    if fallback_truncated:
        choices[-1].finish_reason = "repeat_truncate"

    choice.finish_reason = "abort"

    if fallback_truncated:
        choice.finish_reason = "repeat_truncate"

    if fallback_truncated:
        choice.finish_reason = "repeat_truncate"

Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7942 +/- ##
==========================================
Coverage ? 67.56%
==========================================
Files ? 475
Lines ? 66609
Branches ? 10259
==========================================
Hits ? 45005
Misses ? 18738
Partials ? 2866

    def import_fallback_plugin(cls, plugin_path: str) -> None:
        module_name = os.path.splitext(os.path.basename(plugin_path))[0]
        try:
            import_from_path(module_name, plugin_path)
        except Exception:
            data_processor_logger.exception(
                "Failed to load output fallback module '%s' from %s.", module_name, plugin_path
            )

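
The snippet above relies on an `import_from_path` helper whose body is not shown on this page. A minimal sketch of what such a helper could look like, using the standard `importlib` recipe for loading a module from a file path, follows. The logger name and the boolean return value are assumptions added for illustration; the method in the PR returns `None`.

```python
import importlib.util
import logging
import os
import sys

logger = logging.getLogger("fallback_plugin_sketch")  # hypothetical logger name


def import_from_path(module_name: str, file_path: str):
    """Load a Python file as a module registered under module_name."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {file_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        # Do not leave a half-initialized module behind.
        sys.modules.pop(module_name, None)
        raise
    return module


def import_fallback_plugin(plugin_path: str) -> bool:
    """Import a plugin file; log and return False instead of raising on failure."""
    module_name = os.path.splitext(os.path.basename(plugin_path))[0]
    try:
        import_from_path(module_name, plugin_path)
    except Exception:
        logger.exception("Failed to load output fallback module '%s' from %s.", module_name, plugin_path)
        return False
    return True
```

Logging instead of raising keeps a broken plugin file from preventing server startup, at the cost that the failure is only visible in the logs.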

    def apply(self, text: str, context: OutputFallbackContext) -> str:
        return text + "-suffix"

    def on_delta(self, delta_text: str, context: OutputFallbackContext, state: dict) -> StreamFallbackDecision:

PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-10 19:05:16
📋 Review summary
PR overview: adds an OpenAI output fallback framework and moves fallback forward into the data processor.
Scope of changes: Engine CLI, OpenAI serving, DataProcessor, output fallback manager/plugin, and related unit tests.
Impact tags: [APIServer] [DataProcessor] [Engine]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:245 | When FD_ENABLE_ASYNC_LLM=1, the fallback manager is attached only to engine_client, so the v1/AsyncLLM output path never runs fallback |
| 🟡 Suggestion | fastdeploy/input/base_processor.py:394 | Non-streaming empty output is short-circuited by and full_text, so strategies cannot implement an empty-reply fallback |
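Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | on_finish truncate action is not propagated to the caller | |
| F2 | on_finish flush handling across multiple strategies does not follow chained semantics | |
| F3 | on_finish context still carries the original delta_text | |
| F4 | on_delta is passed the accumulated buffer instead of the current delta | |
| F5 | trial state is not written back when hold returns early | |
| F6 | on_finish with a non-empty buffer still calls strategy.on_delta first | |
| F8 | enable_mm_output text branch disables fallback | ✅ Fixed |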
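📝 PR convention check
The title has been changed from the earlier double-tag form to a single tag, but the current title [Feature]Add output fallback support for OpenAI serving still lacks a space after the tag. The PR description is structurally complete.
Suggested title (can be copied directly):
[Feature] Add output fallback support for OpenAI serving
Overall assessment
This round reviewed, in order of risk, the fallback manager wiring, the DataProcessor pre-processing, and the OpenAI streaming / non-streaming consumption paths. The non-async/v0 path is essentially connected end to end, but in async/v1 mode the startup arguments currently have no effect, and that needs to be fixed first.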

    # (content / reasoning / tool calls) benefit. Serving handlers no longer
    # invoke the manager themselves.
    if output_fallback_manager is not None and getattr(engine_client, "data_processor", None) is not None:
        engine_client.data_processor.output_fallback_manager = output_fallback_manager


    # Apply output fallback to the full raw text BEFORE reasoning /
    # tool parsing so all sub-streams (content, reasoning, tools)
    # benefit from the rewrite.
    if output_fallback_manager is not None and full_text:

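🟡 Suggestion: the current guard lets non-streaming empty output bypass fallback.
OutputFallbackStrategy.should_apply() is already the strategy's own decision point. The and full_text condition here prevents strategies from handling an empty string, for example replacing an empty model reply with a default fallback message. The streaming path has no such restriction, so streaming and non-streaming behavior diverge.
Recommendation: remove and full_text and always call output_fallback_manager.apply(full_text, context); whether to handle empty text is then decided by strategy.should_apply().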
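
The difference between the two guards can be shown with a small self-contained sketch. `EmptyReplyFallback`, `apply_all`, and the two `postprocess_*` functions are hypothetical names written for illustration; they drop the context argument that the real `should_apply()` / `apply()` take, and they are not the code in base_processor.py.

```python
class EmptyReplyFallback:
    """Hypothetical strategy: replace an empty reply with a default message."""

    def __init__(self, default_text: str = "Sorry, I could not generate a response."):
        self.default_text = default_text

    def should_apply(self, text: str) -> bool:
        return text.strip() == ""

    def apply(self, text: str) -> str:
        return self.default_text


def apply_all(strategies, text: str) -> str:
    """Chain strategies; each one decides for itself via should_apply()."""
    for strategy in strategies:
        if strategy.should_apply(text):
            text = strategy.apply(text)
    return text


def postprocess_guarded(strategies, full_text: str) -> str:
    # Mirrors the reviewed guard: empty text never reaches the strategies.
    if strategies and full_text:
        return apply_all(strategies, full_text)
    return full_text


def postprocess_unguarded(strategies, full_text: str) -> str:
    # Mirrors the suggestion: always call, and let should_apply() decide.
    if strategies:
        return apply_all(strategies, full_text)
    return full_text
```

With the guard in place the empty-reply strategy is never consulted, while non-empty text behaves the same in both variants.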
Motivation
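The current inference pipeline lacks a unified output fallback extension mechanism. A team that wants fallback handling for model output has to adapt each downstream stage separately, which is hard to manage consistently.
This PR introduces an output fallback framework and moves the actual fallback processing forward into the data processor, so the raw decoded stream is handled uniformly before reasoning/tool parsing. This ensures that content text, reasoning content, and tool-call-related text all share the same fallback logic, and it gives future custom fallback strategies a single entry point.
Modifications
This PR mainly contains the following changes:
- Add the output fallback framework
  - fastdeploy/output/fallback/ module
  - OutputFallbackStrategy abstract base class
  - OutputFallbackContext
  - StreamFallbackDecision
  - OutputFallbackManager
- Add the output fallback plugin loading mechanism
  - fastdeploy.plugins.output_fallback
  - fastdeploy.output_fallback_plugins: automatic plugin loading
  - --output-fallback-plugin: dynamic import from an external plugin path
- Add output fallback startup arguments
  - --output-fallback
  - --output-fallback-plugin
  - --output-fallback-config
- Move the application of output fallback forward into the data processor
  - fastdeploy/input/base_processor.py gains output_fallback_manager
  - process_response_dict_normal() applies fallback to the complete output text
  - process_response_dict_streaming() applies fallback to the streaming delta text
- Support fallback control semantics in streaming
  - send: send the current delta
  - hold: buffer the current delta and emit nothing this round
  - flush: emit the buffered content when the stream ends
  - truncate: send the current text and terminate further generation early
- Add processor-side fallback state management
  - fallback_decode_status
- Extend the request / output data structures
  - CompletionOutput gains: fallback_truncated, skipped
- Add tests
  - tests/output/test_fallback.py
Usage or Command
Enable a specific fallback strategy:
Load a custom fallback plugin:
Pass configuration to a strategy:
--output-fallback your-strategy-name \
--output-fallback-plugin /path/to/custom_fallback.py \
--output-fallback-config '{"your-strategy-name": {"key": "value"}}'
How to add a custom fallback strategy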
A custom strategy can be added by subclassing OutputFallbackStrategy and registering it with OutputFallbackManager.register(...).
Example:
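
A minimal, self-contained sketch of the subclass-and-register pattern follows. `Context`, `Strategy`, and `Manager` are local stand-ins for `OutputFallbackContext`, `OutputFallbackStrategy`, and `OutputFallbackManager`; their fields, the decorator form of `register`, and the way per-strategy config is passed to the constructor are all assumptions, so check the real classes in fastdeploy/output/fallback/ before copying anything.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Stand-in for OutputFallbackContext; the real fields may differ."""
    request_id: str = ""
    extra: dict = field(default_factory=dict)


class Strategy:
    """Stand-in for OutputFallbackStrategy."""

    def __init__(self, config=None):
        self.config = config or {}

    def should_apply(self, text: str, context: Context) -> bool:
        return True

    def apply(self, text: str, context: Context) -> str:
        raise NotImplementedError


class Manager:
    """Stand-in for OutputFallbackManager: a name-to-class registry plus a chain."""

    _registry = {}

    @classmethod
    def register(cls, name):
        # Assumed decorator form; the real register(...) signature is not shown.
        def decorator(strategy_cls):
            cls._registry[name] = strategy_cls
            return strategy_cls
        return decorator

    def __init__(self, names, configs=None):
        configs = configs or {}
        self.strategies = [self._registry[n](configs.get(n)) for n in names]

    def apply(self, text: str, context: Context) -> str:
        for strategy in self.strategies:
            if strategy.should_apply(text, context):
                text = strategy.apply(text, context)
        return text


@Manager.register("append-suffix")
class AppendSuffix(Strategy):
    """Toy strategy: append a configurable suffix to the output."""

    def apply(self, text: str, context: Context) -> str:
        return text + self.config.get("suffix", "-suffix")
```

In this sketch the list of names plays the role of `--output-fallback`, and the `configs` mapping has the same shape as the JSON passed to `--output-fallback-config`, for example `Manager(["append-suffix"], {"append-suffix": {"suffix": "!"}})`.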
Notes on custom strategies:
There are two ways to load a strategy:
Accuracy Tests
Checklist
- Tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.