Skip to content

llm_runner: plumb prefill temperature#20244

Merged
mergennachin merged 1 commit into
mainfrom
llm-pr-d-prefill-temperature-logits
Jun 15, 2026
Merged

llm_runner: plumb prefill temperature#20244
mergennachin merged 1 commit into
mainfrom
llm-pr-d-prefill-temperature-logits

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.

#20001

Copilot AI review requested due to automatic review settings June 12, 2026 20:15
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR threads sampling temperature through TextPrefiller so the first sampled token (produced during prefill in session-based serving) uses the same sampling inputs as subsequent decode steps, and exposes TextTokenGenerator’s logit-processor application to keep decode paths consistent.

Changes:

  • Expose TextTokenGenerator::apply_logit_processors() (and is_eos()) so token-step callers can reuse the same logit-processing logic as generate().
  • Extend TextPrefiller::prefill() / prefill_chunk() to accept an optional temperature, applied only to the final chunk’s sampled token.
  • Update TextPrefiller unit tests for the new prefill_chunk signature (but currently without asserting temperature behavior).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
extension/llm/runner/text_token_generator.h Adds public helpers for applying logit processors and EOS checking; generate() now calls the helper.
extension/llm/runner/text_prefiller.h Adds temperature parameter to prefill APIs (currently via virtual signature change + default arg).
extension/llm/runner/text_prefiller.cpp Threads temperature into logits_to_token() and ensures only the last chunk uses non-greedy sampling.
extension/llm/runner/test/test_text_prefiller.cpp Updates mocks/expectations for new prefill_chunk signature; does not yet assert temperature forwarding/last-chunk behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread extension/llm/runner/text_prefiller.h Outdated
Comment thread extension/llm/runner/text_prefiller.h Outdated
Comment thread extension/llm/runner/test/test_text_prefiller.cpp Outdated
Comment thread extension/llm/runner/test/test_text_prefiller.cpp
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review this PR

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 5m 57s —— View job


Review: llm_runner: plumb prefill temperature

  • Read changed files
  • Analyze correctness & API design
  • Check test coverage

Overall this is a clean, well-scoped change. The committed state already addresses the Copilot reviewer's concerns: prefill() / prefill_chunk() keep separate 2-arg (greedy) and 3-arg virtual overloads with no default argument, the 2-arg forwards to the 3-arg with 0.0f, and tests now assert temperature forwarding, the greedy default, and last-chunk-only sampling. Default behavior is unchanged for existing callers.

Main point worth discussing

Logit processors are not applied to the prefill-sampled token, so the divergence the PR targets is only partially closed.

The stated goal is that "the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop." generate() does:

apply_logit_processors(logits_tensor);   // grammar/penalties/top-k/top-p
cur_token = logits_to_token(logits_tensor, temperature);

but prefill_chunk() (text_prefiller.cpp:116, :153) samples with temperature only:

cur_token = text_decoder_runner_->logits_to_token(outputs_res.get(), temperature);

So a request that uses both temperature and a logit processor (grammar mask, repetition penalty, top-k/top-p) would still get a first token that diverges from the decode loop. This is presumably why apply_logit_processors() was made public — the session orchestrator is expected to apply it — but TextPrefiller has no handle to the processors, so the divergence isn't actually removed inside this layer. Worth a sentence in the PR description / a TODO clarifying that processors are applied by the session caller, not the prefiller, so the next reader doesn't assume prefill is fully consistent with decode.

Minor

  • Dual virtual overloads are slightly fragile. Because both the 2-arg and 3-arg prefill_chunk are virtual, a subclass that overrides only the 2-arg version will not have its override invoked by prefill() (which always calls the 3-arg form). The new TwoArgumentPrefillChunkOverrideStillDispatches test only checks direct 2-arg dispatch, not routing through prefill(). No in-tree subclass overrides these (only the test mocks do), so this is theoretical — but if back-compat for 2-arg overriders is the goal, that path isn't actually exercised. The cleaner alternative is a single virtual 3-arg method with a non-virtual inline 2-arg forwarder; the current shape is a fine compromise if you specifically want 2-arg mocks to keep working.

  • apply_logit_processors() and is_eos() are exposed with no in-tree caller yet. The doc comments reference a session decode_one() that doesn't exist in this PR. That's acceptable given the stated "remove a divergence point before session-backed serving" intent, just flagging that they're public API ahead of their consumer.

Tests

Coverage on the new path is good: temperature forwarding, greedy 2-arg default, last-chunk-only temperature, and error propagation are all asserted. The Copilot review's test requests are satisfied by the committed version.

No blocking issues. The logit-processor consistency note above is the one item I'd recommend clarifying before session serving starts depending on these primitives.
· branch llm-pr-d-prefill-temperature-logits

}

/// Whether `token` is an end-of-sequence token (used by single-step decode).
inline bool is_eos(uint64_t token) const {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems no one using it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing for now

EXPECT_EQ(result.get(), 42);
}

TEST_F(TextPrefillerTest, TwoArgumentPrefillUsesGreedyTemperature) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz add tests here to guard the temperature range. Noramlly it should be in [0, 1] and we should raise error if temperature is out of range

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.
Copilot AI review requested due to automatic review settings June 15, 2026 21:16
@mergennachin mergennachin force-pushed the llm-pr-d-prefill-temperature-logits branch from 878f15c to 4a38f8f Compare June 15, 2026 21:16
@mergennachin mergennachin merged commit e257a71 into main Jun 15, 2026
174 of 176 checks passed
@mergennachin mergennachin deleted the llm-pr-d-prefill-temperature-logits branch June 15, 2026 21:19

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.

Comment on lines +91 to +95
ET_CHECK_OR_RETURN_ERROR(
temperature >= 0.0f && temperature <= 1.0f,
InvalidArgument,
"Temperature must be in [0, 1], got %f",
static_cast<double>(temperature));
Comment on lines +40 to +44
ET_CHECK_OR_RETURN_ERROR(
temperature >= 0.0f && temperature <= 1.0f,
InvalidArgument,
"Temperature must be in [0, 1], got %f",
static_cast<double>(temperature));
Comment on lines +100 to +104
ET_CHECK_OR_RETURN_ERROR(
temperature >= 0.0f && temperature <= 1.0f,
InvalidArgument,
"Temperature must be in [0, 1], got %f",
static_cast<double>(temperature));
std::vector<uint64_t>& prompt_tokens,
int64_t& start_pos,
float temperature) {
ET_CHECK_MSG(!prompt_tokens.empty(), "Prompt cannot be null");
Comment on lines +69 to +76
// Only the final chunk samples the first generated token.
const bool is_last_chunk =
num_tokens_to_process + num_tokens_to_prefill_with >=
num_prompt_tokens;
auto chunk_result = prefill_chunk(
prompt_tokens_to_process,
start_pos,
is_last_chunk ? temperature : 0.0f);
Comment on lines 732 to 735
| `max_new_tokens` | `int32_t` | `-1` | Maximum new tokens to generate (-1 = use available context) |
| `seq_len` | `int32_t` | `1024` | Total sequence length including prompt |
| `temperature` | `float` | `0.8f` | Sampling temperature (0.0 = deterministic, 1.0+ = creative) |
| `temperature` | `float` | `0.8f` | Sampling temperature in [0.0, 1.0] (0.0 = deterministic) |
| `echo` | `bool` | `true` | Whether to echo the input prompt |
Comment on lines 823 to 828
GenerationConfig config;
config.temperature = 0.1f; // Very deterministic
runner->generate(factual_prompt, config, callback);

config.temperature = 1.2f; // Very creative
config.temperature = 1.0f; // Highest supported temperature
runner->generate(creative_prompt, config, callback);
Comment on lines +712 to +730
TEST_F(RunnerTest, TextTokenGeneratorRejectsTemperatureOutOfRange) {
auto tokenizer = createMockTokenizer();
auto text_decoder_runner = createMockTextDecoderRunner();
Stats stats;
auto generator = createTextTokenGenerator(
tokenizer.get(), text_decoder_runner.get(), &stats);

std::vector<uint64_t> tokens = {1, 2, 3};
EXPECT_CALL(*text_decoder_runner, step(_, _)).Times(0);

EXPECT_EQ(
generator->generate(tokens, 3, 3, -0.1f, [](const std::string&) {})
.error(),
Error::InvalidArgument);
EXPECT_EQ(
generator->generate(tokens, 3, 3, 1.1f, [](const std::string&) {})
.error(),
Error::InvalidArgument);
}
Comment on lines +196 to +201
EXPECT_EQ(
prefiller->prefill(prompt_tokens, start_pos, -0.1f).error(),
Error::InvalidArgument);
EXPECT_EQ(
prefiller->prefill(prompt_tokens, start_pos, 1.1f).error(),
Error::InvalidArgument);
Comment on lines +370 to +375
EXPECT_EQ(
prefiller->prefill_chunk(prompt_tokens, start_pos, -0.1f).error(),
Error::InvalidArgument);
EXPECT_EQ(
prefiller->prefill_chunk(prompt_tokens, start_pos, 1.1f).error(),
Error::InvalidArgument);
Gasoonjia added a commit that referenced this pull request Jun 17, 2026
Summary:
Forward fix for D108707577 (llm_runner: plumb prefill temperature,
#20244).

The new test RunnerTest.TextTokenGeneratorRejectsTemperatureOutOfRange
aborts:

  [ RUN ] RunnerTest.TextTokenGeneratorRejectsTemperatureOutOfRange
ExecuTorch PAL must be initialized before call to et_pal_current_ticks()
  *** Signal 6 (SIGABRT) ***

The new test drives the temperature-rejection path, which emits an
ET_LOG (and
thus calls et_pal_current_ticks()) before any model load. The RunnerTest
fixture
never initializes the ExecuTorch runtime, so the timer call aborts. The
valid-
temperature tests in the same fixture pass because their happy path does
not log.

Fix: initialize the runtime in RunnerTest::SetUp(), matching the
established
pattern used by the sibling tests in this same directory
(test_text_prefiller,
test_util, test_wav_loader all call executorch::runtime::runtime_init()
in
SetUp()). Applied to both the fbcode and xplat copies.

Differential Revision: D108831322
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants