llm_runner: plumb prefill temperature by mergennachin · Pull Request #20244 · pytorch/executorch

mergennachin · 2026-06-12T20:15:36Z

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.

#20001

pytorch-bot · 2026-06-12T20:15:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-12T20:16:29Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

This PR threads sampling temperature through TextPrefiller so the first sampled token (produced during prefill in session-based serving) uses the same sampling inputs as subsequent decode steps, and exposes TextTokenGenerator’s logit-processor application to keep decode paths consistent.

Changes:

Expose TextTokenGenerator::apply_logit_processors() (and is_eos()) so token-step callers can reuse the same logit-processing logic as generate().
Extend TextPrefiller::prefill() / prefill_chunk() to accept an optional temperature, applied only to the final chunk’s sampled token.
Update TextPrefiller unit tests for the new prefill_chunk signature (but currently without asserting temperature behavior).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
extension/llm/runner/text_token_generator.h	Adds public helpers for applying logit processors and EOS checking; `generate()` now calls the helper.
extension/llm/runner/text_prefiller.h	Adds `temperature` parameter to prefill APIs (currently via virtual signature change + default arg).
extension/llm/runner/text_prefiller.cpp	Threads `temperature` into `logits_to_token()` and ensures only the last chunk uses non-greedy sampling.
extension/llm/runner/test/test_text_prefiller.cpp	Updates mocks/expectations for new `prefill_chunk` signature; does not yet assert temperature forwarding/last-chunk behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mergennachin · 2026-06-12T21:20:39Z

@claude Review this PR

claude · 2026-06-12T21:21:21Z

Claude finished @mergennachin's task in 5m 57s —— View job

Review: `llm_runner: plumb prefill temperature`

Read changed files
Analyze correctness & API design
Check test coverage

Overall this is a clean, well-scoped change. The committed state already addresses the Copilot reviewer's concerns: prefill() / prefill_chunk() keep separate 2-arg (greedy) and 3-arg virtual overloads with no default argument, the 2-arg forwards to the 3-arg with 0.0f, and tests now assert temperature forwarding, the greedy default, and last-chunk-only sampling. Default behavior is unchanged for existing callers.

Main point worth discussing

Logit processors are not applied to the prefill-sampled token, so the divergence the PR targets is only partially closed.

The stated goal is that "the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop." generate() does:

apply_logit_processors(logits_tensor);   // grammar/penalties/top-k/top-p
cur_token = logits_to_token(logits_tensor, temperature);

but prefill_chunk() (text_prefiller.cpp:116, :153) samples with temperature only:

cur_token = text_decoder_runner_->logits_to_token(outputs_res.get(), temperature);

So a request that uses both temperature and a logit processor (grammar mask, repetition penalty, top-k/top-p) would still get a first token that diverges from the decode loop. This is presumably why apply_logit_processors() was made public — the session orchestrator is expected to apply it — but TextPrefiller has no handle to the processors, so the divergence isn't actually removed inside this layer. Worth a sentence in the PR description / a TODO clarifying that processors are applied by the session caller, not the prefiller, so the next reader doesn't assume prefill is fully consistent with decode.

Minor

Dual virtual overloads are slightly fragile. Because both the 2-arg and 3-arg prefill_chunk are virtual, a subclass that overrides only the 2-arg version will not have its override invoked by prefill() (which always calls the 3-arg form). The new TwoArgumentPrefillChunkOverrideStillDispatches test only checks direct 2-arg dispatch, not routing through prefill(). No in-tree subclass overrides these (only the test mocks do), so this is theoretical — but if back-compat for 2-arg overriders is the goal, that path isn't actually exercised. The cleaner alternative is a single virtual 3-arg method with a non-virtual inline 2-arg forwarder; the current shape is a fine compromise if you specifically want 2-arg mocks to keep working.
apply_logit_processors() and is_eos() are exposed with no in-tree caller yet. The doc comments reference a session decode_one() that doesn't exist in this PR. That's acceptable given the stated "remove a divergence point before session-backed serving" intent, just flagging that they're public API ahead of their consumer.

Tests

Coverage on the new path is good: temperature forwarding, greedy 2-arg default, last-chunk-only temperature, and error propagation are all asserted. The Copilot review's test requests are satisfied by the committed version.

No blocking issues. The logit-processor consistency note above is the one item I'd recommend clarifying before session serving starts depending on these primitives.
· branch llm-pr-d-prefill-temperature-logits

Gasoonjia · 2026-06-15T20:44:26Z

  }

+  /// Whether `token` is an end-of-sequence token (used by single-step decode).
+  inline bool is_eos(uint64_t token) const {


seems no one using it?

removing for now

Gasoonjia · 2026-06-15T20:46:38Z

  EXPECT_EQ(result.get(), 42);
 }

+TEST_F(TextPrefillerTest, TwoArgumentPrefillUsesGreedyTemperature) {


plz add tests here to guard the temperature range. Noramlly it should be in [0, 1] and we should raise error if temperature is out of range

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens. This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives. Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.

+    ET_CHECK_OR_RETURN_ERROR(
+        temperature >= 0.0f && temperature <= 1.0f,
+        InvalidArgument,
+        "Temperature must be in [0, 1], got %f",
+        static_cast<double>(temperature));


+  ET_CHECK_OR_RETURN_ERROR(
+      temperature >= 0.0f && temperature <= 1.0f,
+      InvalidArgument,
+      "Temperature must be in [0, 1], got %f",
+      static_cast<double>(temperature));


+  ET_CHECK_OR_RETURN_ERROR(
+      temperature >= 0.0f && temperature <= 1.0f,
+      InvalidArgument,
+      "Temperature must be in [0, 1], got %f",
+      static_cast<double>(temperature));


+    std::vector<uint64_t>& prompt_tokens,
+    int64_t& start_pos,
+    float temperature) {
  ET_CHECK_MSG(!prompt_tokens.empty(), "Prompt cannot be null");


+      // Only the final chunk samples the first generated token.
+      const bool is_last_chunk =
+          num_tokens_to_process + num_tokens_to_prefill_with >=
+          num_prompt_tokens;
+      auto chunk_result = prefill_chunk(
+          prompt_tokens_to_process,
+          start_pos,
+          is_last_chunk ? temperature : 0.0f);


 | `max_new_tokens` | `int32_t` | `-1` | Maximum new tokens to generate (-1 = use available context) |
 | `seq_len` | `int32_t` | `1024` | Total sequence length including prompt |
-| `temperature` | `float` | `0.8f` | Sampling temperature (0.0 = deterministic, 1.0+ = creative) |
+| `temperature` | `float` | `0.8f` | Sampling temperature in [0.0, 1.0] (0.0 = deterministic) |
 | `echo` | `bool` | `true` | Whether to echo the input prompt |


 GenerationConfig config;
 config.temperature = 0.1f;  // Very deterministic
 runner->generate(factual_prompt, config, callback);

-config.temperature = 1.2f;  // Very creative
+config.temperature = 1.0f;  // Highest supported temperature
 runner->generate(creative_prompt, config, callback);


+TEST_F(RunnerTest, TextTokenGeneratorRejectsTemperatureOutOfRange) {
+  auto tokenizer = createMockTokenizer();
+  auto text_decoder_runner = createMockTextDecoderRunner();
+  Stats stats;
+  auto generator = createTextTokenGenerator(
+      tokenizer.get(), text_decoder_runner.get(), &stats);
+
+  std::vector<uint64_t> tokens = {1, 2, 3};
+  EXPECT_CALL(*text_decoder_runner, step(_, _)).Times(0);
+
+  EXPECT_EQ(
+      generator->generate(tokens, 3, 3, -0.1f, [](const std::string&) {})
+          .error(),
+      Error::InvalidArgument);
+  EXPECT_EQ(
+      generator->generate(tokens, 3, 3, 1.1f, [](const std::string&) {})
+          .error(),
+      Error::InvalidArgument);
+}


+  EXPECT_EQ(
+      prefiller->prefill(prompt_tokens, start_pos, -0.1f).error(),
+      Error::InvalidArgument);
+  EXPECT_EQ(
+      prefiller->prefill(prompt_tokens, start_pos, 1.1f).error(),
+      Error::InvalidArgument);


+  EXPECT_EQ(
+      prefiller->prefill_chunk(prompt_tokens, start_pos, -0.1f).error(),
+      Error::InvalidArgument);
+  EXPECT_EQ(
+      prefiller->prefill_chunk(prompt_tokens, start_pos, 1.1f).error(),
+      Error::InvalidArgument);


Summary: Forward fix for D108707577 (llm_runner: plumb prefill temperature, #20244). The new test RunnerTest.TextTokenGeneratorRejectsTemperatureOutOfRange aborts: [ RUN ] RunnerTest.TextTokenGeneratorRejectsTemperatureOutOfRange ExecuTorch PAL must be initialized before call to et_pal_current_ticks() *** Signal 6 (SIGABRT) *** The new test drives the temperature-rejection path, which emits an ET_LOG (and thus calls et_pal_current_ticks()) before any model load. The RunnerTest fixture never initializes the ExecuTorch runtime, so the timer call aborts. The valid- temperature tests in the same fixture pass because their happy path does not log. Fix: initialize the runtime in RunnerTest::SetUp(), matching the established pattern used by the sibling tests in this same directory (test_text_prefiller, test_util, test_wav_loader all call executorch::runtime::runtime_init() in SetUp()). Applied to both the fbcode and xplat copies. Differential Revision: D108831322

Copilot AI review requested due to automatic review settings June 12, 2026 20:15

mergennachin requested a review from larryliu0820 as a code owner June 12, 2026 20:15

mergennachin temporarily deployed to cadence June 12, 2026 20:15 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026

mergennachin requested review from kirklandsign and psiddh June 12, 2026 20:15

Copilot started reviewing on behalf of mergennachin June 12, 2026 20:16 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread extension/llm/runner/text_prefiller.h Outdated

Comment thread extension/llm/runner/text_prefiller.h Outdated

Comment thread extension/llm/runner/test/test_text_prefiller.cpp Outdated

Comment thread extension/llm/runner/test/test_text_prefiller.cpp

mergennachin force-pushed the llm-pr-d-prefill-temperature-logits branch from 7fde821 to 878f15c Compare June 12, 2026 21:12

mergennachin temporarily deployed to cadence June 12, 2026 21:12 — with GitHub Actions Inactive

Gasoonjia approved these changes Jun 15, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 15, 2026 21:16

mergennachin force-pushed the llm-pr-d-prefill-temperature-logits branch from 878f15c to 4a38f8f Compare June 15, 2026 21:16

mergennachin temporarily deployed to cadence June 15, 2026 21:16 — with GitHub Actions Inactive

Copilot started reviewing on behalf of mergennachin June 15, 2026 21:16 View session

mergennachin merged commit e257a71 into main Jun 15, 2026
174 of 176 checks passed

mergennachin deleted the llm-pr-d-prefill-temperature-logits branch June 15, 2026 21:19

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Gasoonjia mentioned this pull request Jun 17, 2026

Init runtime PAL in RunnerTest (forward fix for D108707577) #20319

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm_runner: plumb prefill temperature#20244

llm_runner: plumb prefill temperature#20244
mergennachin merged 1 commit into
mainfrom
llm-pr-d-prefill-temperature-logits

mergennachin commented Jun 12, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Gasoonjia Jun 15, 2026

Uh oh!

mergennachin Jun 15, 2026

Uh oh!

Gasoonjia Jun 15, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mergennachin commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244

Uh oh!

github-actions Bot commented Jun 12, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: llm_runner: plumb prefill temperature

Main point worth discussing

Minor

Tests

Uh oh!

Gasoonjia Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

mergennachin Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Gasoonjia Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mergennachin commented Jun 12, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 12, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 12, 2026 •

edited

Loading

Review: `llm_runner: plumb prefill temperature`