[blog] Serving Ling-2.6-1T on TPU with SGLang-JAX #348
Open
JamesBrianD wants to merge 7 commits into
- Retitle to highlight the Pallas fused MoE kernel core (hiding data movement behind compute)
- Fix non-native phrasing, dangling modifiers, and naming consistency (Fused MoE V1/V2, FusedEPMoE, fp8, hidden-dimension slices)
- Credit Fused MoE V1 authors (tpu-inference) and add SGLang-JAX adapted-kernel reference; renumber references
- Move full TPU-vs-GPU comparison (incl. prefill gap) to benchmarks next to the GLA prefill note; renumber figures
- Deduplicate: GLA section, memory pools, DP, future-work bullets, AIME result; fold Accuracy section into appendix
- Clarify measurement scopes (in-kernel vs standalone all-to-all) and per-device vs per-chip specs; restore TPU v7x spec note in appendix

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
alexnails
reviewed
Jun 13, 2026
With Fused MoE V2, MoE prefill latency drops from **5.16 ms to 2.42 ms**, and on the same SGLang decode benchmark, **16 TPU v7x chips reach 1.29×–1.77× the output throughput of 16 H200 GPUs**. The full numbers are below.
<img src="/images/blog/2026-06-11-ling-2-6-tpu/hero.png" alt="Ling-2.6-1T decode throughput, TPU v7x vs GPU H200" style="display:block; margin: 2.5em auto 0.5em auto; width: 100%; max-width: 900px;" />
<p style="text-align: center; color: #666; font-style: italic;">Figure 1. Ling-2.6-1T decode throughput on TPU v7x-16 vs H200×16, using the same SGLang benchmark with 16,384-token input and 1,024-token output.</p>
Author
Yes, SGLang's default `random` dataset (sampled from ShareGPT).
- **TPU vs H200 decode:** TPU v7x-16 delivers **1.29×** the decode output throughput of H200×16 at `mc=128`, and **1.77×** at `mc=512`.
- **Beyond MoE:** The full Ling-2.6-1T bring-up also includes hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.
For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections.
Minor / open to feedback: "Ling-2.6-1T is a 1T sparse MoE model with ....".
"For the rest of the post, only a few" is not really needed. We are basically just adding a TL;DR model description for Ling-2.6-1T.
Let's look at previous LMSys blogs? (I think day-0 posts plus some model optimization posts are good reference material to review against.)
- Remove em-dashes from prose (use commas/colons/semicolons), per review
- Spell out MoE/MXU/VPU on first use; add TPU system architecture link
- Tighten model intro and MoE section heading; drop duplicated latency line
- Note the SGLang `random` (ShareGPT) benchmark dataset in Fig. 1 + config
- Rephrase DP/EP scaling and EPLB/GLA bullets without notation-heavy dashes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
alexnails
approved these changes
Jun 14, 2026
Use [n] [Title](url) format (matching 2026-05-28-mori), making the descriptive title the link and dropping the em-dash + duplicated raw URL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
hlFu
reviewed
Jun 15, 2026
hlFu
reviewed
Jun 15, 2026
#### Activation quantization

V2 quantizes activations from bf16 to fp8 before scatter, directly halving the routed token payload. On a 16,384-token Ling-2.6-1T prefill, the in-kernel scatter stage falls from **1.39 ms** to **0.65 ms**.
Author
Good question. The activation quantization is dynamic (per-token fp8), which Ling-2.6-1T already supports, and we saw no accuracy regression in our evaluations.
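
For readers of this thread, here is a rough illustration of what "dynamic per-token" scaling and the payload halving mean. It is a pure-Python sketch, not the SGLang-JAX kernel. Everything in it is an assumption made for illustration: that fp8 means the e4m3 variant (max finite value 448), that one float32 scale travels with each token, and the hypothetical names `per_token_scale`, `scale_token`, `payload_bytes`, and `HIDDEN`.

```python
# Sketch of dynamic per-token quantization scales and scatter payload sizes.
# Assumptions (not from the PR): fp8 is e4m3 with max finite value 448,
# one float32 scale accompanies each token, and HIDDEN is a placeholder size.

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn
HIDDEN = 8            # hypothetical hidden size, kept tiny for readability


def per_token_scale(token):
    """Return the scale that maps this token's largest magnitude to FP8_E4M3_MAX."""
    amax = max(abs(x) for x in token)
    # Guard against all-zero tokens so the later division is always defined.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_token(token):
    """Divide by the per-token scale; a real kernel would then cast to fp8."""
    s = per_token_scale(token)
    return [x / s for x in token], s


def payload_bytes(num_tokens, hidden, bytes_per_elem, scale_bytes=0):
    """Bytes moved by scatter for num_tokens rows of width hidden."""
    return num_tokens * (hidden * bytes_per_elem + scale_bytes)


tokens = [
    [0.5, -2.0, 1.0, 0.0, 0.25, -0.75, 1.5, -1.0],
    [100.0, -50.0, 25.0, 0.0, 10.0, -5.0, 1.0, 0.5],
]
# Each token gets its own scale, computed at run time from its own values.
scaled = [scale_token(t) for t in tokens]

bf16_bytes = payload_bytes(len(tokens), HIDDEN, 2)               # 2 bytes per element
fp8_bytes = payload_bytes(len(tokens), HIDDEN, 1, scale_bytes=4)  # 1 byte + scale
```

Because the scale is derived from each token at run time, a token with large values does not force a coarse scale on its neighbours. With a realistic hidden size the 4-byte scale is negligible next to the row itself, so the fp8 payload approaches half of the bf16 payload.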
hlFu
reviewed
Jun 15, 2026
hlFu
reviewed
Jun 15, 2026
hlFu
reviewed
Jun 15, 2026
- Note that activation quantization is dynamic per-token fp8 with no observed accuracy regression (addresses review question)
- Introduce "JAX devices (two per v7x chip)" on first use, shorten later mentions to "devices"
- "MoE cost" -> "MoE's operational cost" per review nit

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Author: Fu Haolin -> Haolin Fu (given-name-first, consistent with the rest)
- Acknowledgments: YuHong Guo -> Yuhong Guo; add 0xaskr to SGLang-JAX team

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
hlFu
approved these changes
Jun 15, 2026
Summary
Validation
npm run build