[blog]Serving Ling-2.6-1T on TPU with SGLang-JAX by JamesBrianD · Pull Request #348 · lm-sys/lm-sys.github.io

JamesBrianD · 2026-06-11T18:49:52Z

Summary

Add a new Ling-2.6-1T TPU serving blog post.
Document Fused MoE V2, the TPU execution model, V1/V2 kernel changes, and benchmark results.
Add PNG assets for the hero image, throughput charts, TPU execution model, MoE pipeline timelines, and overlap breakdown.

Validation

npm run build

- Retitle to highlight the Pallas fused MoE kernel core (hiding data movement behind compute)
- Fix non-native phrasing, dangling modifiers, and naming consistency (Fused MoE V1/V2, FusedEPMoE, fp8, hidden-dimension slices)
- Credit Fused MoE V1 authors (tpu-inference) and add SGLang-JAX adapted-kernel reference; renumber references
- Move full TPU-vs-GPU comparison (incl. prefill gap) to benchmarks next to the GLA prefill note; renumber figures
- Deduplicate: GLA section, memory pools, DP, future-work bullets, AIME result; fold Accuracy section into appendix
- Clarify measurement scopes (in-kernel vs standalone all-to-all) and per-device vs per-chip specs; restore TPU v7x spec note in appendix

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexnails · 2026-06-13T16:47:54Z

+With Fused MoE V2, MoE prefill latency drops from **5.16 ms to 2.42 ms** — and on the same SGLang decode benchmark, **16 TPU v7x chips reach 1.29×–1.77× the output throughput of 16 H200 GPUs**. The full numbers are below.
+
+<img src="/images/blog/2026-06-11-ling-2-6-tpu/hero.png" alt="Ling-2.6-1T decode throughput, TPU v7x vs GPU H200" style="display:block; margin: 2.5em auto 0.5em auto; width: 100%; max-width: 900px;" />
+<p style="text-align: center; color: #666; font-style: italic;">Figure 1. Ling-2.6-1T decode throughput on TPU v7x-16 vs H200×16, using the same SGLang benchmark with 16,384-token input and 1,024-token output.</p>


Is this the random dataset?

Yes, SGLang's default `random` dataset (sampled from ShareGPT).

alexnails · 2026-06-13T16:50:22Z

+- **TPU vs H200 decode:** TPU v7x-16 delivers **1.29×** the decode output throughput of H200×16 at `mc=128`, and **1.77×** at `mc=512`.
+- **Beyond MoE:** The full Ling-2.6-1T bring-up also includes hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.
+
+For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections.


Minor / open to feedback: suggest "Ling-2.6-1T is a 1T sparse MoE model with ...".

The "For the rest of the post, only a few ..." framing is not really needed. We are basically just adding a TL;DR model description for Ling-2.6-1T.

Let's look at previous LMSYS blogs? (I think day-0 posts plus some model optimization posts are good reference material to review against.)
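As a quick sanity check on the model description discussed in this thread, the sparsity it implies can be worked out directly from the quoted figures. This is only a sketch: the variable names are illustrative, and "1T" is treated as exactly 10^12 parameters, which is an approximation.

```python
# Figures quoted in the blog excerpt under review.
TOTAL_PARAMS = 1e12          # "1T" treated as exactly 1e12 (approximation)
ACTIVE_PARAMS = 63e9         # 63B activated parameters per token
ROUTED_EXPERTS = 256
TOP_K = 8
SHARED_EXPERTS = 1

# Fraction of all parameters touched by a single token.
active_param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Fraction of routed experts selected per token, and total experts run per token.
routed_expert_fraction = TOP_K / ROUTED_EXPERTS
experts_per_token = TOP_K + SHARED_EXPERTS

print(f"activated parameters: {active_param_fraction:.1%}")
print(f"routed experts used:  {routed_expert_fraction:.3%}")
print(f"experts per token:    {experts_per_token}")
```

Under these assumptions, each token uses roughly 6% of the parameters and about 3% of the routed experts, which is why expert routing and data movement dominate the kernel discussion.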

- Remove em-dashes from prose (use commas/colons/semicolons), per review
- Spell out MoE/MXU/VPU on first use; add TPU system architecture link
- Tighten model intro and MoE section heading; drop duplicated latency line
- Note the SGLang `random` (ShareGPT) benchmark dataset in Fig. 1 + config
- Rephrase DP/EP scaling and EPLB/GLA bullets without notation-heavy dashes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexnails

🚀

Use [n] [Title](url) format (matching 2026-05-28-mori), making the descriptive title the link and dropping the em-dash + duplicated raw URL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

hlFu · 2026-06-15T01:34:42Z

+
+#### Activation quantization
+
+V2 quantizes activations from bf16 to fp8 before scatter, directly halving the routed token payload. On Ling 16,384 prefill, the in-kernel scatter stage falls from **1.39 ms** to **0.65 ms**.


Is this a pure gain or with trade-offs?

Good question. The activation quantization is dynamic (per-token fp8), which Ling-2.6-1T already supports, and we saw no accuracy regression in our evaluations.
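To make the trade-off in this thread concrete, here is a minimal pure-Python sketch of dynamic per-token scaling and of the scatter payload arithmetic. It is not the Fused MoE V2 kernel. The e4m3 format (maximum 448), the float32 per-token scale, the hidden size of 8192, and all function names are assumptions made for illustration; the thread only states "dynamic per-token fp8".

```python
# Illustrative sketch only; this is not the Fused MoE V2 kernel.
FP8_E4M3_MAX = 448.0   # largest finite e4m3 value (format choice is an assumption)
BYTES_BF16 = 2
BYTES_FP8 = 1
BYTES_SCALE = 4        # assumption: one float32 scale travels with each token


def per_token_scale(token):
    """Dynamic scale: map this token's largest magnitude onto the fp8 range."""
    amax = max(abs(v) for v in token)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_token(token):
    """Divide by the per-token scale so every value fits in [-448, 448].

    A real kernel would then cast to fp8; the cast and its rounding are
    omitted here.
    """
    scale = per_token_scale(token)
    return [v / scale for v in token], scale


def scatter_payload_bytes(num_tokens, hidden_size, top_k, use_fp8):
    """Bytes moved if every token is copied to each of its top_k experts."""
    per_copy = hidden_size * (BYTES_FP8 if use_fp8 else BYTES_BF16)
    if use_fp8:
        per_copy += BYTES_SCALE
    return num_tokens * top_k * per_copy


HIDDEN_SIZE = 8192  # hypothetical; the thread does not state it
bf16_bytes = scatter_payload_bytes(16_384, HIDDEN_SIZE, 8, use_fp8=False)
fp8_bytes = scatter_payload_bytes(16_384, HIDDEN_SIZE, 8, use_fp8=True)
payload_ratio = fp8_bytes / bf16_bytes
print(f"fp8 payload is {payload_ratio:.4f} of the bf16 payload")
```

Under these assumptions the per-token scale adds only a few bytes, so the payload is almost exactly halved. That is in line with the reported in-kernel scatter time dropping from 1.39 ms to 0.65 ms, a factor of about 2.1.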

- Note that activation quantization is dynamic per-token fp8 with no observed accuracy regression (addresses review question)
- Introduce "JAX devices (two per v7x chip)" on first use, shorten later mentions to "devices"
- "MoE cost" -> "MoE's operational cost" per review nit

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Author: Fu Haolin -> Haolin Fu (given-name-first, consistent with the rest)
- Acknowledgments: YuHong Guo -> Yuhong Guo; add 0xaskr to SGLang-JAX team

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

blog: add Ling-2.6 TPU serving post

08d02a1

JamesBrianD force-pushed the blog/ling-2-6-tpu branch from e044db3 to 08d02a1 on June 11, 2026 18:58

JamesBrianD marked this pull request as ready for review June 11, 2026 18:58

JamesBrianD and others added 2 commits June 12, 2026 17:06

blog: update Ling-2.6 TPU post acknowledgments

3f7e7d1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexnails reviewed Jun 13, 2026

alexnails approved these changes Jun 14, 2026

Comment thread blog/2026-06-11-ling-2-6-tpu.md Outdated

blog: align Ling-2.6 references with house style

a96f790


hlFu reviewed Jun 15, 2026 (four comment threads on blog/2026-06-11-ling-2-6-tpu.md, two of them outdated)

JamesBrianD and others added 2 commits June 15, 2026 13:08

blog: fix author/ack name order and add 0xaskr

7eb1266


JamesBrianD changed the title ~~Serving Ling-2.6-1T on TPU with SGLang-JAX~~ [blog]Serving Ling-2.6-1T on TPU with SGLang-JAX Jun 15, 2026

hlFu approved these changes Jun 15, 2026


[blog]Serving Ling-2.6-1T on TPU with SGLang-JAX #348
JamesBrianD wants to merge 7 commits into lm-sys:main from JamesBrianD:blog/ling-2-6-tpu

3 participants

