[blog]Serving Ling-2.6-1T on TPU with SGLang-JAX #348

Open: JamesBrianD wants to merge 7 commits into lm-sys:main from JamesBrianD:blog/ling-2-6-tpu

Conversation

@JamesBrianD

Summary

  • Add a new Ling-2.6-1T TPU serving blog post.
  • Document Fused MoE V2, the TPU execution model, V1/V2 kernel changes, and benchmark results.
  • Add PNG assets for the hero image, throughput charts, TPU execution model, MoE pipeline timelines, and overlap breakdown.

Validation

  • npm run build

JamesBrianD marked this pull request as ready for review on June 11, 2026, 18:58
JamesBrianD and others added 2 commits June 12, 2026 17:06
- Retitle to highlight the Pallas fused MoE kernel core (hiding data
  movement behind compute)
- Fix non-native phrasing, dangling modifiers, and naming consistency
  (Fused MoE V1/V2, FusedEPMoE, fp8, hidden-dimension slices)
- Credit Fused MoE V1 authors (tpu-inference) and add SGLang-JAX
  adapted-kernel reference; renumber references
- Move full TPU-vs-GPU comparison (incl. prefill gap) to benchmarks
  next to the GLA prefill note; renumber figures
- Deduplicate: GLA section, memory pools, DP, future-work bullets,
  AIME result; fold Accuracy section into appendix
- Clarify measurement scopes (in-kernel vs standalone all-to-all) and
  per-device vs per-chip specs; restore TPU v7x spec note in appendix

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread blog/2026-06-11-ling-2-6-tpu.md Outdated
With Fused MoE V2, MoE prefill latency drops from **5.16 ms to 2.42 ms**, and on the same SGLang decode benchmark, **16 TPU v7x chips reach 1.29×–1.77× the output throughput of 16 H200 GPUs**. The full numbers are below.

<img src="/images/blog/2026-06-11-ling-2-6-tpu/hero.png" alt="Ling-2.6-1T decode throughput, TPU v7x vs GPU H200" style="display:block; margin: 2.5em auto 0.5em auto; width: 100%; max-width: 900px;" />
<p style="text-align: center; color: #666; font-style: italic;">Figure 1. Ling-2.6-1T decode throughput on TPU v7x-16 vs H200×16, using the same SGLang benchmark with 16,384-token input and 1,024-token output.</p>

Is this the random dataset?

Author

Yes, it is SGLang's default `random` dataset (sampled from ShareGPT).

Comment thread blog/2026-06-11-ling-2-6-tpu.md Outdated
- **TPU vs H200 decode:** TPU v7x-16 delivers **1.29×** the decode output throughput of H200×16 at `mc=128`, and **1.77×** at `mc=512`.
- **Beyond MoE:** The full Ling-2.6-1T bring-up also includes hybrid KV/recurrent memory pools, GLA linear attention, and single-controller data parallelism.

For the rest of the post, only a few Ling-2.6-1T facts matter: it is a **1T sparse MoE** model with **63B activated parameters per token**, **256 routed experts with top-8 routing plus one shared expert**, **per-channel fp8 MoE weights**, and a hybrid **MLA + Lightning Linear** backbone. The MoE structure drives the kernel work in the first half of this post; the hybrid backbone motivates the later memory-pool and GLA bring-up sections.
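The routing shape in the paragraph above (256 routed experts, top-8 per token, plus one shared expert) can be sketched in plain Python. This is only an illustration: the softmax-over-selected-logits gating, the helper names `route_token` and `experts_touched_per_token`, and the random logits are assumptions for the example, not details taken from the post or from Ling-2.6-1T's actual router.

```python
import heapq
import math
import random

NUM_ROUTED_EXPERTS = 256  # from the post
TOP_K = 8                 # from the post
NUM_SHARED_EXPERTS = 1    # from the post


def route_token(logits, top_k=TOP_K):
    """Pick the top_k experts for one token (hypothetical gating).

    Returns (expert_id, weight) pairs whose weights sum to 1. The softmax
    over the selected logits is an assumption made for this sketch.
    """
    top = heapq.nlargest(top_k, range(len(logits)), key=logits.__getitem__)
    peak = max(logits[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def experts_touched_per_token(top_k=TOP_K, shared=NUM_SHARED_EXPERTS):
    """Each token runs through top_k routed experts plus the shared expert."""
    return top_k + shared


if __name__ == "__main__":
    random.seed(0)
    # Made-up router logits for a single token.
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
    for expert_id, weight in route_token(logits):
        print(f"expert {expert_id:3d}  weight {weight:.3f}")
    print("routed experts active per token:", TOP_K / NUM_ROUTED_EXPERTS)
```

Only 8 of the 256 routed experts run for any given token, so each token's activations must be sent to wherever those experts live. That scatter traffic is the data movement the fused MoE kernel aims to hide behind compute.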

Minor / open to feedback: start with "Ling-2.6-1T is a 1T sparse MoE model with ...".

"For the rest of the post, only a few ..." is not really needed. We are basically just adding a TL;DR model description for Ling-2.6-1T.

Let's look at previous LMSys blogs? (I think day-0 posts plus some model optimization posts are good reference material to review against.)

- Remove em-dashes from prose (use commas/colons/semicolons), per review
- Spell out MoE/MXU/VPU on first use; add TPU system architecture link
- Tighten model intro and MoE section heading; drop duplicated latency line
- Note the SGLang `random` (ShareGPT) benchmark dataset in Fig. 1 + config
- Rephrase DP/EP scaling and EPLB/GLA bullets without notation-heavy dashes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@alexnails alexnails left a comment

🚀

Use [n] [Title](url) format (matching 2026-05-28-mori), making the
descriptive title the link and dropping the em-dash + duplicated raw URL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread blog/2026-06-11-ling-2-6-tpu.md Outdated

#### Activation quantization

V2 quantizes activations from bf16 to fp8 before scatter, directly halving the routed token payload. On Ling-2.6-1T with a 16,384-token prefill, the in-kernel scatter stage falls from **1.39 ms** to **0.65 ms**.

Is this a pure gain or with trade-offs?

Author

Good question. The activation quantization is dynamic (per-token fp8), which Ling-2.6-1T already supports, and we saw no accuracy regression in our evaluations.
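For readers following this thread, dynamic per-token fp8 quantization can be sketched roughly as below. This is a pure-Python simulation, not the Pallas kernel: the e4m3 format choice, the 4-byte per-token scale, the example values, and the helper names (`round_to_e4m3`, `quantize_per_token`, `dequantize`, `payload_bytes`) are all assumptions made for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def round_to_e4m3(x):
    """Round x onto a simulated e4m3 grid (3 mantissa bits, saturating)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    # frexp gives |x| = m * 2**e with m in [0.5, 1), so floor(log2|x|) = e - 1.
    # Clamp at -6, the smallest normal exponent, to model subnormal spacing.
    exponent = max(math.frexp(abs(x))[1] - 1, -6)
    quantum = 2.0 ** (exponent - 3)  # grid spacing with 3 mantissa bits
    return round(x / quantum) * quantum


def quantize_per_token(token):
    """Scale one token so its largest magnitude maps to FP8_E4M3_MAX."""
    amax = max(abs(v) for v in token)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in token], scale


def dequantize(quantized, scale):
    """Undo the per-token scaling on the receiving side."""
    return [v * scale for v in quantized]


def payload_bytes(num_tokens, hidden, bytes_per_elem, scale_bytes=0):
    """Bytes scattered for num_tokens activations plus optional per-token scales."""
    return num_tokens * (hidden * bytes_per_elem + scale_bytes)


if __name__ == "__main__":
    token = [0.1, -2.3, 0.77, 5.9]  # made-up activations for one token
    quantized, scale = quantize_per_token(token)
    print("restored:", dequantize(quantized, scale))
    hidden = 4096  # hypothetical hidden size, not taken from the post
    bf16 = payload_bytes(16384, hidden, 2)
    fp8 = payload_bytes(16384, hidden, 1, scale_bytes=4)
    print("fp8 / bf16 payload ratio:", fp8 / bf16)
```

Because the scale is computed per token at run time, no calibration pass is needed, and the extra scale adds only a few bytes per token on top of the halved activation payload. The reported scatter drop from 1.39 ms to 0.65 ms is about a 2.1× reduction, in line with that halving.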

JamesBrianD and others added 2 commits June 15, 2026 13:08
- Note that activation quantization is dynamic per-token fp8 with no
  observed accuracy regression (addresses review question)
- Introduce "JAX devices (two per v7x chip)" on first use, shorten later
  mentions to "devices"
- "MoE cost" -> "MoE's operational cost" per review nit

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Author: Fu Haolin -> Haolin Fu (given-name-first, consistent with the rest)
- Acknowledgments: YuHong Guo -> Yuhong Guo; add 0xaskr to SGLang-JAX team

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
JamesBrianD changed the title from "Serving Ling-2.6-1T on TPU with SGLang-JAX" to "[blog]Serving Ling-2.6-1T on TPU with SGLang-JAX" on Jun 15, 2026
3 participants