Add Northflank launcher and runner for GPU job execution by Champ-Goblem · Pull Request #456 · gpu-mode/kernelbot

Champ-Goblem · 2026-03-04T21:46:39Z

Implement Northflank integration for running kernel benchmarks on managed GPU infrastructure with object storage result delivery.

Files:

- northflank-runner.py: Container entrypoint that parses compressed config from env vars, executes benchmarks, and uploads results to object storage for retrieval
- northflank.py: NorthflankLauncher that triggers jobs via REST API, polls for completion, and downloads results from storage
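
The "polls for completion" step of the launcher can be pictured with a minimal sketch. The PR's actual code is not shown on this page, so everything here is an assumption for illustration: the function name `poll_until_done`, the state names in `TERMINAL_STATES`, and the idea that job status arrives as a plain string from a caller-supplied `fetch_status` callable.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real Northflank API may use different names.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def poll_until_done(
    fetch_status: Callable[[], str],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a launcher would pass a `fetch_status` that wraps its REST call.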

Features:

- Configurable repo URL and branch for testing
- Timeout management based on submission mode
- Compressed payload encoding for config transfer
- Environment-based storage configuration
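
One way "compressed payload encoding for config transfer" could work is sketched below. This assumes JSON config, zlib compression, and base64 text encoding so the result fits in an environment variable; the PR's real encoding is not shown here, and `encode_payload` / `decode_payload` are hypothetical names.

```python
import base64
import json
import zlib


def encode_payload(config: dict) -> str:
    """Serialize config to JSON, compress it, and return ASCII-safe base64 text."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_payload(payload: str) -> dict:
    """Reverse encode_payload: base64-decode, decompress, and parse the JSON."""
    raw = zlib.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))
```

In this sketch the launcher side would call `encode_payload` and pass the string to the job as an env var, and the runner entrypoint would call `decode_payload` on it.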

* Add optional leaderboard_name filter to admin show-stats
  Support filtering stats by a specific leaderboard in both the Discord command and the API endpoint. Defaults to all leaderboards when omitted.
* Extract _stats_filter helper and qualify column references
  Deduplicate JOIN/WHERE/params construction across stats queries into a shared _stats_filter staticmethod. Alias leaderboard.submission as s in _generate_submission_stats and qualify all column references.
* Add Query description for leaderboard_name and use flexible assertions
  Add FastAPI Query metadata so leaderboard_name shows a description in OpenAPI docs. Use call_args-based assertions in tests instead of positional assert_called_once_with to decouple from argument passing style.

…pu-mode#447) Bumps the uv group with 1 update in the / directory: [sqlparse](https://github.com/andialbrecht/sqlparse). Updates `sqlparse` from 0.5.3 to 0.5.4 - [Changelog](https://github.com/andialbrecht/sqlparse/blob/master/CHANGELOG) - [Commits](andialbrecht/sqlparse@0.5.3...0.5.4) --- updated-dependencies: - dependency-name: sqlparse dependency-version: 0.5.4 dependency-type: indirect dependency-group: uv ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Adds 4 indexes to speed up ranking and user queries:
- runs(submission_id) for ranking JOINs
- submission(leaderboard_id) for leaderboard filtering
- submission(user_id) for user lookups
- runs(submission_id, runner, score) partial index for ranking computation

* Add Helion Docker image with auto-publish workflow
  Dockerfile based on modal_runner.py deps, focused on torch + helion. Other packages (tinygrad, CUTLASS, cupynumeric, etc.) commented out for easy re-enabling. GitHub Actions workflow builds and pushes to ghcr.io/gpu-mode/helion-runner on changes to the Dockerfile.
* Bump torch to 2.10.0 in Helion Dockerfile

Switch from nvidia/cuda base to ghcr.io/actions/actions-runner:latest so the image includes the GitHub Actions runner agent, matching the AMD runner pattern. Install CUDA 13.1 toolkit from NVIDIA apt repos and add sudo to all install commands for the runner user.

* Add MI355X GPU support for AMD GitHub runner
  Add MI355X to GitHubGPU enum, GPU_TO_SM mapping, and github launcher runner routing with runner label mia1-p02-g29.
* Use amd-runner Docker container for MI355X workflow
  Add container image ghcr.io/gpu-mode/amd-runner:main with GPU device passthrough to amd_workflow.yml. Add numpy to AMD_REQUIREMENTS.
* Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps
  - Upgrade ROCm from 6.3.1 to 7.2
  - Upgrade PyTorch to nightly rocm7.2
  - Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
  - Remove UCX, OpenMPI, and rocSHMEM builds (no longer needed)
* Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index
* Fix container permissions: run as root for GitHub Actions compatibility
* Revert "Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index"
  This reverts commit bb5f2ee.
* Revert "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"
  This reverts commit bdc4523.
* Simplify AMD workflow for MI355X: use container deps, skip requirements install
* Reapply "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"
  This reverts commit e09a2cd.
* Update AMD Dockerfile to ROCm 7.1 stable, latest aiter, remove multi-GPU deps
  - Upgrade ROCm from 6.3.1 to 7.1 (stable, matches host ROCm 7.0.1)
  - Use stable torch 2.10.0+rocm7.1 instead of nightly
  - Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
  - Remove UCX, OpenMPI, and rocSHMEM builds
* Use mia1-p02-g29 runner to build AMD Docker image
* Add workspace cleanup step before checkout in AMD Docker build
  Fixes EACCES errors from root-owned files left by previous container runs.
* Remove workspace cleanup step from AMD Docker build
* Use GITHUB_TOKEN instead of PUBLISH_TOKEN for ghcr.io login
* Fix Dockerfile for Ubuntu 24.04 (Noble) base image
  - Replace python3.10 packages with python3 equivalents
  - Use noble ROCm package instead of jammy
  - Add --break-system-packages for pip on Noble
  - Remove git-core PPA (not needed on Noble)
  - Remove linux-headers install (not available during build)
* Remove pip upgrade step (incompatible with Noble system pip)
* Use amd-runner:mi355 Docker image with working aiter + ROCm
* Fix pip install: add --break-system-packages for container environment
* Update amd-docker.Dockerfile
* Set minimum GitHub timeout to DEFAULT_GITHUB_TIMEOUT_MINUTES
  Ensures the workflow timeout is at least 30 minutes to account for Docker image pulls and container initialization on new runners.
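
The minimum-timeout rule in the last commit message above amounts to a simple clamp. The sketch below is an illustration only: the constant name and the 30-minute value come from the commit message, while the function `effective_timeout_minutes` is a hypothetical helper and not necessarily how the repository implements it.

```python
# Value taken from the commit message ("at least 30 minutes").
DEFAULT_GITHUB_TIMEOUT_MINUTES = 30


def effective_timeout_minutes(requested_minutes: int) -> int:
    """Never go below the default, so image pulls and container startup fit in the budget."""
    return max(requested_minutes, DEFAULT_GITHUB_TIMEOUT_MINUTES)
```

With this rule a short requested timeout is raised to the floor, while a longer one passes through unchanged.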


msaroufim · 2026-03-04T22:19:39Z

Thanks @Champ-Goblem! This is a good first pass, but a few things are still missing before it is ready.

On the launcher itself

- In src/kernelbot/main.py and src/libkernelbot/consts.py you still need to add the Northflank GPUs and the Northflank backend, respectively; otherwise I'm not sure this actually tests a Northflank launcher e2e.
- In src/kernelbot/env.py we also want to add the Northflank env variables.
- We are also leaking information about the benchmark infra; we want to call del os.environ["PAYLOAD"] immediately after the payload is read.
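
The last point can be sketched as follows. The variable name `PAYLOAD` comes from the comment above; the helper `take_payload` is a hypothetical name, and using `os.environ.pop` (read and delete in one step) is one possible way to satisfy the request, not necessarily what the PR ends up doing.

```python
import os


def take_payload(env_var: str = "PAYLOAD") -> str:
    """Return the payload and remove it from this process's environment.

    Child processes started afterwards (for example the benchmarked
    submission) will not inherit the variable.
    """
    try:
        # pop() reads and deletes in one step, so there is no window in
        # which the value has been read but is still set.
        return os.environ.pop(env_var)
    except KeyError:
        raise RuntimeError(f"{env_var} is not set") from None
```

The runner entrypoint would call `take_payload()` once at startup, before launching any user code.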

Once the new app is up, you should be able to send requests to it via our API; Claude Code has all the right skills in the repo to figure this out. As is, this code isn't testing our launcher and is more of a smoke test, so we need to do something like this

Gaps

- A big omission, and one I think would be key to showcasing the platform, is strong resource isolation guarantees: if a node has 8 GPUs, we need to be able to queue 8 concurrent jobs where each job gets 1/8 of the total CPU cores and RAM. This might be implicit in the machine setup, but I figured it's important enough to discuss here. When I clicked on the runner in the UI, I only saw it mention 1 GPU.
- Ideally I'd like to see self-serve instructions for how we expect to onboard new machines. GitHub runners, for instance, make you wget a script and then run ./run.sh, so ideally this should be as simple.
- We're not making it clear what dependencies run on the target machine. For instance, for both the AMD and NVIDIA GitHub workflows we specify a requirements.txt and often a Dockerfile.
- It's not clear to me what workflow Northflank would be running. For instance, say we run 2 concurrent competitions: in the GitHub world we have a new workflow file per hardware target, whereas with the current integration we lose some flexibility.
- I believe profiler data is ignored now.
- Some tests, especially since this is new; I worry we'll break it for you.
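
To make the resource isolation request concrete, here is a minimal sketch of splitting a node evenly across its GPUs. It only computes the per-job share; enforcing it (cgroups, container limits, scheduler settings) is out of scope. The names `JobSlice` and `split_node` and the example node size are assumptions for illustration, not part of the PR or of Northflank's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JobSlice:
    gpu_index: int   # the single GPU assigned to this job
    cpu_cores: int   # CPU cores reserved for this job
    ram_gib: int     # RAM reserved for this job, in GiB


def split_node(num_gpus: int, cpu_cores: int, ram_gib: int) -> list[JobSlice]:
    """Give each GPU's job an equal 1/num_gpus share of CPU and RAM (rounded down)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return [
        JobSlice(gpu_index=i, cpu_cores=cpu_cores // num_gpus, ram_gib=ram_gib // num_gpus)
        for i in range(num_gpus)
    ]
```

For a hypothetical node with 8 GPUs, 64 cores, and 512 GiB of RAM, `split_node(8, 64, 512)` yields eight slices of 8 cores and 64 GiB each, one per GPU.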

ngc92 and others added 30 commits April 15, 2025 14:55

fix error handling: KernelBotError should give a more specific error message (gpu-mode#236)

2014f7d

fix message (gpu-mode#237)

477ac1a

Geomean ranking (gpu-mode#238)

0d8d16d

* allow ranked to run multiple benchmarks * single-benchmark, arithmetic mean, geometric mean scoring * ruff * Fix: style --------- Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>

Fix: proper error propagation on submit

5386378

Fix: simplify gpu selection in GH Runner

9999a7b

Fix: style

e266675

enable HIP templates (gpu-mode#240)

b1a216d

make get-api-url ephemeral (gpu-mode#241)

b13254a

* make get-api-url ephemeral * Fix: add rstrp --------- Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>

Fix/reset works (gpu-mode#242)

1911279

* make get-api-url ephemeral * Fix: add rstrp * Fix --------- Co-authored-by: ngc92 <7938269+ngc92@users.noreply.github.com>

Fix: enable invalidated submissions (gpu-mode#243)

18e4b4d

* Fix: enable invalidated submissions * Fix: correct id

Fix: remove schedule cleanup (gpu-mode#244)

48a317c

Fix (gpu-mode#245)

17a6dba

Fix: log + correct url (gpu-mode#246)

f9f293d

* Fix: log + correct url * Fix: https

Feat: reenable validation (gpu-mode#247)

8bbebb7

Add all of rocm. (gpu-mode#250)

ac06ca1

* all of rocm * cxx=clang++ env

only trigger one github job at a time

669233e

avoid duplicate newlines

c3c247e

don't attempt to send empty message

39ce395

set correct db

e902433

prevent crash when number is missing

ebc7ff4

more robust leaderboard update: do not crash if forum thread no longer exists

3d8ca4f

only score if tests are passed

bb774da

add missing commit call

abbae02

allow force-updating a LB

9b798eb

allow printing messages for passed tests

365879e

Feat: streaming response (gpu-mode#249)

e98f257

check for existence of starter_message (gpu-mode#253)

91b4726

enable comments in HIP template

e3396da

Ngc92/fix ext (gpu-mode#255)

7c000bf

* enable comments in HIP template * same fix for file extension

Feat: more logging (gpu-mode#257)

eb5a904

* Feat: more logging * Feat: longer timeout (probably reaching gh token limits)

Mark Saroufim and others added 21 commits February 10, 2026 23:16

Pass HF_TOKEN to model workflow

862914d

Add /health endpoint to API (gpu-mode#446)

87e0888

Update API base URL environment variable

1fccae6

debug

7cbd647

fix uri

b9ee67b

Remove duplicate indexes from migration

114fb1c

idx_runs_submission_id and idx_runs_ranking already existed on prod (as idx_runs_submission_id_covering and idx_runs_leaderboard_valid). Keep only the two genuinely new indexes.

Remove unnecessary index-url for PyTorch installation

97275b5

Add Northflank to acknowledgements section

b056078

Fix uv not found: install pip and uv as root

cfb95f3

ensurepip and pip install were running as the runner user, placing uv in ~/.local/bin which sudo cannot find. Use sudo so uv installs to /usr/local/bin.

Fix PyGithub deprecation warning: use Auth.Token instead of login_or_token

871c774

Replace deprecated `Github(token)` with `Github(auth=Auth.Token(token))` to silence the DeprecationWarning from PyGithub.

add index for query leaderboard ranking (gpu-mode#451)

a0e0094

add index for query leaderboard ranking

Add flash-linear-attention and causal-conv1d to Helion Docker image (gpu-mode#453)

201d25b

These are dependencies needed by the gated_deltanet and causal_conv kernel benchmarks.

Fix CUDA version mismatch in Helion Docker image

5ce7b4a

Downgrade CUDA toolkit from 13.1 to 13.0 to match the PyTorch cu130 wheel index, and drop the torch==2.10.0 pin which was causing uv to fall back to a CUDA 12.8 wheel.

Fix causal-conv1d build: use --no-build-isolation to avoid CUDA mismatch

87f3db8

uv's build isolation installs torch from default PyPI (cu128) instead of the cu130 system torch, causing the CUDA version check to fail. Using --no-build-isolation makes the build use the system torch.

Champ-Goblem marked this pull request as draft March 4, 2026 21:48

msaroufim requested review from S1ro1, msaroufim and ngc92 and removed request for S1ro1 March 4, 2026 22:19

msaroufim closed this Jun 15, 2026

msaroufim force-pushed the main branch from ac6a635 to e97bee8 Compare June 15, 2026 04:56

Champ-Goblem wants to merge 683 commits into gpu-mode:main from nf-testing:feature/northflank-runner

13 participants
