Add Northflank launcher and runner for GPU job execution#456

Closed
Champ-Goblem wants to merge 683 commits into
gpu-mode:mainfrom
nf-testing:feature/northflank-runner

Conversation

@Champ-Goblem


Implement Northflank integration for running kernel benchmarks on managed GPU infrastructure with object storage result delivery.

Files:

  • northflank-runner.py: Container entrypoint that parses compressed config from env vars, executes benchmarks, and uploads results to object storage for retrieval
  • northflank.py: NorthflankLauncher that triggers jobs via REST API, polls for completion, and downloads results from storage
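The "polls for completion" step in the launcher could look roughly like the sketch below. It is not the PR's code: `poll_until_done`, the `get_status` callable, and the state names `SUCCEEDED` and `FAILED` are hypothetical stand-ins, and no real Northflank API call is shown.

```python
import time
from typing import Callable

# Hypothetical terminal state names; the real job API may use different ones.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def poll_until_done(
    get_status: Callable[[], str],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; a launcher would presumably pass a `get_status` that wraps its REST call.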

Features:

  • Configurable repo URL and branch for testing
  • Timeout management based on submission mode
  • Compressed payload encoding for config transfer
  • Environment-based storage configuration
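For readers unfamiliar with the "compressed payload encoding" idea, here is a minimal sketch of one way to do it, assuming the config is a JSON-serializable dict that is zlib-compressed and then base64-encoded so it fits in an environment variable. The function names and the exact encoding are assumptions, not taken from the PR.

```python
import base64
import json
import zlib


def encode_config(config: dict) -> str:
    """Serialize to compact JSON, compress with zlib, and base64-encode to ASCII."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_config(payload: str) -> dict:
    """Reverse encode_config: base64-decode, decompress, and parse the JSON."""
    raw = zlib.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))
```

The launcher side would call `encode_config` before triggering the job, and the container entrypoint would call `decode_config` on the value it reads from its environment.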

ngc92 and others added 30 commits April 15, 2025 14:55
* allow ranked to run multiple benchmarks

* single-benchmark, arithmetic mean, geometric mean scoring

* ruff

* Fix: style

---------

Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
* make get-api-url ephemeral

* Fix: add rstrp

---------

Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
* make get-api-url ephemeral

* Fix: add rstrp

* Fix

---------

Co-authored-by: ngc92 <7938269+ngc92@users.noreply.github.com>
* Fix: enable invalidated submissions

* Fix: correct id
* Fix: log + correct url

* Fix: https
* all of rocm

* cxx=clang++ env
* enable comments in HIP template

* same fix for file extension
* Feat: more logging

* Feat: longer timeout (probably reaching gh token limits)
Mark Saroufim and others added 21 commits February 10, 2026 23:16
* Add optional leaderboard_name filter to admin show-stats

Support filtering stats by a specific leaderboard in both the Discord
command and the API endpoint. Defaults to all leaderboards when omitted.

* Extract _stats_filter helper and qualify column references

Deduplicate JOIN/WHERE/params construction across stats queries into a
shared _stats_filter staticmethod. Alias leaderboard.submission as s in
_generate_submission_stats and qualify all column references.

* Add Query description for leaderboard_name and use flexible assertions

Add FastAPI Query metadata so leaderboard_name shows a description in
OpenAPI docs. Use call_args-based assertions in tests instead of
positional assert_called_once_with to decouple from argument passing style.
…pu-mode#447)

Bumps the uv group with 1 update in the / directory: [sqlparse](https://github.com/andialbrecht/sqlparse).


Updates `sqlparse` from 0.5.3 to 0.5.4
- [Changelog](https://github.com/andialbrecht/sqlparse/blob/master/CHANGELOG)
- [Commits](andialbrecht/sqlparse@0.5.3...0.5.4)

---
updated-dependencies:
- dependency-name: sqlparse
  dependency-version: 0.5.4
  dependency-type: indirect
  dependency-group: uv
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Adds 4 indexes to speed up ranking and user queries:
- runs(submission_id) for ranking JOINs
- submission(leaderboard_id) for leaderboard filtering
- submission(user_id) for user lookups
- runs(submission_id, runner, score) partial index for ranking computation
idx_runs_submission_id and idx_runs_ranking already existed on prod
(as idx_runs_submission_id_covering and idx_runs_leaderboard_valid).
Keep only the two genuinely new indexes.
* Add Helion Docker image with auto-publish workflow

Dockerfile based on modal_runner.py deps, focused on torch + helion.
Other packages (tinygrad, CUTLASS, cupynumeric, etc.) commented out for
easy re-enabling. GitHub Actions workflow builds and pushes to
ghcr.io/gpu-mode/helion-runner on changes to the Dockerfile.

* Bump torch to 2.10.0 in Helion Dockerfile
Switch from nvidia/cuda base to ghcr.io/actions/actions-runner:latest
so the image includes the GitHub Actions runner agent, matching the
AMD runner pattern. Install CUDA 13.1 toolkit from NVIDIA apt repos
and add sudo to all install commands for the runner user.
ensurepip and pip install were running as the runner user, placing uv
in ~/.local/bin which sudo cannot find. Use sudo so uv installs to
/usr/local/bin.
…token

Replace deprecated `Github(token)` with `Github(auth=Auth.Token(token))`
to silence the DeprecationWarning from PyGithub.
add index for query leaderboard ranking
…pu-mode#453)

These are dependencies needed by the gated_deltanet and causal_conv
kernel benchmarks.
Downgrade CUDA toolkit from 13.1 to 13.0 to match the PyTorch cu130
wheel index, and drop the torch==2.10.0 pin which was causing uv to
fall back to a CUDA 12.8 wheel.
uv's build isolation installs torch from default PyPI (cu128) instead
of the cu130 system torch, causing the CUDA version check to fail.
Using --no-build-isolation makes the build use the system torch.
* Add MI355X GPU support for AMD GitHub runner

Add MI355X to GitHubGPU enum, GPU_TO_SM mapping, and github launcher
runner routing with runner label mia1-p02-g29.

* Use amd-runner Docker container for MI355X workflow

Add container image ghcr.io/gpu-mode/amd-runner:main with GPU device
passthrough to amd_workflow.yml. Add numpy to AMD_REQUIREMENTS.

* Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps

- Upgrade ROCm from 6.3.1 to 7.2
- Upgrade PyTorch to nightly rocm7.2
- Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
- Remove UCX, OpenMPI, and rocSHMEM builds (no longer needed)

* Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index

* Fix container permissions: run as root for GitHub Actions compatibility

* Revert "Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index"

This reverts commit bb5f2ee.

* Revert "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

This reverts commit bdc4523.

* Simplify AMD workflow for MI355X: use container deps, skip requirements install

* Reapply "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

This reverts commit e09a2cd.

* Update AMD Dockerfile to ROCm 7.1 stable, latest aiter, remove multi-GPU deps

- Upgrade ROCm from 6.3.1 to 7.1 (stable, matches host ROCm 7.0.1)
- Use stable torch 2.10.0+rocm7.1 instead of nightly
- Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
- Remove UCX, OpenMPI, and rocSHMEM builds

* Use mia1-p02-g29 runner to build AMD Docker image

* Add workspace cleanup step before checkout in AMD Docker build

Fixes EACCES errors from root-owned files left by previous container runs.

* Remove workspace cleanup step from AMD Docker build

* Use GITHUB_TOKEN instead of PUBLISH_TOKEN for ghcr.io login

* Fix Dockerfile for Ubuntu 24.04 (Noble) base image

- Replace python3.10 packages with python3 equivalents
- Use noble ROCm package instead of jammy
- Add --break-system-packages for pip on Noble
- Remove git-core PPA (not needed on Noble)
- Remove linux-headers install (not available during build)

* Remove pip upgrade step (incompatible with Noble system pip)

* Use amd-runner:mi355 Docker image with working aiter + ROCm

* Fix pip install: add --break-system-packages for container environment

* Update amd-docker.Dockerfile

* Set minimum GitHub timeout to DEFAULT_GITHUB_TIMEOUT_MINUTES

Ensures the workflow timeout is at least 30 minutes to account for
Docker image pulls and container initialization on new runners.
Implement Northflank integration for running kernel benchmarks on
managed GPU infrastructure with object storage result delivery.

Files:
- northflank-runner.py: Container entrypoint that parses compressed
  config from env vars, executes benchmarks, and uploads results to
  object storage for retrieval
- northflank.py: NorthflankLauncher that triggers jobs via REST API,
  polls for completion, and downloads results from storage

Features:
- Configurable repo URL and branch for testing
- Timeout management based on submission mode
- Compressed payload encoding for config transfer
- Environment-based storage configuration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Champ-Goblem <cameron@northflank.com>
@Champ-Goblem Champ-Goblem marked this pull request as draft March 4, 2026 21:48
@msaroufim

msaroufim commented Mar 4, 2026

Member

Thanks @Champ-Goblem! This is a good first pass, but a few things are missing before it's ready.

On the launcher itself

  1. In src/kernelbot/main.py and src/libkernelbot/consts.py you still need to add the Northflank GPUs and the Northflank backend, respectively; otherwise I'm not sure this actually tests a Northflank launcher e2e
  2. In src/kernelbot/env.py we want to add the Northflank env variables as well
  3. We are also leaking information about the benchmark infra, so we want to call del os.environ["PAYLOAD"] immediately after the payload is read
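A minimal sketch of point 3, assuming the payload arrives in an environment variable named `PAYLOAD` as mentioned above. The helper name `take_env_secret` is hypothetical.

```python
import os


def take_env_secret(name: str = "PAYLOAD") -> str:
    """Return the variable's value and remove it from os.environ in one step.

    Raises KeyError if the variable is not set.
    """
    return os.environ.pop(name)
```

`os.environ.pop(name)` has the same effect as reading the value and then running `del os.environ[name]`, so any subprocess started afterwards (for example the submitted benchmark) does not inherit the variable.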

Once the new app is up, you should be able to send requests to it via our API; Claude Code has all the right skills in the repo to figure this out. As is, this code isn't testing our launcher and is more of a smoke test, so we need to do something like this

Gaps

  1. A big omission that I think would be key to showcasing the platform is strong resource isolation guarantees: if a node has 8 GPUs, then we need the ability to queue 8 concurrent jobs where each job gets 1/8 of the total CPU cores and RAM. This might be implicit in the machine setup, but I figured it's important enough to discuss here. When I clicked on the runner in the UI, I only saw it mention 1 GPU
  2. Ideally, what I'd really like to see is some self-serve instructions for how we expect to onboard new machines. GitHub runners, for instance, make you wget a script and then run ./run.sh, so ideally this should be as simple
  3. We're not making it clear what dependencies run on the target machine; for instance, with both the AMD and NVIDIA GitHub workflows, we specify a requirements.txt and often a Dockerfile
  4. It's not clear to me what workflow Northflank would be running. For instance, let's say we run 2 concurrent competitions: in the GitHub world we have a new workflow file per hardware target, whereas with the current integration we do lose some flexibility
  5. I believe profiler data is ignored now
  6. Some tests, especially since this is new and I worry we'll break it for you
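To make the arithmetic in gap 1 concrete, here is a small sketch of an even per-GPU split. The `JobShare` type and `per_job_share` helper are hypothetical, and the node size in the usage note is an illustrative assumption, not a description of any real machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobShare:
    gpus: int
    cpu_cores: int
    ram_gib: float


def per_job_share(total_gpus: int, total_cpu_cores: int, total_ram_gib: float) -> JobShare:
    """Split a node evenly so each single-GPU job gets 1/total_gpus of CPU and RAM."""
    if total_gpus <= 0:
        raise ValueError("total_gpus must be positive")
    return JobShare(
        gpus=1,
        cpu_cores=total_cpu_cores // total_gpus,  # whole cores only; any remainder stays unallocated
        ram_gib=total_ram_gib / total_gpus,
    )
```

For an assumed node with 8 GPUs, 128 CPU cores, and 1024 GiB of RAM, `per_job_share(8, 128, 1024.0)` gives each of the 8 concurrent jobs 1 GPU, 16 cores, and 128.0 GiB.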

@msaroufim msaroufim requested review from S1ro1, msaroufim and ngc92 and removed request for S1ro1 March 4, 2026 22:19
@msaroufim msaroufim closed this Jun 15, 2026