Add Northflank launcher and runner for GPU job execution #456
Closed
Champ-Goblem wants to merge 683 commits into
* allow ranked to run multiple benchmarks
* single-benchmark, arithmetic mean, geometric mean scoring
* ruff
* Fix: style

---------

Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
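The three scoring modes named in this commit can be illustrated with a minimal sketch. The function name `aggregate_score` and the mode strings are hypothetical and are not taken from the repository; the sketch only shows how a single-benchmark score, an arithmetic mean, and a geometric mean differ when a ranked run covers several benchmarks.

```python
import math
from typing import Sequence


def aggregate_score(times: Sequence[float], mode: str = "geometric") -> float:
    """Combine per-benchmark timings into one score (hypothetical helper)."""
    if not times:
        raise ValueError("at least one benchmark result is required")
    if mode == "single":
        # Single-benchmark scoring: exactly one result is expected.
        if len(times) != 1:
            raise ValueError("single mode expects exactly one benchmark")
        return times[0]
    if mode == "arithmetic":
        return sum(times) / len(times)
    if mode == "geometric":
        # The geometric mean is only defined for positive values.
        if any(t <= 0 for t in times):
            raise ValueError("geometric mean requires positive timings")
        return math.exp(sum(math.log(t) for t in times) / len(times))
    raise ValueError(f"unknown scoring mode: {mode}")
```

The geometric mean is less dominated by one slow benchmark than the arithmetic mean, which is a common reason to offer both.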
* make get-api-url ephemeral
* Fix: add rstrp

---------

Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>

* make get-api-url ephemeral
* Fix: add rstrp
* Fix

---------

Co-authored-by: ngc92 <7938269+ngc92@users.noreply.github.com>

* Fix: enable invalidated submissions
* Fix: correct id

* Fix: log + correct url
* Fix: https

* all of rocm
* cxx=clang++ env

* enable comments in HIP template
* same fix for file extension

* Feat: more logging
* Feat: longer timeout (probably reaching gh token limits)
* Add optional leaderboard_name filter to admin show-stats

  Support filtering stats by a specific leaderboard in both the Discord command and the API endpoint. Defaults to all leaderboards when omitted.

* Extract _stats_filter helper and qualify column references

  Deduplicate JOIN/WHERE/params construction across stats queries into a shared _stats_filter staticmethod. Alias leaderboard.submission as s in _generate_submission_stats and qualify all column references.

* Add Query description for leaderboard_name and use flexible assertions

  Add FastAPI Query metadata so leaderboard_name shows a description in OpenAPI docs. Use call_args-based assertions in tests instead of positional assert_called_once_with to decouple from argument passing style.
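A rough sketch of what a shared filter helper of this kind might look like is given below. It is not the repository's `_stats_filter`: the return shape, the `lb` alias, the `leaderboard.leaderboard` table name, and the `%s` placeholder style are all assumptions made for illustration. Only the `s` alias for `leaderboard.submission` comes from the commit message.

```python
from typing import Optional, Tuple


def stats_filter(leaderboard_name: Optional[str] = None) -> Tuple[str, str, tuple]:
    """Return (join_sql, where_sql, params) for an optional leaderboard filter.

    Hypothetical helper: table and column names other than the `s` alias
    are assumptions, not taken from the actual schema.
    """
    if leaderboard_name is None:
        # No filter requested: stats cover all leaderboards.
        return "", "", ()
    join_sql = "JOIN leaderboard.leaderboard lb ON s.leaderboard_id = lb.id"
    where_sql = "AND lb.name = %s"
    return join_sql, where_sql, (leaderboard_name,)


def build_count_query(leaderboard_name: Optional[str] = None) -> Tuple[str, tuple]:
    """Example caller that reuses the shared fragments in one stats query."""
    join_sql, where_sql, params = stats_filter(leaderboard_name)
    query = (
        "SELECT COUNT(*) FROM leaderboard.submission s "
        f"{join_sql} WHERE TRUE {where_sql}"
    )
    return query, params
```

Keeping the leaderboard name in `params` rather than interpolating it into the SQL string leaves quoting to the database driver.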
…pu-mode#447)

Bumps the uv group with 1 update in the / directory: [sqlparse](https://github.com/andialbrecht/sqlparse).

Updates `sqlparse` from 0.5.3 to 0.5.4
- [Changelog](https://github.com/andialbrecht/sqlparse/blob/master/CHANGELOG)
- [Commits](andialbrecht/sqlparse@0.5.3...0.5.4)

---
updated-dependencies:
- dependency-name: sqlparse
  dependency-version: 0.5.4
  dependency-type: indirect
  dependency-group: uv
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Adds 4 indexes to speed up ranking and user queries:
- runs(submission_id) for ranking JOINs
- submission(leaderboard_id) for leaderboard filtering
- submission(user_id) for user lookups
- runs(submission_id, runner, score) partial index for ranking computation
idx_runs_submission_id and idx_runs_ranking already existed on prod (as idx_runs_submission_id_covering and idx_runs_leaderboard_valid). Keep only the two genuinely new indexes.
* Add Helion Docker image with auto-publish workflow

  Dockerfile based on modal_runner.py deps, focused on torch + helion. Other packages (tinygrad, CUTLASS, cupynumeric, etc.) commented out for easy re-enabling. GitHub Actions workflow builds and pushes to ghcr.io/gpu-mode/helion-runner on changes to the Dockerfile.

* Bump torch to 2.10.0 in Helion Dockerfile
Switch from nvidia/cuda base to ghcr.io/actions/actions-runner:latest so the image includes the GitHub Actions runner agent, matching the AMD runner pattern. Install CUDA 13.1 toolkit from NVIDIA apt repos and add sudo to all install commands for the runner user.
ensurepip and pip install were running as the runner user, placing uv in ~/.local/bin which sudo cannot find. Use sudo so uv installs to /usr/local/bin.
…token Replace deprecated `Github(token)` with `Github(auth=Auth.Token(token))` to silence the DeprecationWarning from PyGithub.
add index for query leaderboard ranking
…pu-mode#453) These are dependencies needed by the gated_deltanet and causal_conv kernel benchmarks.
Downgrade CUDA toolkit from 13.1 to 13.0 to match the PyTorch cu130 wheel index, and drop the torch==2.10.0 pin which was causing uv to fall back to a CUDA 12.8 wheel.
uv's build isolation installs torch from default PyPI (cu128) instead of the cu130 system torch, causing the CUDA version check to fail. Using --no-build-isolation makes the build use the system torch.
* Add MI355X GPU support for AMD GitHub runner

  Add MI355X to GitHubGPU enum, GPU_TO_SM mapping, and github launcher runner routing with runner label mia1-p02-g29.

* Use amd-runner Docker container for MI355X workflow

  Add container image ghcr.io/gpu-mode/amd-runner:main with GPU device passthrough to amd_workflow.yml. Add numpy to AMD_REQUIREMENTS.

* Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps

  - Upgrade ROCm from 6.3.1 to 7.2
  - Upgrade PyTorch to nightly rocm7.2
  - Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
  - Remove UCX, OpenMPI, and rocSHMEM builds (no longer needed)

* Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index

* Fix container permissions: run as root for GitHub Actions compatibility

* Revert "Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index"

  This reverts commit bb5f2ee.

* Revert "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

  This reverts commit bdc4523.

* Simplify AMD workflow for MI355X: use container deps, skip requirements install

* Reapply "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

  This reverts commit e09a2cd.

* Update AMD Dockerfile to ROCm 7.1 stable, latest aiter, remove multi-GPU deps

  - Upgrade ROCm from 6.3.1 to 7.1 (stable, matches host ROCm 7.0.1)
  - Use stable torch 2.10.0+rocm7.1 instead of nightly
  - Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
  - Remove UCX, OpenMPI, and rocSHMEM builds

* Use mia1-p02-g29 runner to build AMD Docker image

* Add workspace cleanup step before checkout in AMD Docker build

  Fixes EACCES errors from root-owned files left by previous container runs.

* Remove workspace cleanup step from AMD Docker build

* Use GITHUB_TOKEN instead of PUBLISH_TOKEN for ghcr.io login

* Fix Dockerfile for Ubuntu 24.04 (Noble) base image

  - Replace python3.10 packages with python3 equivalents
  - Use noble ROCm package instead of jammy
  - Add --break-system-packages for pip on Noble
  - Remove git-core PPA (not needed on Noble)
  - Remove linux-headers install (not available during build)

* Remove pip upgrade step (incompatible with Noble system pip)

* Use amd-runner:mi355 Docker image with working aiter + ROCm

* Fix pip install: add --break-system-packages for container environment

* Update amd-docker.Dockerfile

* Set minimum GitHub timeout to DEFAULT_GITHUB_TIMEOUT_MINUTES

  Ensures the workflow timeout is at least 30 minutes to account for Docker image pulls and container initialization on new runners.
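The minimum-timeout rule in the last commit amounts to clamping the requested value from below. A minimal sketch follows; the constant name and the 30-minute value come from the commit message, while the function name `effective_timeout_minutes` and the round-up behavior are assumptions.

```python
import math

# Value stated in the commit message; the real constant lives in the repository.
DEFAULT_GITHUB_TIMEOUT_MINUTES = 30


def effective_timeout_minutes(requested_minutes: float) -> int:
    """Never return less than the default, so image pulls have time to finish."""
    # Round up so a fractional request is not silently shortened (assumption).
    return max(math.ceil(requested_minutes), DEFAULT_GITHUB_TIMEOUT_MINUTES)
```

For example, a 5-minute request would be raised to 30 minutes, while a 45-minute request would pass through unchanged.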
Implement Northflank integration for running kernel benchmarks on managed GPU infrastructure with object storage result delivery.

Files:
- northflank-runner.py: Container entrypoint that parses compressed config from env vars, executes benchmarks, and uploads results to object storage for retrieval
- northflank.py: NorthflankLauncher that triggers jobs via REST API, polls for completion, and downloads results from storage

Features:
- Configurable repo URL and branch for testing
- Timeout management based on submission mode
- Compressed payload encoding for config transfer
- Environment-based storage configuration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Champ-Goblem <cameron@northflank.com>
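The "compressed payload encoding for config transfer" feature can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual code: the exact encoding (JSON, then zlib, then base64) and the function names `encode_payload` and `decode_payload` are assumed, chosen because the result is an ASCII string that is safe to place in an environment variable.

```python
import base64
import json
import zlib


def encode_payload(config: dict) -> str:
    """Launcher side: serialize, compress, and base64-encode the job config."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_payload(payload: str) -> dict:
    """Runner side: reverse the encoding to recover the job config."""
    raw = zlib.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))
```

In this sketch the launcher would pass the encoded string as an environment variable when triggering the job, and the container entrypoint would read that variable and call `decode_payload` before executing the benchmark.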
Thanks @Champ-Goblem! This is a good first pass, but a few things are missing before it is ready.

On the launcher itself

Once the new app is up, you should be able to send requests to it via our API; Claude Code has all the right skills in the repo to figure this out. As is, this code isn't testing our launcher; it is more of a smoke test, so we need to do something like this

Gaps