Add GPUClusterConfig CRD and controller for DRA-based stack #2571

Open: karthikvetrivel wants to merge 8 commits into main from kv-gpuclusterconfig-crd

Conversation

@karthikvetrivel

Member

1. Overview

We introduce a new CRD named GPUClusterConfig and a new controller for reconciling it. Like ClusterPolicy today, it is a singleton, cluster-scoped CRD that configures the operands needed to enable GPUs in Kubernetes. GPUClusterConfig represents the new DRA-based software-enablement stack; it is an evolution of ClusterPolicy.

Change Log

9f08bec:

  • Defined GPUClusterConfig Go types in api/nvidia/v1alpha1 (cluster-scoped, singleton), with kubebuilder validation and default markers for every operand block. Wired AddToScheme and generated the CRD manifest and deepcopy code.
  • Tested: make manifests generate produces the CRD YAML and deepcopy code; kubectl apply of the CRD succeeds.

73c9d30:

  • Introduced a new controller built on the existing state.Manager / SyncState() engine (the same pattern NVIDIADriver uses), registered in cmd/gpu-operator/main.go.
  • Singleton enforcement (first-wins): a single instance owns reconciliation; any additional instance is marked Ignored and skipped. This mirrors how ClusterPolicy handles duplicates.
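
The first-wins rule can be sketched as follows. This is an illustration only, not the controller's code: the `instance` type, the `pickOwner` helper, and the choice of "oldest creation time, then name" as the meaning of "first" are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// instance is a hypothetical stand-in for a GPUClusterConfig object; only the
// fields needed to pick an owner are modeled.
type instance struct {
	Name        string
	CreatedUnix int64 // creation time in Unix seconds
}

// pickOwner returns the instance that owns reconciliation and the names of the
// instances that would be marked Ignored. Assumption: "first" means the oldest
// creation time, with the name as a deterministic tie-breaker.
func pickOwner(items []instance) (owner string, ignored []string) {
	if len(items) == 0 {
		return "", nil
	}
	// Sort a copy so the caller's slice is not reordered.
	sorted := append([]instance(nil), items...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].CreatedUnix != sorted[j].CreatedUnix {
			return sorted[i].CreatedUnix < sorted[j].CreatedUnix
		}
		return sorted[i].Name < sorted[j].Name
	})
	for _, it := range sorted[1:] {
		ignored = append(ignored, it.Name)
	}
	return sorted[0].Name, ignored
}

func main() {
	owner, ignored := pickOwner([]instance{
		{Name: "late-duplicate", CreatedUnix: 200},
		{Name: "gpu-cluster-config", CreatedUnix: 100},
	})
	fmt.Printf("owner=%s ignored=%v\n", owner, ignored)
}
```

A deterministic tie-breaker matters here: without one, two reconcile passes could disagree about which instance is the owner.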

ccc0f7a:

  • Added cmd/dra-driver-validator, the init container binary for the DRA kubelet-plugin DaemonSet. It runs before the gpus and computeDomains containers start, validates that the NVIDIA driver is installed, and writes /run/nvidia/validations/driver-ready with the two env vars the kubelet-plugin containers source on startup (NVIDIA_DRIVER_ROOT, DRIVER_ROOT_CTR_PATH).
  • Tested: unit tests end-to-end with a fake driver, and against a real NVIDIA 595.58.03 driver on an A100 node.
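
A minimal sketch of the last step the validator performs, writing the readiness file. The file format (one KEY=value pair per line), the write-then-rename approach, and the helper names `renderDriverReady` and `writeDriverReady` are assumptions for illustration; the PR description only names the path and the two variables.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// renderDriverReady builds the file body. Assumption: one KEY=value pair per
// line, so that a shell in the kubelet-plugin containers can source the file.
func renderDriverReady(driverRoot, driverRootCtrPath string) string {
	return fmt.Sprintf("NVIDIA_DRIVER_ROOT=%s\nDRIVER_ROOT_CTR_PATH=%s\n",
		driverRoot, driverRootCtrPath)
}

// writeDriverReady writes the body to a temporary file in the target directory
// and renames it into place, so a reader never sees a partially written file.
func writeDriverReady(path, driverRoot, driverRootCtrPath string) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return fmt.Errorf("create validations dir: %w", err)
	}
	tmp := path + ".tmp"
	body := renderDriverReady(driverRoot, driverRootCtrPath)
	if err := os.WriteFile(tmp, []byte(body), 0o644); err != nil {
		return fmt.Errorf("write %s: %w", tmp, err)
	}
	return os.Rename(tmp, path)
}

func main() {
	// A temporary directory stands in for /run/nvidia/validations.
	dir, err := os.MkdirTemp("", "validations")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "driver-ready")
	// "/" and "/host" are placeholder values, not taken from the PR.
	if err := writeDriverReady(path, "/", "/host"); err != nil {
		panic(err)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(data))
}
```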

87fa6c0:

  • Added the DRA driver operand to the GPUClusterConfig controller (a new state-dra-driver state with its manifests wired into the state manager).

aff2736:

  • Added a Helm install path for the DRA stack and made it mutually exclusive with ClusterPolicy. Setting gpuClusterConfig.enabled=true renders a new deployments/gpu-operator/templates/gpuclusterconfig.yaml that creates the singleton gpu-cluster-config CR (full draDriver spec exposed via values, reusing the shared hostPaths/daemonsets), and gates the clusterpolicy.yaml template off so the chart never deploys both paths at once.
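
The mutual-exclusion gate can be illustrated with Go's text/template package, which Helm builds on. The two template bodies below are simplified stand-ins, not the chart's real templates, and the exact conditions used in the chart are an assumption; only the gpuClusterConfig.enabled value comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Simplified stand-ins for gpuclusterconfig.yaml and clusterpolicy.yaml.
// Assumption: each template is gated on the same value with opposite polarity.
const gpuClusterConfigTmpl = `{{- if .Values.gpuClusterConfig.enabled }}
kind: GPUClusterConfig
{{- end }}`

const clusterPolicyTmpl = `{{- if not .Values.gpuClusterConfig.enabled }}
kind: ClusterPolicy
{{- end }}`

// renderKinds renders both templates against the given setting and returns the
// non-empty outputs, so the caller can see which manifests the chart would emit.
func renderKinds(enabled bool) ([]string, error) {
	data := map[string]any{
		"Values": map[string]any{
			"gpuClusterConfig": map[string]any{"enabled": enabled},
		},
	}
	var rendered []string
	for _, src := range []string{gpuClusterConfigTmpl, clusterPolicyTmpl} {
		tmpl, err := template.New("manifest").Parse(src)
		if err != nil {
			return nil, err
		}
		var sb strings.Builder
		if err := tmpl.Execute(&sb, data); err != nil {
			return nil, err
		}
		if out := strings.TrimSpace(sb.String()); out != "" {
			rendered = append(rendered, out)
		}
	}
	return rendered, nil
}

func main() {
	for _, enabled := range []bool{false, true} {
		out, err := renderKinds(enabled)
		if err != nil {
			panic(err)
		}
		fmt.Printf("enabled=%v -> %v\n", enabled, out)
	}
}
```

Because both gates read the same value, exactly one of the two manifests is rendered for any setting.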

Moved from #2513 (re-created with the head branch on NVIDIA/gpu-operator instead of a fork, to enable stacked PRs). The earlier review discussion, including the GPUClusterConfig → GPUCluster naming suggestion, lives in #2513.

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Comment thread deployments/gpu-operator/values.yaml Outdated
Comment thread deployments/gpu-operator/templates/cleanup_crd.yaml
# NVIDIADriver CR). GPUClusterConfig does not manage the driver or device plugin
# itself; it waits for driver readiness before deploying the DRA driver.
gpuClusterConfig:
  enabled: false

Contributor

We should think a bit more on the right interface for this. A few questions come to mind:

  1. Is enabled the right name for this field? As currently implemented, setting gpuClusterConfig.enabled=true will create a default GPUClusterConfig CR and will NOT create a default ClusterPolicy CR when the helm chart gets rendered. This may not be clear to the user.
  2. Should the draDriver struct be embedded under the top-level gpuClusterConfig struct?

Comment thread api/nvidia/v1alpha1/gpuclusterconfig_types.go Outdated
Comment thread api/nvidia/v1alpha1/gpuclusterconfig_types.go Outdated
Comment thread api/nvidia/v1alpha1/gpuclusterconfig_types.go Outdated
Comment thread internal/state/dra_driver.go Outdated
Comment thread manifests/state-dra-driver/0500_daemonset.yaml Outdated
affinity: {{ .KubeletPluginAffinity | toJson }}
{{- else }}
affinity:
  nodeAffinity:

Contributor

Question -- should we add a nodeAntiAffinity rule here to prevent the kubelet-plugin from running on a node where the k8s-device-plugin is running? E.g. don't run on nodes labeled with nvidia.com/gpu.deploy.device-plugin=true
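
For context on this question: Kubernetes has no separate nodeAntiAffinity field, so a rule like this is normally written as a nodeAffinity match expression with the NotIn operator. The sketch below models only the matching semantics of NotIn on plain maps; the `requirement` type and `matchesNotIn` helper are hypothetical and are not Kubernetes API types.

```go
package main

import "fmt"

// requirement is a hypothetical, trimmed-down model of a node selector
// requirement with the NotIn operator (key plus a set of excluded values).
type requirement struct {
	Key    string
	Values []string
}

// matchesNotIn reports whether a node with the given labels satisfies the
// requirement. With NotIn, a node matches when the label is absent or when
// its value is not one of the listed values.
func matchesNotIn(labels map[string]string, r requirement) bool {
	v, ok := labels[r.Key]
	if !ok {
		return true
	}
	for _, excluded := range r.Values {
		if v == excluded {
			return false
		}
	}
	return true
}

func main() {
	// Exclude nodes where the device plugin is deployed.
	req := requirement{
		Key:    "nvidia.com/gpu.deploy.device-plugin",
		Values: []string{"true"},
	}
	nodes := map[string]map[string]string{
		"device-plugin-node": {"nvidia.com/gpu.deploy.device-plugin": "true"},
		"dra-node":           {"nvidia.com/gpu.deploy.device-plugin": "false"},
		"unlabeled-node":     {},
	}
	for _, name := range []string{"device-plugin-node", "dra-node", "unlabeled-node"} {
		fmt.Printf("%s schedulable=%v\n", name, matchesNotIn(nodes[name], req))
	}
}
```

Note that unlabeled nodes also match a NotIn rule, so such a rule would typically be combined with the existing positive node selection for the kubelet-plugin.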

Comment thread manifests/state-dra-driver/0500_daemonset.yaml Outdated
{{- else }}
deviceClassName: gpu.nvidia.com
allocationMode: All
adminAccess: true

Contributor

Question -- does the GPU Operator namespace have to be labeled with resource.kubernetes.io/admin-access: true for this? From https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access:

Only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.kubernetes.io/admin-access: "true" (case-sensitive) can use the adminAccess field.

@karthikvetrivel Jun 24, 2026

Member Author

Yes, I believe it does (as the link you found mentioned). We already handle this.

// ensureAdminAccessLabel patches the operator namespace with the label required by the
// kube-scheduler to allow adminAccess: true in ResourceClaim/ResourceClaimTemplate
// objects. The label is deliberately never removed: it is namespace-level configuration
// that other adminAccess consumers in the namespace may rely on.
func (s *stateGFD) ensureAdminAccessLabel(ctx context.Context) error {
	ns := &corev1.Namespace{}
	if err := s.client.Get(ctx, client.ObjectKey{Name: s.namespace}, ns); err != nil {
		return fmt.Errorf("could not get namespace %s: %w", s.namespace, err)
	}
	if ns.Labels[draAdminNamespaceLabelKey] == "true" {
		return nil
	}
	patch := client.MergeFrom(ns.DeepCopy())
	if ns.Labels == nil {
		ns.Labels = make(map[string]string)
	}
	ns.Labels[draAdminNamespaceLabelKey] = "true"
	return s.client.Patch(ctx, ns, patch)
}

As it exists right now, the label isn't pre-baked in; the operator patches it onto its namespace at runtime.
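
The label check itself can be exercised without a cluster. The following is a sketch on plain label maps: the constant value is taken from the Kubernetes documentation quoted earlier in this thread and is assumed to match draAdminNamespaceLabelKey, while `hasAdminAccess` and `ensureAdminAccess` are hypothetical helpers that do not exist in the PR.

```go
package main

import "fmt"

// adminAccessLabelKey is the namespace label named in the Kubernetes DRA
// documentation. Assumption: draAdminNamespaceLabelKey holds the same value.
const adminAccessLabelKey = "resource.kubernetes.io/admin-access"

// hasAdminAccess reports whether the label is set to exactly "true".
// The comparison is case-sensitive, so "True" does not count.
func hasAdminAccess(labels map[string]string) bool {
	return labels[adminAccessLabelKey] == "true"
}

// ensureAdminAccess returns a label set that carries the admin-access label
// and reports whether a change was needed. The input map is never mutated,
// and a nil input is handled like an empty label set.
func ensureAdminAccess(labels map[string]string) (map[string]string, bool) {
	if hasAdminAccess(labels) {
		return labels, false
	}
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out[adminAccessLabelKey] = "true"
	return out, true
}

func main() {
	labels := map[string]string{adminAccessLabelKey: "True"} // wrong case
	updated, changed := ensureAdminAccess(labels)
	fmt.Println(updated[adminAccessLabelKey], changed)

	// A second pass is a no-op, which keeps reconciliation idempotent.
	_, changedAgain := ensureAdminAccess(updated)
	fmt.Println(changedAgain)
}
```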

… CRD job

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

2 participants