
GPU fairshare scheduling in Kubernetes: Time‑Based Fairshare for Balanced AI Clusters
Teams running AI at scale need predictable guarantees without wasting capacity. With GPU fairshare scheduling in Kubernetes, Run:ai virtualizes GPUs and orchestrates them through quotas and dynamic sharing so departments get what they’re owed—and the cluster stays busy [1][2].
Why balanced GPU allocation matters for AI clusters
Balanced allocation ensures teams receive guaranteed GPU access while allowing unused capacity to be redistributed fairly. Run:ai extends Kubernetes scheduling to allocate resources by guarantees first and then distribute excess through fairshare, improving fairness and utilization over time [1][2].
Two-stage allocation: guarantees first, then fairshare
Run:ai assigns GPU resources to departments or projects according to configured guaranteed quotas. Only after these guarantees are met does the scheduler distribute remaining “over‑quota” GPUs. This second stage uses a fairshare calculation so projects can consume excess capacity without violating others’ guarantees [1][2].
- Stage 1: Enforce guaranteed quotas per project or department [1][2].
- Stage 2: Allocate remaining GPUs proportionally based on fairshare, even to tenants without guaranteed quota [1][2].
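The two stages above can be sketched in a few lines of Python. This is an illustrative model, not Run:ai's actual implementation; the function and field names (`quota`, `weight`, `demand`) are hypothetical:

```python
# Sketch of two-stage allocation: guarantees first, then proportional
# fairshare over the remainder. All names are illustrative, not Run:ai APIs.

def allocate(total_gpus, projects):
    """projects: dict name -> {"quota": int, "weight": float, "demand": int}"""
    alloc = {}
    # Stage 1: enforce guaranteed quotas, capped by actual demand.
    for name, p in projects.items():
        alloc[name] = min(p["quota"], p["demand"])
    remaining = total_gpus - sum(alloc.values())

    # Stage 2: split the over-quota GPUs by relative weight among projects
    # that still have unmet demand -- including projects with zero quota.
    hungry = {n: p for n, p in projects.items() if p["demand"] > alloc[n]}
    total_weight = sum(p["weight"] for p in hungry.values())
    for name, p in hungry.items():
        share = int(remaining * p["weight"] / total_weight) if total_weight else 0
        alloc[name] += min(share, p["demand"] - alloc[name])
    return alloc
```

Note that a project with no guaranteed quota at all (quota 0) still picks up over-quota GPUs in stage 2, which matches the fairshare behavior described above.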
GPU fairshare scheduling in Kubernetes: how point‑in‑time scoring works
Fairshare is a point‑in‑time score, recomputed each scheduling cycle from a project's relative weight and its recent usage. The scheduler calculates fairshare separately for each resource type (GPU, CPU, and memory), so fairness emerges across cycles: workloads converge on a proportional share of the remaining pool over time rather than being fixed at any single instant [1][2]. Because the score tracks usage, the scheduler adapts to dynamic demand and continuously rebalances access as consumption shifts [1][2].
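One plausible shape for such a score is weight divided by recent usage: projects that have consumed less relative to their weight score higher and are favored in the next cycle. The formula below is a hypothetical stand-in to show the mechanism, not the documented Run:ai calculation:

```python
# Hypothetical per-resource fairshare score (illustrative only):
# higher weight and lower recent usage -> higher score -> scheduled sooner.

RESOURCES = ("gpu", "cpu", "memory")

def fairshare_scores(weights, usage):
    """weights: name -> relative weight; usage: name -> {resource: recent usage}."""
    scores = {}
    for name, weight in weights.items():
        scores[name] = {
            # +1.0 in the denominator avoids division by zero for idle projects.
            r: weight / (1.0 + usage.get(name, {}).get(r, 0.0))
            for r in RESOURCES
        }
    return scores
```

Recomputing this every cycle is what makes the score "point-in-time": a project that burns GPUs heavily in one cycle sees its score drop, letting lighter consumers catch up in the next.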
Avoid skewed placements with atomic resource allocation
Workloads are scheduled only when all requested resources—GPU, CPU, and memory—can be allocated together. By avoiding partial availability, the scheduler prevents unbalanced placements and ensures that a job starts only when the full resource set is ready on suitable nodes [1].
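The all-or-nothing check can be sketched as a simple node filter. This is a toy model under the assumption that a job's full request must fit on a single node; the real scheduler's placement logic is richer:

```python
# Sketch of atomic resource allocation: a job is placed only on a node
# where GPU, CPU, and memory are ALL available together. Illustrative only.

def find_node(nodes, request):
    """nodes: list of {"gpu": int, "cpu": float, "memory": float} free capacity.
    Returns the index of the first node that fits the FULL request, else None."""
    for i, free in enumerate(nodes):
        if all(free[r] >= request[r] for r in ("gpu", "cpu", "memory")):
            return i  # atomic fit: every requested resource is free at once
    return None  # no partial placement -- the job waits instead
```

Returning `None` rather than grabbing whatever is available is the key point: partially satisfied requests never leave a job half-placed and blocking capacity.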
Fractional GPUs via reservation pods and time-slicing
Run:ai supports GPU fractionalization using internal reservation pods along with node‑level memory splits and NVIDIA time‑slicing. Multiple containers can share a single physical GPU by dividing memory and using time‑based compute sharing, enabling fine‑grained allocation (for example, 0.5 GPU per user) while honoring quotas and fairshare [1][2]. This is particularly useful for keeping utilization high in fragmented clusters where full‑GPU requests might leave stranded capacity [1][2].
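At the bookkeeping level, fractional allocation amounts to packing fractions onto physical GPUs so that no GPU's fractions exceed 1.0 (mirroring the memory split). The sketch below illustrates that invariant only; it does not model the reservation pods or NVIDIA time-slicing mechanics themselves:

```python
# Illustrative fractional-GPU bookkeeping: fractions placed on one
# physical GPU must sum to at most 1.0. Not the Run:ai mechanism itself.

def place_fraction(gpus, fraction):
    """gpus: list of lists of fractions already placed per physical GPU.
    First-fit packs the request onto a GPU with room; returns its index."""
    EPS = 1e-9  # tolerate float rounding when summing fractions
    for i, placed in enumerate(gpus):
        if sum(placed) + fraction <= 1.0 + EPS:
            placed.append(fraction)
            return i
    return None  # no GPU has a large enough free share
```

First-fit packing like this is also why fractional requests help in fragmented clusters: two 0.5-GPU users can share one device that a full-GPU request would have left half idle.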
Workload classes: interactive build vs scalable train
Machine scheduling strategies separate interactive “build” workloads from large‑scale “train” workloads. Build jobs can use fixed, non‑shared GPU quotas for responsive development, while train jobs draw from a shared GPU pool governed by guarantees and fairshare. This separation improves user experience for interactive tasks and supports scalable training with policy‑driven sharing [1][2][3].
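In policy terms, the split means a build job lands in a fixed-quota pool and is never preempted mid-session, while a train job enters the shared pool and accepts reclaim. A minimal sketch, with entirely hypothetical pool names:

```python
# Sketch of routing workload classes to pools. Pool names and the
# Placement type are illustrative, not Run:ai configuration.

from dataclasses import dataclass

@dataclass
class Placement:
    pool: str
    preemptible: bool

def classify(workload_class):
    if workload_class == "build":
        # Interactive development: fixed, non-shared quota; stays responsive.
        return Placement(pool="fixed-quota", preemptible=False)
    # Training: shared pool governed by guarantees and fairshare; reclaimable.
    return Placement(pool="shared-train", preemptible=True)
```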
Preemption, priorities, and reclaim to enforce guarantees
The scheduler can preempt lower‑priority jobs and reclaim GPUs to enforce guarantees and maintain fairness over time. Even projects without guaranteed quota can receive proportional slices of unused capacity, but higher‑priority or guaranteed workloads can reclaim resources as needed to uphold policy [1][2][3].
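Victim selection for reclaim can be sketched as: when a project is below its guarantee, preempt the lowest-priority job belonging to a project that is consuming above its own guarantee. This is a hypothetical policy to illustrate the idea, not Run:ai's documented algorithm:

```python
# Illustrative reclaim policy: free GPUs for an under-guarantee project by
# preempting the lowest-priority job of an over-quota project. Hypothetical.

def pick_victim(jobs, alloc, quotas):
    """jobs: list of {"project": str, "priority": int, "gpus": int}.
    alloc/quotas: project -> current GPUs / guaranteed GPUs."""
    over_quota = [j for j in jobs if alloc[j["project"]] > quotas[j["project"]]]
    if not over_quota:
        return None  # nothing can be reclaimed without breaking a guarantee
    return min(over_quota, key=lambda j: j["priority"])
```

The guard matters: jobs running inside their project's guarantee are never candidates, which is exactly what makes the guarantees enforceable while over-quota capacity stays borrowable.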
Operational considerations and references
- Configure guaranteed quotas and relative weights per project to shape access to shared capacity over time [1][2].
- Observe fairshare behavior per resource type (GPU/CPU/memory) and align requests to reduce fragmentation [1].
- Track preemption events and priorities to verify guarantees are enforced and resources are reclaimed appropriately [1][2][3].
For additional background on core scheduler concepts, consult the Kubernetes scheduler documentation (external): Kubernetes scheduling overview.
To dive deeper into the Run:ai model, see the official scheduler documentation (external): How the Scheduler Works.
Sources
[1] How the Scheduler Works | Self-hosted | Run:ai Documentation
https://run-ai-docs.nvidia.com/self-hosted/platform-management/runai-scheduler/scheduling/how-the-scheduler-works
[2] Workload Management & Orchestration Series: NVIDIA Run:ai
https://www.wwt.com/blog/workload-management-and-orchestration-series-nvidia-runai
[3] Machine Scheduling for AI Workloads on GPUs
https://connect.redhat.com/hydra/prm/v1/business/companies/bf36e6f9100044ef903614234b0f70ad/linked-resources/a3276925d1334c5ba60ff74278d73443/content/public/view