Automatically terminate idle Computing Units to reclaim cluster resources #5362

Description

@kunwp1

Task Summary

During the dkNET-AI launch, we noticed that a computing unit (CU) keeps running after its user leaves the platform without terminating it. Because CUs are per-user compute pods, these idle CUs hold CPU and memory and pin their EKS nodes, causing significant resource underutilization and cost.

We need to (1) define what makes a CU "idle" and (2) add a mechanism that automatically terminates idle CUs.
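To make item (1) concrete, here is one possible definition of "idle", offered as a starting point for discussion rather than a decided design. It assumes the platform can report, per CU, how many workflow executions are in progress and when the last user or execution activity happened. The `CuActivity` type, its field names, and the 30-minute default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CuActivity:
    """Hypothetical activity snapshot for one Computing Unit."""
    cu_id: str
    running_workflows: int      # executions currently in progress on this CU
    last_activity_at: datetime  # last user request or execution event (UTC)


def is_idle(cu: CuActivity, now: datetime,
            idle_timeout: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if the CU has no running work and has been quiet
    for at least idle_timeout. The default timeout is an assumption."""
    if cu.running_workflows > 0:
        # A long-running execution is not idle, even with no user interaction.
        return False
    return now - cu.last_activity_at >= idle_timeout


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
abandoned = CuActivity("cu-1", 0, now - timedelta(hours=2))
busy = CuActivity("cu-2", 1, now - timedelta(hours=2))
recent = CuActivity("cu-3", 0, now - timedelta(minutes=5))
```

With these inputs, only `abandoned` would count as idle. Checking `running_workflows` first keeps a CU alive while a long execution runs unattended, which is probably the case we least want to kill by mistake.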

Based on @chenlica's and my investigation, Kubernetes has no built-in mechanism to terminate a pod for inactivity. It automatically stops pods for health/resource/lifecycle reasons (eviction, OOM, node failure, activeDeadlineSeconds), but never simply because a workload is idle.
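Since Kubernetes will not do this for us, the termination would have to come from our side, for example a periodic check in the service that manages CUs. A minimal sketch of one pass of such a check follows. The function name `reap_idle_units`, the `(cu_id, last_active)` input shape, and the `terminate` callback are hypothetical; the callback stands in for whatever the platform already uses to shut down a CU pod.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Tuple


def reap_idle_units(
    units: Iterable[Tuple[str, datetime]],
    now: datetime,
    idle_timeout: timedelta,
    terminate: Callable[[str], None],
) -> List[str]:
    """Run one reaper pass: terminate every CU whose last activity is
    older than idle_timeout and return the ids that were terminated."""
    terminated = []
    for cu_id, last_active in units:
        if now - last_active >= idle_timeout:
            terminate(cu_id)  # assumed to wrap the platform's existing CU shutdown path
            terminated.append(cu_id)
    return terminated


# Example pass with a fake terminate function that only records calls.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
units = [
    ("cu-a", now - timedelta(hours=3)),     # long abandoned
    ("cu-b", now - timedelta(minutes=2)),   # recently active
]
calls: List[str] = []
result = reap_idle_units(units, now, timedelta(minutes=30), calls.append)
```

Unlike activeDeadlineSeconds, which counts from pod start, this measures time since last activity, so an actively used CU is never cut off however long it has been running.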

Related links:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
https://cloud.google.com/blog/products/containers-kubernetes/scale-to-zero-on-gke-with-keda
https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
https://kubernetes.io/docs/concepts/workloads/controllers/job/

Task Type

  • Refactor / Cleanup
  • DevOps / Deployment / CI
  • Testing / QA
  • Documentation
  • Performance
  • Other
