fix(kubernetes): terminate idle computing units #6046
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #6046 +/- ##
============================================
- Coverage 56.82% 56.80% -0.03%
Complexity 3023 3023
============================================
Files 1126 1126
Lines 43708 43727 +19
Branches 4733 4737 +4
============================================
Hits 24837 24837
- Misses 17402 17421 +19
Partials 1469 1469
| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 368 | 0.225 | 23,977/73,209/73,209 us | 🔴 +117.5% / 🔴 +385.8% |
| 🟢 | bs=100 sw=10 sl=64 | 926 | 0.565 | 104,775/127,589/127,589 us | 🟢 -20.1% / 🔴 +18.6% |
| 🟢 | bs=1000 sw=10 sl=64 | 1,094 | 0.668 | 918,884/957,384/957,384 us | 🟢 -6.2% / 🟢 -9.6% |
Baseline details
Latest main 104bcc4 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 368 tuples/sec | 433 tuples/sec | 777.62 tuples/sec | -15.0% | -52.7% |
| bs=10 sw=10 sl=64 | MB/s | 0.225 MB/s | 0.264 MB/s | 0.475 MB/s | -14.8% | -52.6% |
| bs=10 sw=10 sl=64 | p50 | 23,977 us | 21,815 us | 12,612 us | +9.9% | +90.1% |
| bs=10 sw=10 sl=64 | p95 | 73,209 us | 33,655 us | 15,070 us | +117.5% | +385.8% |
| bs=10 sw=10 sl=64 | p99 | 73,209 us | 33,655 us | 18,360 us | +117.5% | +298.7% |
| bs=100 sw=10 sl=64 | throughput | 926 tuples/sec | 909 tuples/sec | 988.31 tuples/sec | +1.9% | -6.3% |
| bs=100 sw=10 sl=64 | MB/s | 0.565 MB/s | 0.555 MB/s | 0.603 MB/s | +1.8% | -6.3% |
| bs=100 sw=10 sl=64 | p50 | 104,775 us | 104,560 us | 101,066 us | +0.2% | +3.7% |
| bs=100 sw=10 sl=64 | p95 | 127,589 us | 159,743 us | 107,594 us | -20.1% | +18.6% |
| bs=100 sw=10 sl=64 | p99 | 127,589 us | 159,743 us | 115,830 us | -20.1% | +10.2% |
| bs=1000 sw=10 sl=64 | throughput | 1,094 tuples/sec | 1,079 tuples/sec | 1,019 tuples/sec | +1.4% | +7.3% |
| bs=1000 sw=10 sl=64 | MB/s | 0.668 MB/s | 0.658 MB/s | 0.622 MB/s | +1.5% | +7.4% |
| bs=1000 sw=10 sl=64 | p50 | 918,884 us | 918,865 us | 986,982 us | +0.0% | -6.9% |
| bs=1000 sw=10 sl=64 | p95 | 957,384 us | 1,020,247 us | 1,028,491 us | -6.2% | -6.9% |
| bs=1000 sw=10 sl=64 | p99 | 957,384 us | 1,020,247 us | 1,058,493 us | -6.2% | -9.6% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,543.26,200,128000,368,0.225,23976.79,73208.97,73208.97
1,100,10,64,20,2159.88,2000,1280000,926,0.565,104775.10,127589.32,127589.32
2,1000,10,64,20,18282.56,20000,12800000,1094,0.668,918884.38,957383.59,957383.59
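The derived columns in the CSV can be cross-checked against the raw ones. The sketch below assumes that `tuples_per_sec` is `total_tuples` divided by elapsed seconds, and that `mb_per_sec` uses 1 MB = 1,048,576 bytes. Both assumptions are inferred from the numbers above, not from the benchmark's source, and `derive_rates` is a hypothetical helper name.

```python
# Assumption: the benchmark reports MB as 2**20 bytes (inferred from the data).
BYTES_PER_MB = 1024 * 1024


def derive_rates(total_ms, total_tuples, total_bytes):
    """Return (tuples_per_sec, mb_per_sec) rounded like the CSV columns."""
    seconds = total_ms / 1000.0
    tuples_per_sec = round(total_tuples / seconds)
    mb_per_sec = round(total_bytes / seconds / BYTES_PER_MB, 3)
    return tuples_per_sec, mb_per_sec


# (total_ms, total_tuples, total_bytes) copied from the three CSV rows.
RAW_ROWS = [
    (543.26, 200, 128000),
    (2159.88, 2000, 1280000),
    (18282.56, 20000, 12800000),
]

DERIVED = [derive_rates(*row) for row in RAW_ROWS]
```

Under these assumptions, `DERIVED` reproduces the reported pairs (368, 0.225), (926, 0.565), and (1094, 0.668).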
What changes were proposed in this PR?
This PR adds backend-side cleanup for idle Kubernetes computing units.
The main change is a scheduled cleanup task in the computing unit managing service that periodically scans active Kubernetes computing units and terminates units that have been inactive longer than a configurable timeout.
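The selection step of such a scan can be sketched as follows. This is a minimal illustration in Python, not the code in this PR: the `ComputingUnit` record, its `last_active` field, and `find_idle_units` are hypothetical names, and how the service actually tracks activity is not shown in the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ComputingUnit:
    """Hypothetical record for an active computing unit."""
    cuid: int
    last_active: datetime  # assumed: time of the unit's most recent activity


def find_idle_units(units, now, idle_timeout):
    """Return the units whose inactivity strictly exceeds idle_timeout."""
    return [u for u in units if now - u.last_active > idle_timeout]
```

A cleanup pass would then terminate each returned unit. Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to unit-test.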
The implementation includes the following changes:
- A scheduled task in ComputingUnitManagingService that runs the idle cleanup logic at a fixed interval.
- ComputingUnitManagingResource:

The timeout and check interval are configurable through environment variables, so the behavior can be tuned for different deployment or testing needs without modifying the code.
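One way to wire environment-based settings into a fixed-interval task is sketched below. The variable names used in the usage note, the defaults, and the helpers `read_seconds` and `run_periodically` are assumptions for illustration only; the PR's actual variable names are not given in the description.

```python
import os
import threading


def read_seconds(name, default, env=os.environ):
    """Read a positive integer number of seconds from env, or use default."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {raw!r}")
    return value


def run_periodically(task, interval_seconds, stop_event):
    """Call task every interval_seconds until stop_event is set."""
    # Event.wait returns False on timeout and True once the event is set,
    # so the loop sleeps between runs and exits promptly on shutdown.
    while not stop_event.wait(interval_seconds):
        task()
```

A service could call `read_seconds("IDLE_TIMEOUT_SECONDS", 1800)` and `read_seconds("IDLE_CHECK_INTERVAL_SECONDS", 60)` at startup (both names hypothetical), then run `run_periodically` on a daemon thread with a `threading.Event` that is set during shutdown.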
Any related issues, documentation, discussions?
Fixes #5362
How was this PR tested?
Tested locally on the Kubernetes deployment flow.
fix-idle-CU-demo.mp4
Was this PR authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex GPT-5