fix(kubernetes): terminate idle computing units #6046

Open
yrenat wants to merge 1 commit into apache:main from yrenat:fix/idle-kubernetes-cus

yrenat (Contributor) commented Jul 1, 2026

What changes were proposed in this PR?

This PR adds backend-side cleanup for idle Kubernetes computing units.

The main change is a scheduled cleanup task in the computing unit managing service that periodically scans active Kubernetes computing units and terminates units that have been inactive longer than a configurable timeout.
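For reviewers unfamiliar with the pattern, a fixed-interval background task can be sketched roughly as follows. This is only an illustration built on the JDK's `ScheduledExecutorService`; the object name `IdleCleanupScheduler`, the `start` helper, and its parameters are hypothetical and are not taken from the PR's code.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper (not the actual service code): runs `cleanup` every
// `intervalMs` milliseconds on a single background thread.
object IdleCleanupScheduler {
  def start(intervalMs: Long)(cleanup: () => Unit): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task: Runnable = () =>
      try cleanup()
      catch {
        // scheduleAtFixedRate suppresses all later runs once a run throws,
        // so failures are caught and logged here instead of propagating.
        case e: Exception =>
          System.err.println(s"idle cleanup failed: ${e.getMessage}")
      }
    // First run happens after one full interval, then repeats at that rate.
    scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    scheduler
  }
}
```

The caller keeps the returned scheduler so it can call `shutdownNow()` when the service stops. Catching exceptions inside the task is a deliberate choice in this sketch: one failed scan should not silently disable all later scans.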

The implementation includes the following changes:

  • Added new Kubernetes configuration entries for:
    • computing unit idle timeout
    • computing unit idle check interval
  • Exposed both settings through environment-variable-based configuration so deployment-side overrides can be applied without code changes.
  • Added a scheduled background task in ComputingUnitManagingService that runs the idle cleanup logic at a fixed interval.
  • Added idle Kubernetes computing unit termination logic in ComputingUnitManagingResource:
    • only considers Kubernetes computing units that are not already terminated
    • checks whether the computing unit has any active workflow executions
    • computes the latest execution activity timestamp from existing execution metadata
    • terminates the Kubernetes pod when the computing unit is considered idle past the configured timeout
    • updates the computing unit termination time in the database after cleanup
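The idle check described in the list above could look roughly like the following pure function. It is a minimal sketch under stated assumptions: `UnitSnapshot`, its field names, and `IdlePolicy` are hypothetical and do not mirror the real schema or the code in `ComputingUnitManagingResource`. Falling back to the creation time for a unit that has never run an execution is also an assumption, not something the description states.

```scala
import java.time.{Duration, Instant}

// Hypothetical, simplified view of a computing unit. Field names are
// illustrative only and do not mirror the real database schema.
final case class UnitSnapshot(
    id: Int,
    isKubernetes: Boolean,
    createdAt: Instant,
    terminatedAt: Option[Instant],
    hasActiveExecutions: Boolean,
    lastExecutionActivity: Option[Instant]
)

object IdlePolicy {
  // True when the unit is a live Kubernetes unit with no active executions
  // and its last activity is strictly older than `timeout`.
  def isIdle(unit: UnitSnapshot, now: Instant, timeout: Duration): Boolean = {
    val candidate =
      unit.isKubernetes && unit.terminatedAt.isEmpty && !unit.hasActiveExecutions
    // Assumption: a unit with no recorded executions is measured from creation.
    val lastActive = unit.lastExecutionActivity.getOrElse(unit.createdAt)
    candidate && Duration.between(lastActive, now).compareTo(timeout) > 0
  }

  // Units that a cleanup pass would go on to terminate.
  def selectIdle(units: Seq[UnitSnapshot], now: Instant, timeout: Duration): Seq[UnitSnapshot] =
    units.filter(isIdle(_, now, timeout))
}
```

Keeping the decision free of database and Kubernetes calls makes it easy to unit test, which may help with the patch coverage of the new logic.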

The timeout and check interval are configurable through environment variables, so the behavior can be tuned for different deployment or testing needs without modifying the code.
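As an illustration of environment-based overrides with safe defaults, the lookup could be sketched as below. The variable names (`EXAMPLE_CU_IDLE_TIMEOUT_SECONDS`, `EXAMPLE_CU_IDLE_CHECK_INTERVAL_SECONDS`), the default values, and the `IdleCleanupConfig` object are all hypothetical placeholders; the real keys and defaults live in the PR's configuration files and may be wired through a config library rather than read directly.

```scala
import scala.concurrent.duration._
import scala.util.Try

object IdleCleanupConfig {
  // Hypothetical names and defaults, chosen only for this example.
  val TimeoutVar = "EXAMPLE_CU_IDLE_TIMEOUT_SECONDS"
  val IntervalVar = "EXAMPLE_CU_IDLE_CHECK_INTERVAL_SECONDS"

  // Reads a positive number of seconds from `env`, falling back to `default`
  // when the variable is unset, not a number, or not positive.
  def seconds(env: Map[String, String], name: String, default: FiniteDuration): FiniteDuration =
    env
      .get(name)
      .flatMap(raw => Try(raw.trim.toLong).toOption)
      .filter(_ > 0)
      .map(_.seconds)
      .getOrElse(default)

  // Returns (idle timeout, check interval).
  def load(env: Map[String, String] = sys.env): (FiniteDuration, FiniteDuration) =
    (seconds(env, TimeoutVar, 30.minutes), seconds(env, IntervalVar, 1.minute))
}
```

Passing the environment in as a `Map` keeps the parsing testable; for example, a short timeout can be supplied in a test without touching the process environment.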

Any related issues, documentation, discussions?

Fixes #5362

How was this PR tested?

Tested locally on the Kubernetes deployment flow.

fix-idle-CU-demo.mp4

Was this PR authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex GPT-5

github-actions Bot added the fix, common, and platform (Non-amber Scala service paths) labels Jul 1, 2026

github-actions Bot (Contributor) commented Jul 1, 2026

👋 Thanks for opening this pull request, @yrenat!

It looks like the pull request description doesn't quite follow our template yet:

  • The What changes were proposed in this PR? section is missing; please keep the template's headings.
  • The How was this PR tested? section is missing; please keep the template's headings.
  • The Was this PR authored or co-authored using generative AI tooling? section is missing; please keep the template's headings.

Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed.

You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow.

github-actions Bot (Contributor) commented Jul 1, 2026

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Ma77Ball, @aicam
    You can notify them by mentioning @Ma77Ball, @aicam in a comment.

codecov-commenter commented Jul 1, 2026

Codecov Report

❌ Patch coverage is 0% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.80%. Comparing base (1a58433) to head (1b66a97).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../texera/service/ComputingUnitManagingService.scala | 0.00% | 11 Missing ⚠️ |
| ...rvice/resource/ComputingUnitManagingResource.scala | 0.00% | 8 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6046      +/-   ##
============================================
- Coverage     56.82%   56.80%   -0.03%     
  Complexity     3023     3023              
============================================
  Files          1126     1126              
  Lines         43708    43727      +19     
  Branches       4733     4737       +4     
============================================
  Hits          24837    24837              
- Misses        17402    17421      +19     
  Partials       1469     1469              

| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 70.00% <ø> (ø) | |
| agent-service | 44.59% <ø> (ø) | Carriedforward from 1a58433 |
| amber | 58.64% <ø> (ø) | Carriedforward from 1a58433 |
| computing-unit-managing-service | 0.00% <0.00%> (ø) | |
| config-service | 52.30% <ø> (ø) | |
| file-service | 62.81% <ø> (ø) | |
| frontend | 49.97% <ø> (ø) | Carriedforward from 1a58433 |
| notebook-migration-service | 78.57% <ø> (ø) | |
| pyamber | 90.20% <ø> (ø) | Carriedforward from 1a58433 |
| python | 90.76% <ø> (ø) | Carriedforward from 1a58433 |
| workflow-compiling-service | 55.14% <ø> (ø) | |

*This pull request uses carry forward flags.


github-actions Bot (Contributor) commented Jul 1, 2026

⚠️ Benchmark changes need a look

🟢 4 better · 🔴 5 worse · ⚪ 6 noise (<±5%) · 0 without baseline

Compared against main 104bcc4 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

| config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 (us) | max Δ latest / 7d |
|---|---|---|---|---|
| 🔴 bs=10 sw=10 sl=64 | 368 | 0.225 | 23,977/73,209/73,209 | 🔴 +117.5% / 🔴 +385.8% |
| 🟢 bs=100 sw=10 sl=64 | 926 | 0.565 | 104,775/127,589/127,589 | 🟢 -20.1% / 🔴 +18.6% |
| 🟢 bs=1000 sw=10 sl=64 | 1,094 | 0.668 | 918,884/957,384/957,384 | 🟢 -6.2% / 🟢 -9.6% |

Baseline details

Latest main 104bcc4 from same runner

| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 368 tuples/sec | 433 tuples/sec | 777.62 tuples/sec | -15.0% | -52.7% |
| bs=10 sw=10 sl=64 | MB/s | 0.225 MB/s | 0.264 MB/s | 0.475 MB/s | -14.8% | -52.6% |
| bs=10 sw=10 sl=64 | p50 | 23,977 us | 21,815 us | 12,612 us | +9.9% | +90.1% |
| bs=10 sw=10 sl=64 | p95 | 73,209 us | 33,655 us | 15,070 us | +117.5% | +385.8% |
| bs=10 sw=10 sl=64 | p99 | 73,209 us | 33,655 us | 18,360 us | +117.5% | +298.7% |
| bs=100 sw=10 sl=64 | throughput | 926 tuples/sec | 909 tuples/sec | 988.31 tuples/sec | +1.9% | -6.3% |
| bs=100 sw=10 sl=64 | MB/s | 0.565 MB/s | 0.555 MB/s | 0.603 MB/s | +1.8% | -6.3% |
| bs=100 sw=10 sl=64 | p50 | 104,775 us | 104,560 us | 101,066 us | +0.2% | +3.7% |
| bs=100 sw=10 sl=64 | p95 | 127,589 us | 159,743 us | 107,594 us | -20.1% | +18.6% |
| bs=100 sw=10 sl=64 | p99 | 127,589 us | 159,743 us | 115,830 us | -20.1% | +10.2% |
| bs=1000 sw=10 sl=64 | throughput | 1,094 tuples/sec | 1,079 tuples/sec | 1,019 tuples/sec | +1.4% | +7.3% |
| bs=1000 sw=10 sl=64 | MB/s | 0.668 MB/s | 0.658 MB/s | 0.622 MB/s | +1.5% | +7.4% |
| bs=1000 sw=10 sl=64 | p50 | 918,884 us | 918,865 us | 986,982 us | +0.0% | -6.9% |
| bs=1000 sw=10 sl=64 | p95 | 957,384 us | 1,020,247 us | 1,028,491 us | -6.2% | -6.9% |
| bs=1000 sw=10 sl=64 | p99 | 957,384 us | 1,020,247 us | 1,058,493 us | -6.2% | -9.6% |

Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,543.26,200,128000,368,0.225,23976.79,73208.97,73208.97
1,100,10,64,20,2159.88,2000,1280000,926,0.565,104775.10,127589.32,127589.32
2,1000,10,64,20,18282.56,20000,12800000,1094,0.668,918884.38,957383.59,957383.59

Development

Successfully merging this pull request may close these issues.

Automatically terminate idle Computing Units to reclaim cluster resources
