Skip to content

fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037

Open
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:fix/flaky-workerspec-input-port
Open

fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:fix/flaky-workerspec-input-port

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • WorkerSpec's input-port AssignPortRequest now passes an empty storageUris (List()) matching its empty partitionings, satisfying InputManager.addPort's size invariant, so the input port registers and the three worker tests pass deterministically.
  • AsyncRPCServer.invokeMethod now re-throws Errors (e.g. failed assertions) after replying to the sender, instead of swallowing them; normal Exceptions are still returned to the sender unchanged, so genuine invariant violations surface loudly rather than degrading into flaky timeouts.

Any related issues, documentation, discussions?

Closes: #6036

How was this PR tested?

  • Run sbt "WorkflowExecutionService/testOnly *WorkerSpec"; expect Tests: succeeded 3, failed 0.
  • Diagnostic: grep the run log for assertion failed. Before this fix the count is 3 (the assertion fired and was swallowed every run); after it is 0. Verified locally under Java 17.
  • The full amber suite is heavy and flaky to run locally, so the engine-wide AsyncRPCServer change is left to the amber CI job to exercise across all RPC handlers.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang
    You can notify them by mentioning @Yicong-Huang in a comment.

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

✅ No material benchmark regressions detected

🟢 5 better · 🔴 0 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main 1073b22 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🟢 bs=10 sw=10 sl=64 591 0.361 16,164/22,947/22,947 us 🟢 -10.7% / 🔴 +55.2%
🟢 bs=100 sw=10 sl=64 1,242 0.758 78,658/98,798/98,798 us 🟢 -7.0% / 🟢 +25.3%
bs=1000 sw=10 sl=64 1,405 0.858 706,210/776,519/776,519 us ⚪ within ±5% / 🟢 +37.4%
Baseline details

Latest main 1073b22 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 591 tuples/sec 583 tuples/sec 786.27 tuples/sec +1.4% -24.8%
bs=10 sw=10 sl=64 MB/s 0.361 MB/s 0.356 MB/s 0.48 MB/s +1.4% -24.8%
bs=10 sw=10 sl=64 p50 16,164 us 17,232 us 12,495 us -6.2% +29.4%
bs=10 sw=10 sl=64 p95 22,947 us 25,697 us 14,784 us -10.7% +55.2%
bs=10 sw=10 sl=64 p99 22,947 us 25,697 us 18,468 us -10.7% +24.3%
bs=100 sw=10 sl=64 throughput 1,242 tuples/sec 1,226 tuples/sec 991.49 tuples/sec +1.3% +25.3%
bs=100 sw=10 sl=64 MB/s 0.758 MB/s 0.748 MB/s 0.605 MB/s +1.3% +25.3%
bs=100 sw=10 sl=64 p50 78,658 us 80,848 us 100,929 us -2.7% -22.1%
bs=100 sw=10 sl=64 p95 98,798 us 106,221 us 106,894 us -7.0% -7.6%
bs=100 sw=10 sl=64 p99 98,798 us 106,221 us 114,085 us -7.0% -13.4%
bs=1000 sw=10 sl=64 throughput 1,405 tuples/sec 1,440 tuples/sec 1,023 tuples/sec -2.4% +37.4%
bs=1000 sw=10 sl=64 MB/s 0.858 MB/s 0.879 MB/s 0.624 MB/s -2.4% +37.4%
bs=1000 sw=10 sl=64 p50 706,210 us 698,853 us 983,835 us +1.1% -28.2%
bs=1000 sw=10 sl=64 p95 776,519 us 741,459 us 1,023,777 us +4.7% -24.2%
bs=1000 sw=10 sl=64 p99 776,519 us 741,459 us 1,053,883 us +4.7% -26.3%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,338.22,200,128000,591,0.361,16163.87,22946.85,22946.85
1,100,10,64,20,1610.60,2000,1280000,1242,0.758,78658.20,98797.54,98797.54
2,1000,10,64,20,14231.84,20000,12800000,1405,0.858,706209.65,776519.38,776519.38

@codecov-commenter

codecov-commenter commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.87%. Comparing base (c3161f7) to head (6d39dfc).
⚠️ Report is 70 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6037      +/-   ##
============================================
+ Coverage     54.50%   56.87%   +2.37%     
+ Complexity     2915     1901    -1014     
============================================
  Files          1108      843     -265     
  Lines         42807    34646    -8161     
  Branches       4604     3480    -1124     
============================================
- Hits          23332    19705    -3627     
+ Misses        18119    13986    -4133     
+ Partials       1356      955     -401     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <ø> (ø) Carriedforward from 12c1107
agent-service 34.36% <ø> (ø) Carriedforward from 12c1107
amber 68.17% <ø> (+11.65%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from 12c1107
config-service 57.35% <ø> (ø) Carriedforward from 12c1107
file-service 58.59% <ø> (ø) Carriedforward from 12c1107
frontend 48.27% <ø> (ø) Carriedforward from 12c1107
pyamber 90.20% <ø> (ø) Carriedforward from 12c1107
python 90.76% <ø> (ø) Carriedforward from 12c1107
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from 12c1107

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Updated error handling comment for clarity.

Signed-off-by: Matthew B. <mgball@uci.edu>
@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 30, 2026
Comment on lines +106 to +108
// Re-throw Errors (e.g. failed assertions) after replying; only
// Exceptions are returned to the sender and recovered from.
if (err.isInstanceOf[Error]) throw err

@Yicong-Huang Yicong-Huang Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come non errors can reach here? that's a bigger problem...

with your change, this "non-error" is omitted silently...

can we at least log warning when it happens?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped the throw since it can take down the actor loop. The Error was an AssertionError from the test's mismatched addPort args; it now just logs a warning so it's not silent.

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The src code seems to have a bigger issue. we should not make source code less robust to satisfy test cases...

@Ma77Ball

Ma77Ball commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Reverted the re-throw. Fixed the real root cause: the test passed mismatched storageUris/partitionings, now corrected in one line. AsyncRPCServer only keeps a warning log so a swallowed Error isn't silent.

@Ma77Ball Ma77Ball requested a review from Yicong-Huang July 1, 2026 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine fix release/v1.2 back porting to release/v1.2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky WorkerSpec: input-port AssignPortRequest violates addPort size invariant

4 participants