Skip to content

refactor(workflow-operator): extract shared Sklearn descriptor base#6048

Open
Ma77Ball wants to merge 2 commits into
apache:mainfrom
Ma77Ball:refactor/sklearn-shared-base
Open

refactor(workflow-operator): extract shared Sklearn descriptor base#6048
Ma77Ball wants to merge 2 commits into
apache:mainfrom
Ma77Ball:refactor/sklearn-shared-base

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Add abstract SklearnModelOpDesc extends PythonOperatorDescriptor holding the four shared config fields (target, countVectorizer, text, tfidfTransformer) with their Jackson/JSON-schema annotations, the two @JsonIgnore hooks (getImportStatements, getUserFriendlyModelName), and the shared getOutputSchemas.
  • Change SklearnClassifierOpDesc and SklearnTrainingOpDesc to extend the new base, so each keeps only what genuinely differs (generatePythonCode, operatorInfo), removing ~50 lines of copy-pasted, drift-prone code.
  • The 51 leaf subclasses inherit the fields unchanged, so the generated operator schema is identical (no behavior change).

Any related issues, documentation, discussions?

Closes: #6038

How was this PR tested?

  • Pure refactor with no behavior change; covered by existing SklearnOpDescRegistrySpec, which instantiates a leaf classifier, sets the inherited target/countVectorizer fields, and asserts on generatePythonCode() output. Run it in CI via sbt "WorkflowOperator/jacoco", expect the Sklearn registry specs green.
  • Verified locally with sbt "WorkflowOperator/Test/compile" (compiles clean; both descriptors resolve against SklearnModelOpDesc).
  • Note: sbt "WorkflowOperator/testOnly *SklearnOpDescRegistrySpec" could not run locally because sbt-jacoco offline-instruments the Test classpath and JaCoCo 0.8.11 fails to instrument the vendored JsonSchemaDraft.class on this JDK (pre-existing tooling issue, unrelated to this change); the spec runs in CI under the WorkflowOperator/jacoco job.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions github-actions Bot added refactor Refactor the code common labels Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @aglinxinyuan, @carloea2
    You can notify them by mentioning @aglinxinyuan, @carloea2 in a comment.

@codecov-commenter

codecov-commenter commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.89%. Comparing base (104bcc4) to head (fd17af0).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...ber/operator/sklearn/SklearnClassifierOpDesc.scala 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6048      +/-   ##
============================================
- Coverage     56.90%   56.89%   -0.01%     
+ Complexity     3024     3021       -3     
============================================
  Files          1129     1130       +1     
  Lines         43794    43787       -7     
  Branches       4743     4743              
============================================
- Hits          24920    24913       -7     
  Misses        17399    17399              
  Partials       1475     1475              
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 44.59% <ø> (ø) Carriedforward from 104bcc4
amber 58.73% <86.66%> (-0.02%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 52.30% <ø> (ø)
file-service 62.81% <ø> (ø)
frontend 50.06% <ø> (ø) Carriedforward from 104bcc4
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.20% <ø> (ø) Carriedforward from 104bcc4
python 90.76% <ø> (ø) Carriedforward from 104bcc4
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 6 worse · ⚪ 7 noise (<±5%) · 0 without baseline

Compared against main baca3f9 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 452 0.276 22,114/28,815/28,815 us 🔴 +10.9% / 🔴 +94.9%
🔴 bs=100 sw=10 sl=64 935 0.57 104,162/158,675/158,675 us 🔴 +26.4% / 🔴 +48.4%
bs=1000 sw=10 sl=64 1,105 0.675 907,912/949,980/949,980 us ⚪ within ±5% / 🟢 -9.9%
Baseline details

Latest main baca3f9 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 452 tuples/sec 466 tuples/sec 786.27 tuples/sec -3.0% -42.5%
bs=10 sw=10 sl=64 MB/s 0.276 MB/s 0.284 MB/s 0.48 MB/s -2.8% -42.5%
bs=10 sw=10 sl=64 p50 22,114 us 19,942 us 12,495 us +10.9% +77.0%
bs=10 sw=10 sl=64 p95 28,815 us 30,514 us 14,784 us -5.6% +94.9%
bs=10 sw=10 sl=64 p99 28,815 us 30,514 us 18,468 us -5.6% +56.0%
bs=100 sw=10 sl=64 throughput 935 tuples/sec 994 tuples/sec 991.49 tuples/sec -5.9% -5.7%
bs=100 sw=10 sl=64 MB/s 0.57 MB/s 0.607 MB/s 0.605 MB/s -6.1% -5.8%
bs=100 sw=10 sl=64 p50 104,162 us 97,253 us 100,929 us +7.1% +3.2%
bs=100 sw=10 sl=64 p95 158,675 us 125,580 us 106,894 us +26.4% +48.4%
bs=100 sw=10 sl=64 p99 158,675 us 125,580 us 114,085 us +26.4% +39.1%
bs=1000 sw=10 sl=64 throughput 1,105 tuples/sec 1,114 tuples/sec 1,023 tuples/sec -0.8% +8.0%
bs=1000 sw=10 sl=64 MB/s 0.675 MB/s 0.68 MB/s 0.624 MB/s -0.7% +8.1%
bs=1000 sw=10 sl=64 p50 907,912 us 898,247 us 983,835 us +1.1% -7.7%
bs=1000 sw=10 sl=64 p95 949,980 us 999,338 us 1,023,777 us -4.9% -7.2%
bs=1000 sw=10 sl=64 p99 949,980 us 999,338 us 1,053,883 us -4.9% -9.9%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,442.28,200,128000,452,0.276,22114.42,28814.86,28814.86
1,100,10,64,20,2140.04,2000,1280000,935,0.570,104161.56,158674.63,158674.63
2,1000,10,64,20,18093.38,20000,12800000,1105,0.675,907911.68,949980.25,949980.25

@aglinxinyuan aglinxinyuan requested review from aglinxinyuan and Copilot and removed request for Copilot July 2, 2026 02:29

@JsonIgnore
def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"
override def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string should be empty since it gets overridden anyway. The placeholder was introduced by mistake in the first place.


@JsonIgnore
def getUserFriendlyModelName = "RandomForest Training"
override def getUserFriendlyModelName = "RandomForest Training"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common refactor Refactor the code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deduplicate Sklearn classifier and training operator descriptors via a shared base

3 participants