refactor(workflow-operator): extract shared Sklearn descriptor base by Ma77Ball · Pull Request #6048 · apache/texera

Ma77Ball · 2026-07-01T18:23:33Z

What changes were proposed in this PR?

Add abstract SklearnModelOpDesc extends PythonOperatorDescriptor holding the four shared config fields (target, countVectorizer, text, tfidfTransformer) with their Jackson/JSON-schema annotations, the two @JsonIgnore hooks (getImportStatements, getUserFriendlyModelName), and the shared getOutputSchemas.
Change SklearnClassifierOpDesc and SklearnTrainingOpDesc to extend the new base, so each keeps only what genuinely differs (generatePythonCode, operatorInfo), removing ~50 lines of copy-pasted, drift-prone code.
The 51 leaf subclasses inherit the fields unchanged, so the generated operator schema is identical (no behavior change).

Any related issues, documentation, discussions?

Closes: #6038

How was this PR tested?

Pure refactor with no behavior change; covered by existing SklearnOpDescRegistrySpec, which instantiates a leaf classifier, sets the inherited target/countVectorizer fields, and asserts on generatePythonCode() output. Run it in CI via sbt "WorkflowOperator/jacoco", expect the Sklearn registry specs green.
Verified locally with sbt "WorkflowOperator/Test/compile" (compiles clean; both descriptors resolve against SklearnModelOpDesc).
Note: sbt "WorkflowOperator/testOnly *SklearnOpDescRegistrySpec" could not run locally because sbt-jacoco offline-instruments the Test classpath and JaCoCo 0.8.11 fails to instrument the vendored JsonSchemaDraft.class on this JDK (pre-existing tooling issue, unrelated to this change); the spec runs in CI under the WorkflowOperator/jacoco job.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

…ssifieropdesc and sklearntrainingopdesc

…-shared-base

github-actions · 2026-07-01T18:23:51Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @aglinxinyuan, @carloea2
You can notify them by mentioning @aglinxinyuan, @carloea2 in a comment.

codecov-commenter · 2026-07-01T18:26:44Z

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.89%. Comparing base (104bcc4) to head (fd17af0).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...ber/operator/sklearn/SklearnClassifierOpDesc.scala	33.33%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #6048      +/-   ##
============================================
- Coverage     56.90%   56.89%   -0.01%     
+ Complexity     3024     3021       -3     
============================================
  Files          1129     1130       +1     
  Lines         43794    43787       -7     
  Branches       4743     4743              
============================================
- Hits          24920    24913       -7     
  Misses        17399    17399              
  Partials       1475     1475

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`44.59% <ø> (ø)`		Carriedforward from 104bcc4
amber	`58.73% <86.66%> (-0.02%)`	⬇️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`52.30% <ø> (ø)`
file-service	`62.81% <ø> (ø)`
frontend	`50.06% <ø> (ø)`		Carriedforward from 104bcc4
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`90.20% <ø> (ø)`		Carriedforward from 104bcc4
python	`90.76% <ø> (ø)`		Carriedforward from 104bcc4
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-07-01T18:29:08Z

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 6 worse · ⚪ 7 noise (<±5%) · 0 without baseline

Compared against main baca3f9 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	452	0.276	22,114/28,815/28,815 us	🔴 +10.9% / 🔴 +94.9%
🔴	bs=100 sw=10 sl=64	935	0.57	104,162/158,675/158,675 us	🔴 +26.4% / 🔴 +48.4%
⚪	bs=1000 sw=10 sl=64	1,105	0.675	907,912/949,980/949,980 us	⚪ within ±5% / 🟢 -9.9%

Baseline details

Latest main baca3f9 from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	452 tuples/sec	466 tuples/sec	786.27 tuples/sec	-3.0%	-42.5%
bs=10 sw=10 sl=64	MB/s	0.276 MB/s	0.284 MB/s	0.48 MB/s	-2.8%	-42.5%
bs=10 sw=10 sl=64	p50	22,114 us	19,942 us	12,495 us	+10.9%	+77.0%
bs=10 sw=10 sl=64	p95	28,815 us	30,514 us	14,784 us	-5.6%	+94.9%
bs=10 sw=10 sl=64	p99	28,815 us	30,514 us	18,468 us	-5.6%	+56.0%
bs=100 sw=10 sl=64	throughput	935 tuples/sec	994 tuples/sec	991.49 tuples/sec	-5.9%	-5.7%
bs=100 sw=10 sl=64	MB/s	0.57 MB/s	0.607 MB/s	0.605 MB/s	-6.1%	-5.8%
bs=100 sw=10 sl=64	p50	104,162 us	97,253 us	100,929 us	+7.1%	+3.2%
bs=100 sw=10 sl=64	p95	158,675 us	125,580 us	106,894 us	+26.4%	+48.4%
bs=100 sw=10 sl=64	p99	158,675 us	125,580 us	114,085 us	+26.4%	+39.1%
bs=1000 sw=10 sl=64	throughput	1,105 tuples/sec	1,114 tuples/sec	1,023 tuples/sec	-0.8%	+8.0%
bs=1000 sw=10 sl=64	MB/s	0.675 MB/s	0.68 MB/s	0.624 MB/s	-0.7%	+8.1%
bs=1000 sw=10 sl=64	p50	907,912 us	898,247 us	983,835 us	+1.1%	-7.7%
bs=1000 sw=10 sl=64	p95	949,980 us	999,338 us	1,023,777 us	-4.9%	-7.2%
bs=1000 sw=10 sl=64	p99	949,980 us	999,338 us	1,053,883 us	-4.9%	-9.9%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,442.28,200,128000,452,0.276,22114.42,28814.86,28814.86
1,100,10,64,20,2140.04,2000,1280000,935,0.570,104161.56,158674.63,158674.63
2,1000,10,64,20,18093.38,20000,12800000,1105,0.675,907911.68,949980.25,949980.25

aglinxinyuan · 2026-07-02T02:31:29Z


-  @JsonIgnore
-  def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"
+  override def getImportStatements = "from sklearn.ensemble import RandomForestClassifier"


This string should be empty since it gets overridden anyway. The placeholder was introduced by mistake in the first place.

aglinxinyuan · 2026-07-02T02:31:34Z


-  @JsonIgnore
-  def getUserFriendlyModelName = "RandomForest Training"
+  override def getUserFriendlyModelName = "RandomForest Training"


Ma77Ball added 2 commits July 1, 2026 10:45

refactor(sklearn): moved base class to shared file between sklearncla…

42ae372

…ssifieropdesc and sklearntrainingopdesc

Merge branch 'main' of github.com:apache/texera into refactor/sklearn…

fd17af0

…-shared-base

github-actions Bot assigned Ma77Ball Jul 1, 2026

github-actions Bot added refactor Refactor the code common labels Jul 1, 2026

aglinxinyuan requested review from aglinxinyuan and Copilot and removed request for Copilot July 2, 2026 02:29

aglinxinyuan approved these changes Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(workflow-operator): extract shared Sklearn descriptor base#6048

refactor(workflow-operator): extract shared Sklearn descriptor base#6048
Ma77Ball wants to merge 2 commits into
apache:mainfrom
Ma77Ball:refactor/sklearn-shared-base

Ma77Ball commented Jul 1, 2026

Uh oh!

github-actions Bot commented Jul 1, 2026

Uh oh!

codecov-commenter commented Jul 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 1, 2026

Uh oh!

aglinxinyuan Jul 2, 2026

Uh oh!

aglinxinyuan Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Ma77Ball commented Jul 1, 2026

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jul 1, 2026

Automated Reviewer Suggestions

Uh oh!

codecov-commenter commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jul 1, 2026

⚠️ Benchmark changes need a look

Uh oh!

aglinxinyuan Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

aglinxinyuan Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Jul 1, 2026 •

edited

Loading