refactor(workflow-operator): extract shared Sklearn descriptor base#6048
refactor(workflow-operator): extract shared Sklearn descriptor base#6048Ma77Ball wants to merge 2 commits into
Conversation
…ssifieropdesc and sklearntrainingopdesc
Automated Reviewer SuggestionsBased on the
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6048 +/- ##
============================================
- Coverage 56.90% 56.89% -0.01%
+ Complexity 3024 3021 -3
============================================
Files 1129 1130 +1
Lines 43794 43787 -7
Branches 4743 4743
============================================
- Hits 24920 24913 -7
Misses 17399 17399
Partials 1475 1475
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 452 | 0.276 | 22,114/28,815/28,815 us | 🔴 +10.9% / 🔴 +94.9% |
| 🔴 | bs=100 sw=10 sl=64 | 935 | 0.57 | 104,162/158,675/158,675 us | 🔴 +26.4% / 🔴 +48.4% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,105 | 0.675 | 907,912/949,980/949,980 us | ⚪ within ±5% / 🟢 -9.9% |
Baseline details
Latest main baca3f9 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 452 tuples/sec | 466 tuples/sec | 786.27 tuples/sec | -3.0% | -42.5% |
| bs=10 sw=10 sl=64 | MB/s | 0.276 MB/s | 0.284 MB/s | 0.48 MB/s | -2.8% | -42.5% |
| bs=10 sw=10 sl=64 | p50 | 22,114 us | 19,942 us | 12,495 us | +10.9% | +77.0% |
| bs=10 sw=10 sl=64 | p95 | 28,815 us | 30,514 us | 14,784 us | -5.6% | +94.9% |
| bs=10 sw=10 sl=64 | p99 | 28,815 us | 30,514 us | 18,468 us | -5.6% | +56.0% |
| bs=100 sw=10 sl=64 | throughput | 935 tuples/sec | 994 tuples/sec | 991.49 tuples/sec | -5.9% | -5.7% |
| bs=100 sw=10 sl=64 | MB/s | 0.57 MB/s | 0.607 MB/s | 0.605 MB/s | -6.1% | -5.8% |
| bs=100 sw=10 sl=64 | p50 | 104,162 us | 97,253 us | 100,929 us | +7.1% | +3.2% |
| bs=100 sw=10 sl=64 | p95 | 158,675 us | 125,580 us | 106,894 us | +26.4% | +48.4% |
| bs=100 sw=10 sl=64 | p99 | 158,675 us | 125,580 us | 114,085 us | +26.4% | +39.1% |
| bs=1000 sw=10 sl=64 | throughput | 1,105 tuples/sec | 1,114 tuples/sec | 1,023 tuples/sec | -0.8% | +8.0% |
| bs=1000 sw=10 sl=64 | MB/s | 0.675 MB/s | 0.68 MB/s | 0.624 MB/s | -0.7% | +8.1% |
| bs=1000 sw=10 sl=64 | p50 | 907,912 us | 898,247 us | 983,835 us | +1.1% | -7.7% |
| bs=1000 sw=10 sl=64 | p95 | 949,980 us | 999,338 us | 1,023,777 us | -4.9% | -7.2% |
| bs=1000 sw=10 sl=64 | p99 | 949,980 us | 999,338 us | 1,053,883 us | -4.9% | -9.9% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,442.28,200,128000,452,0.276,22114.42,28814.86,28814.86
1,100,10,64,20,2140.04,2000,1280000,935,0.570,104161.56,158674.63,158674.63
2,1000,10,64,20,18093.38,20000,12800000,1105,0.675,907911.68,949980.25,949980.25|
|
||
| @JsonIgnore | ||
| def getImportStatements = "from sklearn.ensemble import RandomForestClassifier" | ||
| override def getImportStatements = "from sklearn.ensemble import RandomForestClassifier" |
There was a problem hiding this comment.
This string should be empty since it gets overridden anyway. The placeholder was introduced by mistake in the first place.
|
|
||
| @JsonIgnore | ||
| def getUserFriendlyModelName = "RandomForest Training" | ||
| override def getUserFriendlyModelName = "RandomForest Training" |
What changes were proposed in this PR?
SklearnModelOpDesc extends PythonOperatorDescriptorholding the four shared config fields (target,countVectorizer,text,tfidfTransformer) with their Jackson/JSON-schema annotations, the two@JsonIgnorehooks (getImportStatements,getUserFriendlyModelName), and the sharedgetOutputSchemas.SklearnClassifierOpDescandSklearnTrainingOpDescto extend the new base, so each keeps only what genuinely differs (generatePythonCode,operatorInfo), removing ~50 lines of copy-pasted, drift-prone code.Any related issues, documentation, discussions?
Closes: #6038
How was this PR tested?
SklearnOpDescRegistrySpec, which instantiates a leaf classifier, sets the inheritedtarget/countVectorizerfields, and asserts ongeneratePythonCode()output. Run it in CI viasbt "WorkflowOperator/jacoco", expect the Sklearn registry specs green.sbt "WorkflowOperator/Test/compile"(compiles clean; both descriptors resolve againstSklearnModelOpDesc).sbt "WorkflowOperator/testOnly *SklearnOpDescRegistrySpec"could not run locally because sbt-jacoco offline-instruments the Test classpath and JaCoCo 0.8.11 fails to instrument the vendoredJsonSchemaDraft.classon this JDK (pre-existing tooling issue, unrelated to this change); the spec runs in CI under theWorkflowOperator/jacocojob.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF