test(workflow-operator): add unit test coverage for Sklearn training ensemble & linear descriptors#5955
Pull request overview
This PR adds ScalaTest unit specs under common/workflow-operator to pin the current behavior of nine previously-untested Sklearn training operator descriptors (operator metadata, default config values, output schema shape, Python codegen basics, and JSON polymorphic round-trip via LogicalOp).
Changes:
- Add 9 new `AnyFlatSpec` test classes covering `SklearnTraining*OpDesc` descriptors in the Sklearn training family.
- Assert consistent `operatorInfo` contract (group, ports, blocking output) and config defaults (`countVectorizer`/`tfidfTransformer` false; `target`/`text` null).
- Validate output schema fields (`model_name` STRING, `model` BINARY), basic codegen import expectations, and `operatorType` discriminator round-tripping.
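As a rough illustration of what a discriminator round-trip checks, here is a language-neutral analog in Python. It is not the Jackson-based mechanism the Scala specs exercise; the registry, the class names, and the discriminator value `"SklearnTrainingPerceptron"` are all hypothetical. Only the field names and their defaults come from the summary above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical registry: discriminator string -> descriptor class.
REGISTRY = {}


def register(operator_type):
    """Attach a discriminator value to a class and record it in the registry."""
    def wrap(cls):
        cls.operator_type = operator_type
        REGISTRY[operator_type] = cls
        return cls
    return wrap


@dataclass
class TrainingDesc:
    # Defaults as listed in the review summary: flags false, columns unset.
    countVectorizer: bool = False
    tfidfTransformer: bool = False
    target: Optional[str] = None
    text: Optional[str] = None

    def to_json(self):
        # Serialize the fields plus the discriminator of the concrete class.
        return json.dumps({"operatorType": self.operator_type, **asdict(self)})


@register("SklearnTrainingPerceptron")  # hypothetical discriminator value
@dataclass
class PerceptronDesc(TrainingDesc):
    pass


def from_json(text):
    """Deserialize through the base: the discriminator selects the subclass."""
    payload = json.loads(text)
    cls = REGISTRY[payload.pop("operatorType")]
    return cls(**payload)


original = PerceptronDesc()
restored = from_json(original.to_json())
```

A round-trip test then asserts that `restored` has the same concrete type and field values as `original`, and that the serialized form carries the expected `operatorType`.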
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingAdaptiveBoostingOpDescSpec.scala | Adds unit coverage for AdaBoost training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingBaggingOpDescSpec.scala | Adds unit coverage for Bagging training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingGradientBoostingOpDescSpec.scala | Adds unit coverage for Gradient Boosting training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingLinearRegressionOpDescSpec.scala | Adds unit coverage for Linear Regression training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingLogisticRegressionCVOpDescSpec.scala | Adds unit coverage for LogisticRegressionCV training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingLogisticRegressionOpDescSpec.scala | Adds unit coverage for Logistic Regression training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingPassiveAggressiveOpDescSpec.scala | Adds unit coverage for Passive Aggressive training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingPerceptronOpDescSpec.scala | Adds unit coverage for Perceptron training descriptor contract/codegen/round-trip. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingSDGOpDescSpec.scala | Adds unit coverage for SGD training descriptor contract/codegen/round-trip (using existing SDG naming in codebase). |
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #5955 +/- ##
=========================================
Coverage 54.91% 54.91%
Complexity 2956 2956
=========================================
Files 1117 1117
Lines 43133 43133
Branches 4648 4648
=========================================
+ Hits 23685 23686 +1
+ Misses 18054 18051 -3
- Partials 1394 1396 +2
| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 364 | 0.222 | 26,473/37,622/37,622 us | 🔴 +11.6% / 🔴 +152.7% |
| 🟢 | bs=100 sw=10 sl=64 | 802 | 0.489 | 124,621/140,467/140,467 us | 🟢 -6.1% / 🔴 +32.2% |
| ⚪ | bs=1000 sw=10 sl=64 | 916 | 0.559 | 1,090,698/1,118,560/1,118,560 us | ⚪ within ±5% / 🔴 +12.4% |
Baseline details
Latest main 2ebfc28 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 364 tuples/sec | 396 tuples/sec | 781.13 tuples/sec | -8.1% | -53.4% |
| bs=10 sw=10 sl=64 | MB/s | 0.222 MB/s | 0.242 MB/s | 0.477 MB/s | -8.3% | -53.4% |
| bs=10 sw=10 sl=64 | p50 | 26,473 us | 24,766 us | 12,542 us | +6.9% | +111.1% |
| bs=10 sw=10 sl=64 | p95 | 37,622 us | 33,719 us | 14,886 us | +11.6% | +152.7% |
| bs=10 sw=10 sl=64 | p99 | 37,622 us | 33,719 us | 17,580 us | +11.6% | +114.0% |
| bs=100 sw=10 sl=64 | throughput | 802 tuples/sec | 820 tuples/sec | 999.37 tuples/sec | -2.2% | -19.7% |
| bs=100 sw=10 sl=64 | MB/s | 0.489 MB/s | 0.501 MB/s | 0.61 MB/s | -2.4% | -19.8% |
| bs=100 sw=10 sl=64 | p50 | 124,621 us | 121,333 us | 99,687 us | +2.7% | +25.0% |
| bs=100 sw=10 sl=64 | p95 | 140,467 us | 149,535 us | 106,271 us | -6.1% | +32.2% |
| bs=100 sw=10 sl=64 | p99 | 140,467 us | 149,535 us | 115,445 us | -6.1% | +21.7% |
| bs=1000 sw=10 sl=64 | throughput | 916 tuples/sec | 926 tuples/sec | 1,036 tuples/sec | -1.1% | -11.6% |
| bs=1000 sw=10 sl=64 | MB/s | 0.559 MB/s | 0.565 MB/s | 0.632 MB/s | -1.1% | -11.6% |
| bs=1000 sw=10 sl=64 | p50 | 1,090,698 us | 1,082,440 us | 970,675 us | +0.8% | +12.4% |
| bs=1000 sw=10 sl=64 | p95 | 1,118,560 us | 1,103,515 us | 1,011,928 us | +1.4% | +10.5% |
| bs=1000 sw=10 sl=64 | p99 | 1,118,560 us | 1,103,515 us | 1,045,045 us | +1.4% | +7.0% |
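The Δ columns above are consistent with a plain relative change of the PR value against each baseline. The sketch below reproduces a few of them. The `flag_latency` rule (±5% neutral band; higher latency counted as a regression) is an assumption inferred from the summary table's markers, not taken from the benchmark tooling.

```python
def pct_delta(pr_value, baseline):
    """Relative change of the PR value against a baseline, in percent."""
    return (pr_value - baseline) / baseline * 100.0


def flag_latency(delta_pct, threshold=5.0):
    # Assumed rule for latency metrics only: inside the threshold is neutral,
    # an increase is a regression, a decrease is an improvement.
    if abs(delta_pct) <= threshold:
        return "within"
    return "regression" if delta_pct > 0 else "improvement"


# Values copied from the baseline table (bs=10 throughput, bs=10 p95).
throughput_vs_latest = round(pct_delta(364, 396), 1)
p95_vs_7d = round(pct_delta(37622, 14886), 1)

# bs=100 p95 against latest main, and bs=1000 p95 against latest main.
bs100_flag = flag_latency(pct_delta(140467, 149535))
bs1000_flag = flag_latency(pct_delta(1118560, 1103515))
```

For throughput the sign convention would presumably be inverted (a drop is the regression), which this sketch does not model.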
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,548.96,200,128000,364,0.222,26473.10,37622.43,37622.43
1,100,10,64,20,2494.46,2000,1280000,802,0.489,124621.34,140467.21,140467.21
2,1000,10,64,20,21838.09,20000,12800000,916,0.559,1090698.21,1118560.47,1118560.47
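The throughput and MB/s columns can be recomputed from `total_ms`, `total_tuples`, and `total_bytes`. The sketch below does that for the three rows above. The divisor of 1024 × 1024 bytes per MB is an assumption; it is the value that matches the reported figures, not something the benchmark output states.

```python
import csv
import io

RAW = """config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,548.96,200,128000,364,0.222,26473.10,37622.43,37622.43
1,100,10,64,20,2494.46,2000,1280000,802,0.489,124621.34,140467.21,140467.21
2,1000,10,64,20,21838.09,20000,12800000,916,0.559,1090698.21,1118560.47,1118560.47
"""


def derive(row):
    """Recompute (tuples/sec, MB/s) from the raw totals of one CSV row."""
    seconds = float(row["total_ms"]) / 1000.0
    tuples_per_sec = int(row["total_tuples"]) / seconds
    # Assumed unit: 1 MB = 1024 * 1024 bytes.
    mb_per_sec = int(row["total_bytes"]) / seconds / (1024 * 1024)
    return round(tuples_per_sec), round(mb_per_sec, 3)


rows = list(csv.DictReader(io.StringIO(RAW)))
derived = [derive(r) for r in rows]
reported = [(int(r["tuples_per_sec"]), float(r["mb_per_sec"])) for r in rows]
```

Comparing `derived` with `reported` is a quick sanity check that the summary columns agree with the raw totals.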
What changes were proposed in this PR?
Pin behavior of nine previously-untested Sklearn training operator descriptors in `common/workflow-operator`. No production-code changes.
| Spec | Descriptor |
|---|---|
| `SklearnTrainingAdaptiveBoostingOpDescSpec` | `SklearnTrainingAdaptiveBoostingOpDesc` |
| `SklearnTrainingBaggingOpDescSpec` | `SklearnTrainingBaggingOpDesc` |
| `SklearnTrainingGradientBoostingOpDescSpec` | `SklearnTrainingGradientBoostingOpDesc` |
| `SklearnTrainingLogisticRegressionOpDescSpec` | `SklearnTrainingLogisticRegressionOpDesc` |
| `SklearnTrainingLogisticRegressionCVOpDescSpec` | `SklearnTrainingLogisticRegressionCVOpDesc` |
| `SklearnTrainingPerceptronOpDescSpec` | `SklearnTrainingPerceptronOpDesc` |
| `SklearnTrainingPassiveAggressiveOpDescSpec` | `SklearnTrainingPassiveAggressiveOpDesc` |
| `SklearnTrainingSDGOpDescSpec` | `SklearnTrainingSDGOpDesc` |
| `SklearnTrainingLinearRegressionOpDescSpec` | `SklearnTrainingLinearRegressionOpDesc` |
Behavior pinned (shared `SklearnTrainingOpDesc` contract)
- `operatorInfo`: (`Training: <model>`) + `Sklearn <name> Operator` description; Sklearn Training group; single `training` input port; one blocking output
- Config defaults: `countVectorizer`/`tfidfTransformer` false; `target`/`text` null
- `getOutputSchemas`: `model_name` (STRING) + `model` (BINARY) keyed by the declared output port
- `generatePythonCode`: `make_pipeline(...).fit(X, Y)` training model
- JSON round-trip: via the `LogicalOp` base, with the correct `operatorType` discriminator
Any related issues, documentation, discussions?
Part of the ongoing `workflow-operator` unit-test coverage effort (the training-side counterpart to the Sklearn classifier coverage in #5925/#5939/#5940/#5941/#5945/#5946/#5951).
How was this PR tested?
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.sklearn.training.*"`: 45 tests, all green
- `sbt "WorkflowOperator/Test/scalafmtCheck"` and `sbt "WorkflowOperator/scalafixAll --check"`: clean
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8 [1M context])