test(workflow-operator): add unit test coverage for Sklearn training NB & tree descriptors#5954
Conversation
…NB & tree descriptors
Automated Reviewer SuggestionsBased on the
|
There was a problem hiding this comment.
Pull request overview
Adds ScalaTest unit coverage for the previously-untested Sklearn training operator descriptors in common/workflow-operator, pinning their shared SklearnTrainingOpDesc contract without modifying production code.
Changes:
- Added 8 descriptor-spec test files covering
operatorInfo, config defaults, output schema, generated Python code snippets, and polymorphic JSON round-tripping. - Standardized assertions to match the existing Sklearn descriptor-spec pattern already used in
org.apache.texera.amber.operator.sklearn.*tests.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingBernoulliNaiveBayesOpDescSpec.scala | Adds contract tests for BernoulliNB training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingComplementNaiveBayesOpDescSpec.scala | Adds contract tests for ComplementNB training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingGaussianNaiveBayesOpDescSpec.scala | Adds contract tests for GaussianNB training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingMultinomialNaiveBayesOpDescSpec.scala | Adds contract tests for MultinomialNB training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingDecisionTreeOpDescSpec.scala | Adds contract tests for DecisionTreeClassifier training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingExtraTreeOpDescSpec.scala | Adds contract tests for ExtraTreeClassifier training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingExtraTreesOpDescSpec.scala | Adds contract tests for ExtraTreesClassifier training descriptor. |
| common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingRandomForestOpDescSpec.scala | Adds contract tests for RandomForestClassifier training descriptor. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5954 +/- ##
============================================
+ Coverage 54.91% 54.93% +0.02%
- Complexity 2956 2960 +4
============================================
Files 1117 1117
Lines 43133 43133
Branches 4648 4648
============================================
+ Hits 23685 23694 +9
+ Misses 18054 18048 -6
+ Partials 1394 1391 -3
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 419 | 0.256 | 22,770/30,543/30,543 us | 🔴 +9.3% / 🔴 +105.2% |
| ⚪ | bs=100 sw=10 sl=64 | 856 | 0.522 | 114,067/142,266/142,266 us | ⚪ within ±5% / 🔴 +33.9% |
| ⚪ | bs=1000 sw=10 sl=64 | 958 | 0.585 | 1,046,739/1,073,850/1,073,850 us | ⚪ within ±5% / 🔴 +7.8% |
Baseline details
Latest main 2ebfc28 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 419 tuples/sec | 446 tuples/sec | 781.13 tuples/sec | -6.1% | -46.4% |
| bs=10 sw=10 sl=64 | MB/s | 0.256 MB/s | 0.272 MB/s | 0.477 MB/s | -5.9% | -46.3% |
| bs=10 sw=10 sl=64 | p50 | 22,770 us | 20,824 us | 12,542 us | +9.3% | +81.6% |
| bs=10 sw=10 sl=64 | p95 | 30,543 us | 30,674 us | 14,886 us | -0.4% | +105.2% |
| bs=10 sw=10 sl=64 | p99 | 30,543 us | 30,674 us | 17,580 us | -0.4% | +73.7% |
| bs=100 sw=10 sl=64 | throughput | 856 tuples/sec | 852 tuples/sec | 999.37 tuples/sec | +0.5% | -14.3% |
| bs=100 sw=10 sl=64 | MB/s | 0.522 MB/s | 0.52 MB/s | 0.61 MB/s | +0.4% | -14.4% |
| bs=100 sw=10 sl=64 | p50 | 114,067 us | 115,893 us | 99,687 us | -1.6% | +14.4% |
| bs=100 sw=10 sl=64 | p95 | 142,266 us | 143,085 us | 106,271 us | -0.6% | +33.9% |
| bs=100 sw=10 sl=64 | p99 | 142,266 us | 143,085 us | 115,445 us | -0.6% | +23.2% |
| bs=1000 sw=10 sl=64 | throughput | 958 tuples/sec | 975 tuples/sec | 1,036 tuples/sec | -1.7% | -7.5% |
| bs=1000 sw=10 sl=64 | MB/s | 0.585 MB/s | 0.595 MB/s | 0.632 MB/s | -1.7% | -7.5% |
| bs=1000 sw=10 sl=64 | p50 | 1,046,739 us | 1,017,296 us | 970,675 us | +2.9% | +7.8% |
| bs=1000 sw=10 sl=64 | p95 | 1,073,850 us | 1,068,932 us | 1,011,928 us | +0.5% | +6.1% |
| bs=1000 sw=10 sl=64 | p99 | 1,073,850 us | 1,068,932 us | 1,045,045 us | +0.5% | +2.8% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,476.85,200,128000,419,0.256,22770.33,30542.57,30542.57
1,100,10,64,20,2337.37,2000,1280000,856,0.522,114067.17,142265.90,142265.90
2,1000,10,64,20,20868.46,20000,12800000,958,0.585,1046739.23,1073850.44,1073850.44
What changes were proposed in this PR?
Pin behavior of eight previously-untested Sklearn training operator descriptors in
common/workflow-operator. No production-code changes.SklearnTrainingBernoulliNaiveBayesOpDescSpecSklearnTrainingBernoulliNaiveBayesOpDescSklearnTrainingComplementNaiveBayesOpDescSpecSklearnTrainingComplementNaiveBayesOpDescSklearnTrainingGaussianNaiveBayesOpDescSpecSklearnTrainingGaussianNaiveBayesOpDescSklearnTrainingMultinomialNaiveBayesOpDescSpecSklearnTrainingMultinomialNaiveBayesOpDescSklearnTrainingDecisionTreeOpDescSpecSklearnTrainingDecisionTreeOpDescSklearnTrainingExtraTreeOpDescSpecSklearnTrainingExtraTreeOpDescSklearnTrainingExtraTreesOpDescSpecSklearnTrainingExtraTreesOpDescSklearnTrainingRandomForestOpDescSpecSklearnTrainingRandomForestOpDescBehavior pinned (shared
SklearnTrainingOpDesccontract)operatorInfoTraining: <model>) +Sklearn <name> Operatordescription; Sklearn Training group; singletraininginput port; one blocking outputcountVectorizer/tfidfTransformerfalse;target/textnullgetOutputSchemasmodel_name(STRING) +model(BINARY) keyed by the declared output portgeneratePythonCodemake_pipeline(...).fit(X, Y)training modelLogicalOpbase, with the correctoperatorTypediscriminatorAny related issues, documentation, discussions?
Part of the ongoing
workflow-operatorunit-test coverage effort (the training-side counterpart to the Sklearn classifier coverage in #5925/#5939/#5940/#5941/#5945/#5946/#5951).How was this PR tested?
sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.sklearn.training.*"— 40 tests, all greensbt "WorkflowOperator/Test/scalafmtCheck"andsbt "WorkflowOperator/scalafixAll --check"— cleanWas this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8 [1M context])