test(workflow-operator): add unit test coverage for Sklearn training ensemble & linear descriptors by aglinxinyuan · Pull Request #5955 · apache/texera

aglinxinyuan · 2026-06-26T08:53:14Z

What changes were proposed in this PR?

Pin behavior of nine previously-untested Sklearn training operator descriptors in common/workflow-operator. No production-code changes.

Spec	Source class
`SklearnTrainingAdaptiveBoostingOpDescSpec`	`SklearnTrainingAdaptiveBoostingOpDesc`
`SklearnTrainingBaggingOpDescSpec`	`SklearnTrainingBaggingOpDesc`
`SklearnTrainingGradientBoostingOpDescSpec`	`SklearnTrainingGradientBoostingOpDesc`
`SklearnTrainingLogisticRegressionOpDescSpec`	`SklearnTrainingLogisticRegressionOpDesc`
`SklearnTrainingLogisticRegressionCVOpDescSpec`	`SklearnTrainingLogisticRegressionCVOpDesc`
`SklearnTrainingPerceptronOpDescSpec`	`SklearnTrainingPerceptronOpDesc`
`SklearnTrainingPassiveAggressiveOpDescSpec`	`SklearnTrainingPassiveAggressiveOpDesc`
`SklearnTrainingSDGOpDescSpec`	`SklearnTrainingSDGOpDesc`
`SklearnTrainingLinearRegressionOpDescSpec`	`SklearnTrainingLinearRegressionOpDesc`

Behavior pinned (shared SklearnTrainingOpDesc contract)

Surface	Contract
`operatorInfo`	exact model name (`Training: <model>`) + `Sklearn <name> Operator` description; Sklearn Training group; single `training` input port; one blocking output
field defaults	`countVectorizer`/`tfidfTransformer` `false`; `target`/`text` `null`
`getOutputSchemas`	`model_name` (STRING) + `model` (BINARY) keyed by the declared output port
`generatePythonCode`	imports the matching sklearn estimator and builds the `make_pipeline(...).fit(X, Y)` training model
Round-trip	config fields preserved through the polymorphic `LogicalOp` base, with the correct `operatorType` discriminator

Any related issues, documentation, discussions?

Part of the ongoing workflow-operator unit-test coverage effort (the training-side counterpart to the Sklearn classifier coverage in #5925/#5939/#5940/#5941/#5945/#5946/#5951).

How was this PR tested?

sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.sklearn.training.*" — 45 tests, all green
sbt "WorkflowOperator/Test/scalafmtCheck" and sbt "WorkflowOperator/scalafixAll --check" — clean
CI to confirm

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8 [1M context])

…ensemble & linear descriptors

github-actions · 2026-06-26T09:00:51Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

No candidates found from git blame history.

Copilot

Pull request overview

This PR adds ScalaTest unit specs under common/workflow-operator to pin the current behavior of nine previously-untested Sklearn training operator descriptors (operator metadata, default config values, output schema shape, Python codegen basics, and JSON polymorphic round-trip via LogicalOp).

Changes:

Add 9 new AnyFlatSpec test classes covering SklearnTraining*OpDesc descriptors in the Sklearn training family.
Assert consistent operatorInfo contract (group, ports, blocking output) and config defaults (countVectorizer/tfidfTransformer false; target/text null).
Validate output schema fields (model_name STRING, model BINARY), basic codegen import expectations, and operatorType discriminator round-tripping.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingAdaptiveBoostingOpDescSpec.scala	Adds unit coverage for AdaBoost training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingBaggingOpDescSpec.scala	Adds unit coverage for Bagging training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingGradientBoostingOpDescSpec.scala	Adds unit coverage for Gradient Boosting training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingLinearRegressionOpDescSpec.scala	Adds unit coverage for Linear Regression training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingLogisticRegressionCVOpDescSpec.scala	Adds unit coverage for LogisticRegressionCV training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingLogisticRegressionOpDescSpec.scala	Adds unit coverage for Logistic Regression training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingPassiveAggressiveOpDescSpec.scala	Adds unit coverage for Passive Aggressive training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingPerceptronOpDescSpec.scala	Adds unit coverage for Perceptron training descriptor contract/codegen/round-trip.
common/workflow-operator/src/test/scala/org/apache/texera/amber/operator/sklearn/training/SklearnTrainingSDGOpDescSpec.scala	Adds unit coverage for SGD training descriptor contract/codegen/round-trip (using existing `SDG` naming in codebase).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov-commenter · 2026-06-26T09:10:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.91%. Comparing base (2ebfc28) to head (02aa8da).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #5955   +/-   ##
=========================================
  Coverage     54.91%   54.91%           
  Complexity     2956     2956           
=========================================
  Files          1117     1117           
  Lines         43133    43133           
  Branches       4648     4648           
=========================================
+ Hits          23685    23686    +1     
+ Misses        18054    18051    -3     
- Partials       1394     1396    +2

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`34.36% <ø> (ø)`		Carriedforward from 2ebfc28
amber	`57.02% <ø> (+<0.01%)`	⬆️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`51.56% <ø> (ø)`
file-service	`59.02% <ø> (ø)`
frontend	`48.67% <ø> (ø)`		Carriedforward from 2ebfc28
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`90.20% <ø> (ø)`		Carriedforward from 2ebfc28
python	`90.76% <ø> (ø)`		Carriedforward from 2ebfc28
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-06-26T09:18:01Z

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 5 worse · ⚪ 8 noise (<±5%) · 0 without baseline

Compared against main 2ebfc28 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	364	0.222	26,473/37,622/37,622 us	🔴 +11.6% / 🔴 +152.7%
🟢	bs=100 sw=10 sl=64	802	0.489	124,621/140,467/140,467 us	🟢 -6.1% / 🔴 +32.2%
⚪	bs=1000 sw=10 sl=64	916	0.559	1,090,698/1,118,560/1,118,560 us	⚪ within ±5% / 🔴 +12.4%

Baseline details

Latest main 2ebfc28 from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	364 tuples/sec	396 tuples/sec	781.13 tuples/sec	-8.1%	-53.4%
bs=10 sw=10 sl=64	MB/s	0.222 MB/s	0.242 MB/s	0.477 MB/s	-8.3%	-53.4%
bs=10 sw=10 sl=64	p50	26,473 us	24,766 us	12,542 us	+6.9%	+111.1%
bs=10 sw=10 sl=64	p95	37,622 us	33,719 us	14,886 us	+11.6%	+152.7%
bs=10 sw=10 sl=64	p99	37,622 us	33,719 us	17,580 us	+11.6%	+114.0%
bs=100 sw=10 sl=64	throughput	802 tuples/sec	820 tuples/sec	999.37 tuples/sec	-2.2%	-19.7%
bs=100 sw=10 sl=64	MB/s	0.489 MB/s	0.501 MB/s	0.61 MB/s	-2.4%	-19.8%
bs=100 sw=10 sl=64	p50	124,621 us	121,333 us	99,687 us	+2.7%	+25.0%
bs=100 sw=10 sl=64	p95	140,467 us	149,535 us	106,271 us	-6.1%	+32.2%
bs=100 sw=10 sl=64	p99	140,467 us	149,535 us	115,445 us	-6.1%	+21.7%
bs=1000 sw=10 sl=64	throughput	916 tuples/sec	926 tuples/sec	1,036 tuples/sec	-1.1%	-11.6%
bs=1000 sw=10 sl=64	MB/s	0.559 MB/s	0.565 MB/s	0.632 MB/s	-1.1%	-11.6%
bs=1000 sw=10 sl=64	p50	1,090,698 us	1,082,440 us	970,675 us	+0.8%	+12.4%
bs=1000 sw=10 sl=64	p95	1,118,560 us	1,103,515 us	1,011,928 us	+1.4%	+10.5%
bs=1000 sw=10 sl=64	p99	1,118,560 us	1,103,515 us	1,045,045 us	+1.4%	+7.0%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,548.96,200,128000,364,0.222,26473.10,37622.43,37622.43
1,100,10,64,20,2494.46,2000,1280000,802,0.489,124621.34,140467.21,140467.21
2,1000,10,64,20,21838.09,20000,12800000,916,0.559,1090698.21,1118560.47,1118560.47

test(workflow-operator): add unit test coverage for Sklearn training …

02aa8da

…ensemble & linear descriptors

Copilot AI review requested due to automatic review settings June 26, 2026 08:53

github-actions Bot added the common label Jun 26, 2026

github-actions Bot assigned aglinxinyuan Jun 26, 2026

Copilot started reviewing on behalf of aglinxinyuan June 26, 2026 08:59 View session

Copilot AI reviewed Jun 26, 2026

View reviewed changes

aglinxinyuan requested a review from xuang7 June 26, 2026 23:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(workflow-operator): add unit test coverage for Sklearn training ensemble & linear descriptors#5955

test(workflow-operator): add unit test coverage for Sklearn training ensemble & linear descriptors#5955
aglinxinyuan wants to merge 1 commit into
apache:mainfrom
aglinxinyuan:test-sklearn-training-ensemble-linear-descriptors

aglinxinyuan commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov-commenter commented Jun 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

aglinxinyuan commented Jun 26, 2026

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 26, 2026

Automated Reviewer Suggestions

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov-commenter commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 26, 2026

⚠️ Benchmark changes need a look

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Jun 26, 2026 •

edited

Loading