-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML] Fix Non-Deterministic Training Set Selection in RegressionIT testTwoJobsWithSameRandomizeSeedUseSameTrainingSet #138063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| public void testTwoJobsWithSameRandomizeSeedUseSameTrainingSet() throws Exception { | ||
| String sourceIndex = "regression_two_jobs_with_same_randomize_seed_source"; | ||
| indexData(sourceIndex, 100, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexData is calling directly client().admin().indices().prepareCreate() instead of prepareCreate() of the test framework. This ensures that the index has always 1 shard.
However, it still can have multiple segments which then leads to non-deterministics order in which the reservoir sampling might get the documents. Hence, we need to fix both the shards and the segments to 1.
|
Pinging @elastic/ml-core (Team:ML) |
davidkyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* main: (135 commits)
Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=1} elastic#138130
Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=2} elastic#138129
Mute org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT testSearchWithRandomDisconnects elastic#138128
[DiskBBQ] avoid EsAcceptDocs bug by calling cost before building iterator (elastic#138127)
Log NOT_PREFERRED shard movements (elastic#138069)
Improve bulk loading of binary doc values (elastic#137995)
Add internal action for getting inference fields and inference results for those fields (elastic#137680)
Address issue with DateFieldMapper#isFieldWithinQuery(...) (elastic#138032)
WriteLoadConstraintDecider: Have separate rate limiting for canRemain and canAllocate decisions (elastic#138067)
Adding NodeContext to TransportBroadcastByNodeAction (elastic#138057)
Mute org.elasticsearch.simdvec.ESVectorUtilTests testSoarDistanceBulk elastic#138117
Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#137909
Backport batched_response_might_include_reduction_failure version to 8.19 (elastic#138046)
Add summary metrics for tdigest fields (elastic#137982)
Add gp-llm-v2 model ID and inference endpoint (elastic#138045)
Various tracing fixes (elastic#137908)
[ML] Fixing KDE evaluate() to return correct ValueAndMagnitude object (elastic#128602)
Mute org.elasticsearch.xpack.shutdown.NodeShutdownIT testStalledShardMigrationProperlyDetected elastic#115697
[ML] Fix Flaky Audit Message Assertion in testWithDatastream for RegressionIT and ClassificationIT (elastic#138065)
[ML] Fix Non-Deterministic Training Set Selection in RegressionIT testTwoJobsWithSameRandomizeSeedUseSameTrainingSet (elastic#138063)
...
# Conflicts:
# rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/200_dense_vector_docvalue_fields.yml
The test
testTwoJobsWithSameRandomizeSeedUseSameTrainingSetfails intermittently because documents may be processed in different orders during reindexing. Since we use an online reservoir sampling algorithm, this order actually matters. To ensure deterministic reindexing of the document sequence, both the number of shards and the number of segments must be 1.This PR fixes the test by creating the source index with only 1 segment. This ensures deterministic document order during reindexing, resulting in consistent ID assignments and training set selection when using the same seed.
Fixes #117805