fix: add @Exclusive tag to prevent cluster-mutating tests from running in parallel #2360

delthas wants to merge 2 commits into development/2.14
Conversation
Hello delthas, my role is to assist you with the merge of this pull request.

Status report is not available.

Waiting for approval: the following approvals are needed before I can proceed with the merge.
Force-pushed from 57627e3 to ffb4756.
General thoughts:
- Waiting for you to rerun 3/4 times to analyze the real impact on both timing and flakiness
- Still wish we could modify the zenko operator to not reconcile every backbeat pod when they are not concerned at all by the configuration change
- One trick we may consider here: while we can't control the order of execution for the tests much, I believe cucumber runs the tests from top to bottom of the file, and may also run the files from a to z (that's why the kafka cleaner scenario has its file called zzz.xxx 🌚). So it might be interesting to put all the problematic tests together at the end of the file instead of having them in the middle
Other things:
- If we merge this PR, we need to update HOW_TO_WRITE_TESTS.md: document the new Exclusive tag, and drop rule 3 about not reconfiguring the env during tests
- William had a different mechanism based on locking a file so that only one worker performs a given task at a time; I think you can see the implementation in the Cli-testing folder. Worth taking a look and comparing his solution with yours
Edit:
After spending time on this CI, I have also come to the same conclusion that flakiness comes from those pods being killed and recreated many times while the tests are running, so this is definitely one of the big things we need to fix.
francoisferrand left a comment:
I have really mixed feelings on this one: we are just putting a band-aid on it, not addressing the actual underlying issue (the system should remain stable even when creating locations!!!), just masking the problem and increasing build time...
Maybe this is acceptable as a temporary measure or to facilitate investigations, but only in that context: work must not stop there.
- The system MUST remain stable and functional even through location creations/etc.
- Either there is a bug in these tests (not waiting on the right events or in the right way), or an issue in the software: but it must be fixed
- Introducing exclusivity is the exact opposite of good practices for tests (tests must be idempotent, work in parallel, etc.)
- Remove @exclusive from "Pause and resume archiving to azure" scenario: it only calls the Backbeat API to pause/resume lifecycle; no overlay change or operator reconciliation is triggered.
- Update HOW_TO_WRITE_TESTS.md: document the @exclusive tag in Rule 3, add a note to Rule 4 distinguishing it from atMostOnePicklePerTag.

Issue: ZENKO-5228
Force-pushed from a63364f to 149ade8.
For now, 2 successes out of 2 tries on the CTST runs (the previous failures were unrelated prerequisites).
This mechanism works the other way: it is used to "share" some action between different tests, by ensuring it is actually performed just once. Not sure how it translates to "exclusive" tests, which should block all other tests.
Followup created at ZENKO-5240.

I'd suggest merging this, then merging Sylvain's PR, then I work on ZENKO-5240.
As for documenting …
Flakiness is an issue that could not be tackled for years: the last attempt is the current one, where we try to first run locally and improve troubleshooting capabilities, so we can then eventually fix flakiness.

Another way of saying it: once we have identified the root cause (and reconciling is not the root cause, it's just a normal event), no problem tweaking the test to reduce flakiness while we do the actual fix (or don't, if we deem it not an issue in production). But as long as we don't, changing this with no plan for how to find the issue has a very high risk of just hiding the problem under the rug...

I also added a snapshot of the flaky tests identified by trunk.io in the ticket for reference when work continues. Not sure how precise or trustworthy the values are, but a few thoughts anyway:
Summary

Fixes intermittent CI failures in `ctst-end2end-sharded` caused by parallel cucumber workers interfering with each other when one worker runs scenarios that mutate cluster-wide state.

Problem

The `ctst-end2end-sharded` job runs cucumber with 4 parallel workers (`--parallel $PARALLEL_RUNS`). Some test scenarios create or modify Zenko locations via the management API, which triggers operator reconciliation and rolling restarts of backbeat components (replication data processor, notification processors, sorbet, etc.).

When these cluster-mutating scenarios run in one worker, the other 3 workers' tests are affected — their backbeat pods get killed and recreated mid-flight, causing replication timeouts, kafka cleaner failures, and azure archive restore retry timeouts.
Observed failure (run #8809)

8 out of 4418 scenarios failed.

Root cause timeline

- The CRUD scenario creates location `e2e-azure-archive-2-non-versioned` via `POST /config/{id}/location`.
- The operator reconciles, rolling `backbeat-replication-data-processor` across 6 different ReplicaSets. The data processor is killed and recreated 15 times.
- A replacement pod, blocked on the `backbeat-config` secret (v21 doesn't exist yet), is killed. The processor is completely down for 6 minutes.
- The CRUD scenario creates 3 locations + modifies 3 locations = 6 reconciliation rounds, each triggering a full rolling restart of all backbeat deployments.

The `waitForZenkoToStabilize()` call in the CRUD test only blocks that specific worker — the other 3 workers are unaware that pods are being churned.

Solution
Add an `@Exclusive` tag mechanism to cucumber's `setParallelCanAssign` that gives tagged scenarios exclusive access to all workers:

- While an `@Exclusive` scenario is running, no other scenario can start on any worker
- An `@Exclusive` scenario only starts when all other running scenarios have finished
- The existing `atMostOnePicklePerTag` logic for `@ColdStorage`, `@PRA`, etc. is preserved as a fallback

This is safe from races because the coordinator runs in a single Node.js process — `setParallelCanAssign` is called synchronously from the event loop when deciding work placement. Cucumber also has a built-in deadlock safety valve: if all workers go idle but pickles remain, it force-assigns the first one.

Scenarios tagged with `@Exclusive`

Note: "Pause and resume archiving to azure" was initially tagged but removed after review — it only calls the Backbeat API (`/_/lifecycle/pause/{location}`) and does not modify the overlay or trigger operator reconciliation.

Alternatives considered
- Move location creation to `configure-e2e-ctst.sh` — would eliminate the problem for the azure archive CRUD but doesn't generalize to other cluster-mutating scenarios (PRA, website). Would also require significant refactoring of the CRUD test itself.
- Tag-based ordering (run mutating tests in a separate phase) — cucumber doesn't natively support phased execution. Would require splitting into multiple `cucumber-js` invocations, losing the single-report output.
- Reduce parallelism globally — would slow down all tests, not just the problematic ones.

The `@Exclusive` approach is the most targeted: it only serializes the specific scenarios that cause cluster-wide churn, while allowing all other tests to run in parallel as before.

Estimated performance impact
Based on a successful run (attempt 4 of #8809):

Current pipeline: ~82 min → estimated with `@Exclusive`: ~95 min (+16%)

Without retries, the cost drops to ~10%. This is a worthwhile tradeoff for eliminating a major source of CI flakiness that currently requires re-running the entire job (adding 82+ min per retry).
Issue: ZENKO-5228
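The `@Exclusive` rules described in the Solution section can be sketched as the predicate passed to cucumber's `setParallelCanAssign`. This is a self-contained, simplified sketch: the `Pickle` shape is reduced to just tags, and `atMostOnePicklePerTag` is a local stand-in for the helper that ships with `@cucumber/cucumber`.

```typescript
// Minimal pickle shape for this sketch (the real cucumber Pickle has more fields).
interface Pickle {
    name: string;
    tags: { name: string }[];
}

const hasTag = (pickle: Pickle, tag: string): boolean =>
    pickle.tags.some((t) => t.name === tag);

// Stand-in for cucumber's atMostOnePicklePerTag helper (assumed behavior:
// never have two in-progress pickles sharing any of the listed tags).
const atMostOnePicklePerTag =
    (tags: string[]) =>
    (pickle: Pickle, inProgress: readonly Pickle[]): boolean =>
        tags.every(
            (tag) => !hasTag(pickle, tag) || !inProgress.some((p) => hasTag(p, tag)),
        );

const fallback = atMostOnePicklePerTag(["@ColdStorage", "@PRA"]);

// The predicate handed to setParallelCanAssign(pickle, picklesInProgress):
// - an @Exclusive pickle may start only when no other pickle is running;
// - while an @Exclusive pickle runs, nothing else may start;
// - otherwise, defer to the per-tag exclusivity fallback.
function canAssign(pickle: Pickle, inProgress: readonly Pickle[]): boolean {
    if (hasTag(pickle, "@Exclusive")) return inProgress.length === 0;
    if (inProgress.some((p) => hasTag(p, "@Exclusive"))) return false;
    return fallback(pickle, inProgress);
}
```

Because the coordinator evaluates this predicate synchronously in a single process, no extra locking is needed: the checks against `inProgress` cannot race with work assignment.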