Refactor e2e RustFS fault plans #143

Merged
GatewayJ merged 17 commits into rustfs:main from GatewayJ:codex/refactor-fault-plan on Jun 28, 2026

GatewayJ (Member) commented Jun 20, 2026

Type of Change

  • New Feature
  • Bug Fix
  • Documentation
  • Performance Improvement
  • Test/CI
  • Refactor
  • Other:

Related Issues

N/A

Summary of Changes

Adds a resolved fault-run contract and lifecycle event stream for RustFS e2e fault tests.

This PR introduces stable run-spec.yaml / run-spec.json artifacts, JSONL run events for future visualization, configurable RustFS pod and volume assumptions, stronger fault plan validation, and stricter artifact validation in the shell runner. It also keeps composite multi-fault execution gated until an explicit composition policy exists.

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit (fmt-check + clippy + test + console-lint + console-fmt-check)
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CHANGELOG.md updated under [Unreleased] (if user-visible change)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (CRD/API compatibility)
  • Requires doc/config/deployment update
  • Other impact: fault-test artifacts now include run contract and lifecycle event files

Verification

make pre-commit

Additional Notes

The destructive live fault scenario itself still requires a dedicated real Kubernetes or K3s cluster with the required fault-test environment variables and Chaos Mesh/device-mapper prerequisites.

GatewayJ force-pushed the codex/refactor-fault-plan branch from dbe6653 to 05b6dcf on June 28, 2026 08:28
GatewayJ marked this pull request as ready for review on June 28, 2026 08:31
GatewayJ added this pull request to the merge queue on Jun 28, 2026
Merged via the queue into rustfs:main with commit 5428d58 on Jun 28, 2026
3 checks passed

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05b6dcfead

Comment thread e2e/src/fault/runner.rs
"checker-pre-recommit-report.json",
&serde_json::to_string_pretty(&pre_recommit_report)?,
)?;
if let Err(error) = pre_recommit_report.require_success() {

P1: Defer strict checks until unknown overwrites are recommitted

When the mixed workload times out or gets an unknown result for an overwrite that actually lands, check_s3_history still models the previous committed value as live and reports a hash mismatch; requiring success here fails the run before recommit_unconfirmed_objects below can reconcile that accepted S3 timeout/unknown outcome. This makes fault scenarios fail on a normal ambiguous-write case rather than on data loss or corruption.
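
One way to picture the suggested deferral is a gate that ignores findings on keys whose last write is still unconfirmed, and leaves those keys to the post-recommit check. The sketch below is illustrative only: `Finding`, `Report`, `excluding`, and `pre_recommit_gate` are hypothetical names, not the types in `e2e/src/fault`, and the real report presumably carries more detail than a key and a kind.

```rust
use std::collections::HashSet;

/// One checker finding, reduced to the key it concerns (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Finding {
    pub key: String,
    pub kind: &'static str,
}

/// Stand-in for the pre-recommit checker report (assumed, not the PR's type).
#[derive(Debug, Default)]
pub struct Report {
    pub findings: Vec<Finding>,
}

impl Report {
    /// Strict check: any finding at all is an error.
    pub fn require_success(&self) -> Result<(), String> {
        if self.findings.is_empty() {
            Ok(())
        } else {
            Err(format!("{} finding(s)", self.findings.len()))
        }
    }

    /// Copy of the report without findings on keys whose write outcome is unknown.
    pub fn excluding(&self, unconfirmed: &HashSet<String>) -> Report {
        Report {
            findings: self
                .findings
                .iter()
                .filter(|f| !unconfirmed.contains(&f.key))
                .cloned()
                .collect(),
        }
    }
}

/// Before recommit, fail only on keys with a known write outcome.
/// Unconfirmed keys are judged by the strict check that runs after recommit.
pub fn pre_recommit_gate(report: &Report, unconfirmed: &HashSet<String>) -> Result<(), String> {
    report.excluding(unconfirmed).require_success()
}
```

Under this assumption, a hash mismatch on a key with a timed-out overwrite no longer aborts the run early, while a mismatch on any other key still fails immediately.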

Comment thread e2e/src/fault/checker.rs
let s3 = s3.clone();
let recorder = recorder.clone();
async move {
let get = s3.get_object_result(&key, &recorder).await?;

P1: Avoid recording unknown-write probes in shared history

For a timed-out or unknown PUT that later materializes, this pre-recommit probe records a successful GET in history.jsonl while the model still has no committed live value for that key. The final checker then replays that probe as unexpected_visible_deleted_objects, so the scenario fails even after the recommit step succeeds; use a non-recording probe or tag these records so anomaly detection ignores them.
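
The tagging option could look roughly like the following. This is a sketch under assumptions: `Origin`, `GetRecord`, and `unexpected_visible` are hypothetical stand-ins, and the function is a simplified version of the kind of rule behind `unexpected_visible_deleted_objects`, not the checker's actual logic or the real `history.jsonl` schema.

```rust
use std::collections::HashSet;

/// Who issued a recorded GET (hypothetical tag, not a field in the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Origin {
    /// Issued by the workload under test.
    Workload,
    /// Issued by the pre-recommit probe for an unconfirmed write.
    RecommitProbe,
}

/// Minimal history record for a GET (assumed shape).
#[derive(Debug, Clone)]
pub struct GetRecord {
    pub key: String,
    pub found: bool,
    pub origin: Origin,
}

/// Keys that a workload GET found although the model holds no live value for them.
/// Probe records are skipped, so they cannot be replayed as anomalies.
pub fn unexpected_visible(history: &[GetRecord], live: &HashSet<String>) -> Vec<String> {
    history
        .iter()
        .filter(|r| r.origin == Origin::Workload)
        .filter(|r| r.found && !live.contains(&r.key))
        .map(|r| r.key.clone())
        .collect()
}
```

A non-recording probe would reach the same result by never writing the record at all; tagging keeps the probe visible in the history for debugging at the cost of one extra field.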