Add megatron_ray_fault_tolerant example with comprehensive fault tolerance implementation by xyuzh · Pull Request #19 · anyscale/examples

xyuzh · 2025-11-19T03:22:43Z

Summary

This PR adds a new production-ready example demonstrating fault-tolerant distributed training using Megatron and Ray. The implementation showcases how to build resilient ML training systems that can automatically recover from actor failures without losing progress.

What's New

🆕 megatron_ray_fault_tolerant Example

A complete implementation of PPO-style distributed training with enterprise-grade fault tolerance:

Automatic Failure Recovery: Detects and recovers from actor failures mid-training
Backup Actor Pool: Pre-allocated spare GPUs enable instant worker replacement
Zero Progress Loss: Checkpoint-based recovery resumes training from the last saved checkpoint, so only steps taken after that checkpoint are repeated
Process Group Management: Handles NCCL re-initialization after topology changes
Multi-Dimensional Parallelism: Supports DP, TP, PP, and CP for large-scale training
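
To make the relationship between the parallelism dimensions concrete, here is a minimal sketch of how the data-parallel size follows from the world size once TP, PP, and CP are fixed. The `ParallelConfig` class and its `data_parallel_size` method are hypothetical and not taken from the PR; `pipeline_model_parallel_size` and `context_parallel_size` match the field names in the `main.py` diff, while `tensor_model_parallel_size` is assumed by analogy.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    """Hypothetical config; only two of these field names appear in the PR diff."""

    tensor_model_parallel_size: int = 1  # assumed name, not shown in the diff
    pipeline_model_parallel_size: int = 1
    context_parallel_size: int = 1

    def data_parallel_size(self, world_size: int) -> int:
        """Return the number of data-parallel replicas for a given GPU count."""
        # GPUs consumed by one model replica.
        model_parallel = (
            self.tensor_model_parallel_size
            * self.pipeline_model_parallel_size
            * self.context_parallel_size
        )
        if world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by "
                f"TP*PP*CP={model_parallel}"
            )
        return world_size // model_parallel
```

With 4 workers and a tensor-parallel size of 2, this gives 2 data-parallel groups, so losing one data-parallel group means replacing 2 actors.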

Key Features

Fault Tolerance Architecture

Health Monitoring: Continuous actor health checks with timeout handling
Graceful Recovery:
- Destroy stale process groups
- Replace failed actors from backup pool
- Re-initialize with updated world size
- Reload state from distributed checkpoints
Backup Strategy: Configurable spare GPU count for various failure scenarios
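
The recovery steps above can be sketched in plain Python. This is an illustration of the ordering only, not the PR's implementation: `Worker`, `recover`, and every method name here are hypothetical stand-ins, and real code would call Ray actors and NCCL process groups instead of appending to a log.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    """Stand-in for a GPU actor; `alive` models the outcome of a health check."""

    name: str
    alive: bool = True
    rank: int = -1
    step: int = 0
    log: List[str] = field(default_factory=list)

    def destroy_process_group(self) -> None:
        self.log.append("destroy_pg")

    def init_process_group(self, rank: int, world_size: int) -> None:
        self.rank = rank
        self.log.append(f"init_pg rank={rank} world_size={world_size}")

    def load_checkpoint(self, step: int) -> None:
        self.step = step
        self.log.append(f"load_ckpt step={step}")


def recover(workers: List[Worker], backups: List[Worker], checkpoint_step: int) -> List[Worker]:
    """Replace dead workers from the backup pool and bring the group back up."""
    failed = [w for w in workers if not w.alive]
    if not failed:
        return workers
    if len(failed) > len(backups):
        raise RuntimeError("not enough backup workers to recover")

    # 1. Destroy stale process groups on the survivors.
    for w in workers:
        if w.alive:
            w.destroy_process_group()

    # 2. Fill each failed slot with a worker taken from the backup pool.
    new_group = []
    for w in workers:
        new_group.append(w if w.alive else backups.pop(0))

    # 3. Re-initialize every member with its rank and the current world size.
    for rank, w in enumerate(new_group):
        w.init_process_group(rank, len(new_group))

    # 4. Reload state so all members resume from the same checkpoint.
    for w in new_group:
        w.load_checkpoint(checkpoint_step)
    return new_group
```

Survivors also reload the checkpoint in this sketch, which keeps every rank at the same step after recovery.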

Distributed Training Capabilities

Megatron Integration: Full support for Megatron-Core parallelism strategies
Flexible Dispatch System:
- MeshDispatch: Smart data sharding across device mesh
- PassThroughDispatch: Broadcast operations to all workers
- Extensible registry for custom strategies
Cloud-Native Checkpointing: S3/GCS support with parallel I/O
PPO Training: Reference implementation with gradient accumulation
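
As a rough illustration of the dispatch idea, the sketch below shows a registry with one sharding strategy and one broadcast strategy. The names `DISPATCH_REGISTRY`, `register_dispatch`, `mesh_dispatch`, and `pass_through_dispatch` are hypothetical; the PR's `MeshDispatch` and `PassThroughDispatch` classes may have a different interface, and a real mesh dispatch would presumably also account for TP/PP/CP ranks, which this sketch ignores.

```python
from typing import Callable, Dict, List, Sequence

# Maps a strategy name to a function that turns one batch into per-worker inputs.
DISPATCH_REGISTRY: Dict[str, Callable[[Sequence, int], List[list]]] = {}


def register_dispatch(name: str):
    """Decorator that adds a dispatch function to the registry under `name`."""

    def decorator(fn):
        DISPATCH_REGISTRY[name] = fn
        return fn

    return decorator


@register_dispatch("mesh")
def mesh_dispatch(batch: Sequence, dp_size: int) -> List[list]:
    """Split the batch into `dp_size` contiguous shards, one per data-parallel rank."""
    if len(batch) % dp_size != 0:
        raise ValueError("batch size must be divisible by dp_size")
    shard = len(batch) // dp_size
    return [list(batch[i * shard:(i + 1) * shard]) for i in range(dp_size)]


@register_dispatch("pass_through")
def pass_through_dispatch(batch: Sequence, num_workers: int) -> List[list]:
    """Give every worker its own copy of the full input."""
    return [list(batch) for _ in range(num_workers)]
```

A custom strategy would be added by decorating another function with `register_dispatch("my_strategy")` and looking it up by name at call time.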

Testing & Validation

The example includes a built-in fault tolerance demonstration:

✅ Initialize 4 workers + 4 backup actors
✅ Run training step and save checkpoint
✅ Simulate failure by killing a data-parallel group
✅ Automatically recover using backup actors
✅ Resume training and verify correctness

Run the demo:
use run.sh

Submit to Anyscale:
anyscale job submit -f job.yaml

Resource Requirements:

GPU instances: g6e.12xlarge (4x L40S GPUs)
Scales: 0-2 nodes (0-8 GPUs + spares)
Storage: S3/GCS for checkpoints

Use Cases

Research: Fault-tolerant experiments on preemptible instances
Production: Reliable long-running training jobs
Cost Optimization: Leverage spot instances with auto-recovery
Large Models: Scale beyond single-node with parallelism
RL Training: PPO and similar on-policy algorithms

Related Work

This example builds on:

Megatron-LM parallelism strategies
Ray's actor model and placement groups

Future Enhancements

Virtual pipeline parallelism support
CPU offload optimization for faster recovery
Async checkpoint saving
Multi-node failure recovery testing
Integration with Ray Train

Note: This example requires GPU resources and cloud storage configuration. See the README for detailed setup instructions.

- Update Ray base image to 2.51.1 and vLLM to 0.11.0
- Add boto3 dependency for S3 operations
- Update transformers to 4.57.1 for compatibility
- Configure compute resources with auto-selection (max 520 CPU, 128 GPU)
- Add disk size configuration options for customer-hosted deployments
- Implement robust URL validation and error handling
- Add base64 image encoding for Arrow serialization
- Add JPEG format validation and 128x128 image resizing
- Scale model replicas from 1 to 32 for higher throughput
- Optimize batch sizes and memory usage for large-scale processing
- Implement session pooling for HTTP requests with retry logic
- Add timestamp-based output paths to /mnt/shared_storage
- Add run.sh script for job submission with HF_TOKEN

…rance
- Implements PPO-style training with Megatron and Ray
- Features automatic actor recovery from failures
- Includes backup actor pool for seamless replacement
- Supports DP, TP, PP, and CP parallelism
- Distributed checkpoint saving/loading
- Process group re-initialization after failures
- Added comprehensive documentation in README files

wujinspire · 2025-11-19T05:34:36Z

megatron_ray_fault_tolerant/main.py

+    pipeline_model_parallel_size: int = 1
+    context_parallel_size: int = 1
+    expert_model_parallel_size: int = 1
+    expert_tensor_parallel_size: int = 1


Need a large-scale recovery system.

wujinspire · 2025-11-19T05:35:47Z

megatron_ray_fault_tolerant/utils.py

+BasicType = Union[int, float, str, bool]
+
+
+@ray.remote(num_gpus=1)


Also need actual GPU actors.


robertnishihara and others added 5 commits November 18, 2025 19:09

some initial code

dc6b813

Add vllm

b497105

updates

812c876


xyuzh added 2 commits November 18, 2025 22:25

remove redundant branch

2929546

xyuzh force-pushed the megatron_ray_fault_tolerant branch from 2929546 to dffc213 Compare November 24, 2025 19:40

erictang000 and others added 2 commits November 24, 2025 19:47

change dp-pp init ordering to put things on same node

e16da1e

Add megatron_ray_fault_tolerant example with comprehensive fault tole…

9932155

…rance

xyuzh force-pushed the megatron_ray_fault_tolerant branch from dffc213 to 9932155 Compare November 24, 2025 19:49

erictang000 and others added 7 commits November 24, 2025 19:50

Merge branch 'megatron_ray_fault_tolerant' of https://github.com/xyuz…

405058e

…h/examples into HEAD

Merge branch 'megatron_ray_fault_tolerant' of https://github.com/xyuz…

e716c0b

…h/examples into HEAD

Update megatron_ray_fault_tolerant: job config, main script, and docu…

4936b7e

…mentation

Remove CLAUDE.md

519eb3b

Update main.py configuration

88eec3c

working baseline and in the process of optimizing checkpointing

eebe920

Merge branch 'megatron_ray_fault_tolerant' of https://github.com/xyuz…

e2f1c15

…h/examples into HEAD

xyuzh requested a review from robertnishihara November 26, 2025 17:35

erictang000 added 2 commits November 26, 2025 19:39

fast(ish) model checkpointing/loading with s3

760b93e

test with 8b on 4 nodes - models with tied word embeddings might fail…

ea97b77

… with > 2 nodes now, but that is ok

xyuzh wants to merge 18 commits into anyscale:main from xyuzh:megatron_ray_fault_tolerant
