Conversation

@yuxuan-z19
Contributor

  • Problem: PyTorch + multiprocessing with spawn avoids “Cannot re-initialize CUDA in forked subprocess,” but long-lived workers hold CUDA contexts, causing GPU memory to accumulate.
  • Solution: Added max_tasks_per_child as a configurable option in the executor. Setting it (e.g., to 1) forces each worker process to exit and restart after completing that many tasks, ensuring its CUDA context is released.
  • Impact / Considerations: Allows per-task process respawn to reclaim GPU memory.
  • Related issue: #330, CUDA workers do not release resources after each evaluate_program when using the spawn start method
  • Addition: Simplified Config initialization and dict conversion to reduce boilerplate and improve readability.
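The worker-recycling pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual executor; `evaluate` and `run_batch` are hypothetical names standing in for the real task function, and it requires Python 3.11+ for the `max_tasks_per_child` argument:

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def evaluate(x):
    # stand-in for a CUDA-touching task such as evaluate_program
    return x * x

def run_batch(values):
    # spawn avoids "Cannot re-initialize CUDA in forked subprocess";
    # max_tasks_per_child=1 retires each worker after a single task,
    # so any CUDA context it created dies with the process
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=2,
                             mp_context=ctx,
                             max_tasks_per_child=1) as pool:
        return list(pool.map(evaluate, values))

if __name__ == "__main__":
    # guard is required with spawn: children re-import this module
    print(run_batch([1, 2, 3]))
```

The trade-off is process startup cost on every task, which is usually acceptable when each task is a long-running GPU evaluation.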

@yuxuan-z19
Contributor Author

Python older than 3.11 does not support max_tasks_per_child in ProcessPoolExecutor; fall back to the spawn context alone on those versions (see docs)
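A version-guarded construction along these lines would handle the fallback. This is a sketch under stated assumptions, not the merged code; `make_executor` is a hypothetical helper name:

```python
import sys
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def make_executor(max_workers=1):
    # always use the spawn start method for CUDA safety
    kwargs = {"max_workers": max_workers,
              "mp_context": mp.get_context("spawn")}
    if sys.version_info >= (3, 11):
        # worker recycling is only available from Python 3.11 onward:
        # retire each worker after one task so its CUDA context is freed
        kwargs["max_tasks_per_child"] = 1
    return ProcessPoolExecutor(**kwargs)
```

On older interpreters the pool still uses spawn, but workers are long-lived, so GPU memory is only reclaimed when the pool shuts down.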

@codelion codelion merged commit 9290243 into algorithmicsuperintelligence:main Nov 27, 2025
2 checks passed
@yuxuan-z19 yuxuan-z19 deleted the zyx-fix-torch branch November 27, 2025 11:25
@yuxuan-z19 yuxuan-z19 restored the zyx-fix-torch branch November 27, 2025 11:25