
Worker: memory management improvements#1366

Draft
josephjclark wants to merge 4 commits into main from worker-memory

Conversation

@josephjclark
Collaborator

Responding to a spate of OOM kills and lost runs: some experiments with memory management.

Fixes #897

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

github-project-automation bot moved this to New Issues in Core on Apr 13, 2026
@josephjclark
Collaborator Author

josephjclark commented Apr 14, 2026

I don't understand why this is failing in tests 🤔

Running around in circles. Even in code where the JSON streaming function isn't called, it seems to fail.

I've also bumped into a possible issue with circular structures. It makes me nervous: shouldn't JSON.stringify throw if it hits a circular structure? That could be hurting us in production right now.

EDIT: Well, we do seem to handle circular structures - the runtime must be dealing with that before events are emitted:

{
  "x": {
    "x": "[Circular]"
  },
  "data": {}
}
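For reference: plain `JSON.stringify` does throw a `TypeError` on circular input, so something upstream must be substituting the `[Circular]` marker. A minimal sketch of how such a replacer could work (illustrative names only - this is not the actual runtime code):

```javascript
// Plain JSON.stringify throws on circular input; a replacer with a
// WeakSet of visited objects can substitute a "[Circular]" marker instead.
function safeStringify(value) {
  const seen = new WeakSet();
  return JSON.stringify(value, (key, val) => {
    if (typeof val === 'object' && val !== null) {
      if (seen.has(val)) return '[Circular]'; // already visited: break the cycle
      seen.add(val);
    }
    return val;
  });
}

const x = {};
x.x = x; // circular reference
const state = { x, data: {} };

// Without a replacer, stringify throws a TypeError:
let threw = false;
try {
  JSON.stringify(state);
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true

console.log(safeStringify(state)); // {"x":{"x":"[Circular]"},"data":{}}
```

The output matches the shape seen above, which is consistent with the runtime sanitising state before events are emitted.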

@josephjclark
Collaborator Author

Managed to fix one of the failing tests, but still getting failures. Will have to look into this later. Claude was unable to help.

@josephjclark josephjclark changed the base branch from main to release/next April 24, 2026 12:35
@josephjclark
Collaborator Author

So nervous about this PR. It's a tiny change really, and it works just great locally. But my tests go berserk - random tests seem to fail to return basic events, for no good reason that I can see. It's very hard to debug because it's cross-process. Not sure.

@josephjclark
Collaborator Author

josephjclark commented Apr 24, 2026

If I stare at this any longer I'll go actually mad.

But I might have figured it out.

It's just the asynchronicity in the logging.

If you make the stringify algorithm async, it exhibits exactly the same behaviour: events stop being emitted. I thought it was a hidden exception in the async iterator.

But no - what I think is happening is that the workflow finishes while events are still being stringified for emission by the child process, and the child process gets killed as soon as the run is over.

We need a tracker somewhere which makes sure that all pending events are properly sent before shutdown.

This behaviour does make me wonder if we have an async hole in prod which is the underlying cause of some rare exits. Could the worker thread be getting closed before messages have been sent?

EDIT: I'm not 100% sold on this theory (just allowing the worker more time before closing doesn't help), but I'm sure it's something around the async message processing. This may also cause events to exit the worker out of sequence, so maybe we need a queue here too, just like in the main engine (same queue?)
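One possible shape for that tracker (purely a sketch with hypothetical names, not code from this branch): keep a set of in-flight serialisation promises and await them all before the child process is allowed to shut down.

```javascript
// Sketch: track in-flight async event serialisations so the process
// isn't torn down while events are still being stringified.
// All names here are hypothetical.
const pending = new Set();

// Wrap an async emit so its promise is tracked until it settles.
function trackEmit(promise) {
  pending.add(promise);
  promise.finally(() => pending.delete(promise));
  return promise;
}

// Call this before killing the child process / worker thread.
async function flushPendingEvents() {
  // Pending emits may themselves queue more emits, so loop until drained.
  while (pending.size > 0) {
    await Promise.allSettled([...pending]);
  }
}

// Example: a fake async stringify-and-emit.
const emitted = [];
function emitEvent(event) {
  return trackEmit(
    Promise.resolve().then(() => {
      emitted.push(JSON.stringify(event));
    })
  );
}

emitEvent({ type: 'log', message: 'hello' });
emitEvent({ type: 'complete' });

flushPendingEvents().then(() => {
  console.log(emitted.length); // 2: nothing lost to an early exit
});
```

Note that a set like this only guarantees completeness; a serial queue (like the one in the main engine) would additionally preserve event ordering.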

Base automatically changed from release/next to main April 27, 2026 09:33


Development

Successfully merging this pull request may close these issues.

Worker: Runs get lost if the dataclip is >150mb
