Remote logging fix#68370
Conversation
Signed-off-by: AndreiLeib <andrei.leibovski@appdirect.com>
jason810496
left a comment
There was a problem hiding this comment.
Nice! The patch make sense to me overall.
Would appreciate to add corresponding unit test when you have a moment, thanks.
| if conn_id in _REMOTE_LOGGING_CONN_CACHE: | ||
| return _REMOTE_LOGGING_CONN_CACHE[conn_id] |
There was a problem hiding this comment.
I would prefer to change the cache read side instead of writing side.
| if (cached_conn_id := _REMOTE_LOGGING_CONN_CACHE.get(conn_id, None)) is not None: | |
| return cached_conn_id |
There was a problem hiding this comment.
What is the value in storing None in the dict at all?
|
However the entire pre-fetch might be moot now. This is caching to a single supervisor process which shuts down when the task finishes. There's not much point in trying to pre-cache as it will always fail now? However this leads to an interesting problem -- some loggers such as GCP's StackDriver or AWS's Cloudwatch Logs want to upload logs as they come in, not in a batch at the end, and for those to work we need to know that we weren't able to get logs and try again once the task has started running and we can access connections. (Which isn't to say we shouldn't have a tactical fix like this to help most loggers, but the main issue is not resolved until we fix it for all) |
Yes, this occurred to me as well but I wanted to be as tactical with the fix as possible. I'll take a closer look at GCP StackDriver and AWS Cloudwatch remote logging sub-systems in a separate branch and see what can be done there to fix this issue. |
yep, will add one before marking ready for review; should be done in a day or two. |
closes: #68366
related discussion on Slack: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1780575542424439
Do not cache
Noneconnection values at the start of the worker lifecycle so that the worker is able to upload task logs upon task completion. Please see linked issue for full details.Tested via monkey patching on a test environment, logs showed up in the UI and in S3 with this change.
Was generative AI tooling used to co-author this PR?