feat(action-log): Outbox-based GroupActionLogEntry management #117836
kcons wants to merge 4 commits into
| category=OutboxCategory.GROUP_ACTION_LOG_EVENT, | ||
| # Could use a random int to avoid the DB round-trip at the cost of | ||
| # negligible (~1/2^63) collision risk within a shard. | ||
| object_identifier=CellOutbox.next_object_identifier(), |
I always find this pattern a little funny. Rather than adding a new incrementing counter to postgres for this, we just hop on to the primary key instead 😆
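For reference, the random-identifier alternative mentioned in the code comment could look roughly like the sketch below. Both function names are hypothetical and are not part of this PR or of `CellOutbox`; the only assumption is that the identifier must fit a signed 64-bit column, hence 63 random bits. The second helper is a birthday-bound estimate: the ~1/2^63 figure is the chance that one given pair collides, and with n identifiers in a shard the chance of any collision is roughly n(n-1)/2 times that.

```python
import secrets


# Hypothetical helper, not part of the PR or of CellOutbox.
def random_object_identifier() -> int:
    """Return a random non-negative integer that fits a signed 64-bit column."""
    return secrets.randbits(63)


# Hypothetical helper for estimating the risk of skipping the DB round-trip.
def approx_collision_probability(n: int, bits: int = 63) -> float:
    """Birthday-bound estimate: n*(n-1)/2 pairs, each colliding with probability 2**-bits."""
    return n * (n - 1) / 2 / 2**bits
```

Under this estimate the risk stays small for realistic shard sizes, but it grows quadratically with the number of identifiers rather than staying fixed at 1/2^63.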
| # (process_shard → transaction.atomic), so the GALE is not yet committed. | ||
| # Defer to on_commit so the GALE is visible to readers on other connections. | ||
| using = router.db_for_write(GroupActionLogEntry) | ||
| if p["force_async_derived"]: |
Oh interesting, is this just to prevent group log task processing or task queueing failures from tanking the outbox transaction?
Yep, the on_commit is. The choice between async or not is there to show we can accommodate different latency/consistency tolerance trade-offs. It'd be too specific for what we have currently, but we may also want to do "don't let derived failures escape; if we have a DB issue, schedule a task so we can move on and be eventually consistent".
Yeah, that's definitely a concern. My thought is to just have this enqueue the task and ignore the synchronous form of processing unless we have a really compelling use case for it, but this seems fine.
It's not just failure isolation, it's also that GALE is expected to be on a different db from derived data, so pre-commit it's not really canonical. I'll add a comment here clarifying the justification.
Also, technically, scheduling the task pre-commit could mean we run the task before the new entry is visible, which is broken, but our task system isn't low-latency enough in prod for that to be a real risk.
The problem with task-only is that we don't get the "mutating request gets mutated response" property without additional wiring. If we can get that by default (and we should be able to), that's nice. If not, maybe we go default async and have "please make sure we're up-to-date" calls and/or "ensure derived data reflects changes in this block" contexts.
I see. So if that's the case, couldn't we hoist this task creation or synchronous evaluation logic up out of the receiver entirely? Like, it could live in issue/action_log/base.py and provide the same guarantee, unless I'm missing something that on_transaction is doing differently.
Either the outbox transaction and flush succeeded, which means we've exited the outbox_context decorator already, or one of the two failed, and we can't guarantee anything.
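To make the behaviour discussed in this thread concrete, here is a minimal sketch of "defer derived processing until commit, then go sync or async". It is a toy model, not the PR's code: `FakeTransaction`, `handle_entry`, `task_queue`, and `derived` are hypothetical names, and the real hook would presumably be Django's `transaction.on_commit(func, using=...)`. Only the `force_async_derived` flag comes from the diff.

```python
from typing import Callable, List


class FakeTransaction:
    """Toy stand-in for a DB transaction; callbacks run only after a clean exit."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[], None]] = []
        self.committed = False

    def on_commit(self, fn: Callable[[], None]) -> None:
        # Register work that must not run until the transaction has committed.
        self._callbacks.append(fn)

    def __enter__(self) -> "FakeTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            self.committed = True
            for fn in self._callbacks:
                fn()
        # On error the callbacks are dropped, mirroring a rollback.
        return False


def handle_entry(
    txn: FakeTransaction,
    entry_id: int,
    force_async_derived: bool,
    task_queue: List[int],
    derived: List[int],
) -> None:
    """Hypothetical receiver: decide sync vs async, but only act after commit."""

    def dispatch() -> None:
        if force_async_derived:
            task_queue.append(entry_id)  # stand-in for enqueueing a task
        else:
            derived.append(entry_id)  # stand-in for updating derived data inline

    txn.on_commit(dispatch)


if __name__ == "__main__":
    queue: List[int] = []
    derived_rows: List[int] = []
    with FakeTransaction() as txn:
        handle_entry(txn, 1, False, queue, derived_rows)
        handle_entry(txn, 2, True, queue, derived_rows)
        # Nothing has run yet: the entry is not committed at this point.
        assert queue == [] and derived_rows == []
    print(derived_rows, queue)
```

In this model nothing derived is touched before the commit, and a rollback silently drops the pending work, which is the failure-isolation and visibility property described above.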
Use standard outbox infrastructure.
Better description TODO.