fix(sync-service): Handle errors when collecting statistics #4002
magnetised wants to merge 1 commit into main from
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main    #4002      +/-  ##
===========================================
- Coverage   88.74%   75.75%   -12.99%
===========================================
  Files          25       11       -14
  Lines        2425      693     -1732
  Branches      611      171      -440
===========================================
- Hits         2152      525     -1627
+ Misses        271      167      -104
+ Partials        2        1        -1
```
I'm pretty sure we had at some point moved the telemetry to the top of the tree for some reason - can we verify that this doesn't conflict with any previous decisions and end up causing more problems? We should have left comments on it
Found the relevant PR #2722 @icehaunter - it was moved to the top because we wanted telemetry to shut down last, not to start first. Otherwise there are various telemetry operations happening during shutdown that don't find the telemetry processes alive to complete, I think.
Shouldn't we just keep it at the top but make the polling resilient to the stack not being present?
msfstef
left a comment
Since all of these functions exist explicitly for the purpose of collecting telemetry, I would rather we make them return appropriate tuples, so that in case of actual errors we report them as normal instead of almost silently stopping reporting (I get there's a warning log, but we don't want to make it an error log because that path also includes errors that are expected due to boot order).

- For `ShapeDb.statistics` it's a GenServer call, so I suppose we do need to try-catch - although we could make the function itself encapsulate the missing GenServer in its output as an error tuple
- For `ShapeDb.pending_buffer_size` it's an ETS query, so we should just guard for missing ETS internally
- For `ShapeCache.count_shapes` it drills down to `Query.count_shapes`, which already returns a tuple we handle, and a `WriteBuffer.pending_count_diff`, which is an ETS check - although I'm assuming that if `Query.count_shapes` passes then the ETS table should exist, so we would just need to return the error tuple to handle it

I think this would be cleaner and still allows us to catch unexpected errors in our logic? The GenServer one is probably the more annoying one.
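For the GenServer case, a rough sketch of what "encapsulate the missing GenServer as an error tuple" could look like (the `name/1` registry helper and the `:statistics` message are assumptions, not the actual module internals):

```elixir
# Hypothetical sketch: a missing or dying process becomes an expected
# {:error, _} tuple instead of an exit that the caller has to try-catch.
def statistics(stack_id) do
  case GenServer.whereis(name(stack_id)) do
    nil ->
      {:error, :not_started}

    pid ->
      try do
        {:ok, GenServer.call(pid, :statistics)}
      catch
        :exit, {:noproc, _} -> {:error, :not_started}
        :exit, {:timeout, _} -> {:error, :timeout}
      end
  end
end
```

The `whereis` check covers the boot-order race, while the `catch :exit` clauses cover the process dying between the lookup and the call.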
| ) | ||
|
|
||
| _ -> | ||
| try do |
why not make these `ShapeDb` functions return tuples and handle them appropriately? I would rather we don't swallow all errors in the telemetry, since some of these are exclusively called here, and any actual errors we want to catch and log would now be lost and we'd silently stop collecting telemetry
@msfstef I wrote a long disagreement with you but actually I think you're right. As you say, these functions are purely for telemetry and there should be a small range of "expected" errors from them.
msfstef
left a comment
Very nice! Minor comment to confirm that the non-error reporting stat is actually correct (sanity check)
```elixir
def pending_operations_count(stack_id) do
  :ets.info(operations_table_name(stack_id), :size)
rescue
  ArgumentError -> 0
```
I don't remember the exact semantics of this buffer, but I assume this is an in-memory buffer, so if the ETS table is not present then the operations count is indeed 0 - is this correct, or is there a chance of some persisted data not having been loaded and 0 being wrong?
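As an aside, the `rescue` could also be avoided by looking the table up first, since `:ets.whereis/1` returns `:undefined` for a missing named table instead of raising (a sketch, assuming the existing `operations_table_name/1` helper):

```elixir
# Hypothetical alternative: guard on table existence rather than
# rescuing the ArgumentError raised by :ets.info/2 on a missing table.
def pending_operations_count(stack_id) do
  case :ets.whereis(operations_table_name(stack_id)) do
    :undefined -> 0
    tid -> :ets.info(tid, :size)
  end
end
```

This still has a small window where the table can disappear between the lookup and the `info` call, so the `rescue` version is arguably the more robust of the two.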
Telemetry is currently started before the rest of the stack, which means that it starts trying to poll information after a few seconds, but there's no guarantee that the rest of the stack is available within the given timeframe.
Wrap all telemetry collection calls in try..catch to avoid spurious, transient errors.
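The shape of the wrapping described here could look roughly like this (the `safe_measure/2` helper name is hypothetical, not the actual code in the commit):

```elixir
defmodule Electric.Telemetry.Safe do
  require Logger

  # Hypothetical helper: run a single telemetry probe, returning a
  # default value if the underlying process or ETS table is not up
  # yet (boot-order races) so one failure doesn't break the cycle.
  def safe_measure(fun, default) do
    try do
      fun.()
    catch
      kind, reason ->
        Logger.warning("telemetry probe failed: #{inspect({kind, reason})}")
        default
    end
  end
end
```

A `catch kind, reason` clause traps errors, exits, and throws alike, which matches the "avoid spurious, transient errors" goal at the cost of also hiding genuine bugs - the trade-off discussed in the review above.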
https://electricsql-04.sentry.io/issues/102382605/?alert_rule_id=130196&alert_timestamp=1773223876687&alert_type=email¬ification_uuid=8fa23ea6-1e04-4a07-98bd-bd0765895424&project=4508410462404688&referrer=digest_email