fix: [SDK-4474] self-heal SDK-4388-stuck push subscriptions on session start#2636
Conversation
2716a28 to
2130fd8
Compare
abdulraqeeb33
left a comment
There was a problem hiding this comment.
a few nits that i think we should change
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:
1. early `return null` for the cache miss / non-push guard
2. final `return if (divergent) { ... } else { null }`
No behavior change. Addresses review feedback on #2636.
Refs: SDK-4474
7a1c3ce to
4dffa22
Compare
abdulraqeeb33
left a comment
There was a problem hiding this comment.
LGTM with just one nit
Adds three tests covering branches in `updateExistingSubscriptionFromCreate` that the SDK-4388 fix introduced but didn't fully exercise: 1. Create + Delete with non-local id is a successful no-op (locks in the delete short-circuit so a future refactor can't accidentally route to PATCH and resurrect a deleted subscription). 2. Network-retryable backend errors on the PATCH path propagate as FAIL_RETRY with retryAfterSeconds (matches the local-id Create behavior and ensures errors aren't swallowed). 3. Multiple batched updates with a non-local-id Create collapse into a single PATCH carrying the *last* update's values (locks in last-wins semantics; mirrors the existing local-id POST test). These complement the two existing non-local-id tests and bring coverage for the new path to parity with the local-id Create path. Made-with: Cursor
When updateSubscription gets a 404 outside the missing-retry window we re-enqueue a CreateSubscriptionOperation so OperationRepo can rebuild the subscription. Reusing the original non-local subscriptionId on that recovery op routes it back through execute() -> updateExistingSubscriptionFromCreate -> PATCH, which immediately 404s again and re-enqueues another non-local Create, looping indefinitely without backoff (observed at ~4-6 PATCH/sec until the backend rate-limits with 429s). Generate a fresh local id via IDManager.createLocalId() for the recovery Create. A local id forces execute() down the POST/rebuild-user path in createSubscription, which is the one that can actually re-create the subscription on the backend.
Locks in the recovery shape after an update-subscription 404 outside the missing-retry window: one CreateSubscriptionOperation in the response operations, with a fresh local subscriptionId (not the original non-local one) so the next execute() dispatches it through the POST / rebuild-user path instead of looping back into PATCH -> 404. All other fields (appId, onesignalId, externalId, type, enabled, address, status) must propagate from the failed update so the recovered subscription matches the desired state.
24a3623 to
37a793c
Compare
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:
1. early `return null` for the cache miss / non-push guard
2. final `return if (divergent) { ... } else { null }`
No behavior change. Addresses review feedback on #2636.
Refs: SDK-4474
4dffa22 to
bda262f
Compare
📊 Diff Coverage ReportDiff Coverage Report (Changed Lines Only)Gate: aggregate coverage on changed executable lines must be ≥ 80% (JaCoCo line data for lines touched in the diff). Changed Files Coverage
Overall (aggregate gate)26/27 touched executable lines covered (96.3% — requires ≥ 80%) |
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:
1. early `return null` for the cache miss / non-push guard
2. final `return if (divergent) { ... } else { null }`
No behavior change. Addresses review feedback on #2636.
Refs: SDK-4474
bda262f to
cd1c0a7
Compare
Strip "SDK-4388" prefixes from comments in SubscriptionOperationExecutor and its test class. The descriptive context that followed the ticket id already explains the why; the ticket id itself is noise for future maintainers and is preserved in git history / commit messages.
720e7ea to
3117820
Compare
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:
1. early `return null` for the cache miss / non-push guard
2. final `return if (divergent) { ... } else { null }`
No behavior change. Addresses review feedback on #2636.
Refs: SDK-4474
cd1c0a7 to
5792403
Compare
|
@claude[agent] review |
Code review complete. All feedback from previous review rounds has been properly addressed: ✅ Function renamed to The implementation looks solid:
The code is ready to merge from a quality perspective. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
…l id
When PATCH /subscriptions/{id} returns 404 outside the missing-retry
window, updateSubscription() already mints a fresh local id for the
recovery CreateSubscriptionOperation to break the PATCH -> 404 loop.
However, the cached SubscriptionModel and ConfigModel.pushSubscriptionId
still hold the stale non-local id, so the rebuild-user path
(LoginUserOperationExecutor.getRebuildOperationsIfCurrentUser) emits a
second CreateSubscriptionOperation carrying that stale id. Both Creates
end up in the POST /users payload and the backend creates two
subscription rows for the same device.
Rewrite the cached SubscriptionModel id and pushSubscriptionId to the
new local id (HYDRATE so no listener reacts) before returning the
recovery op. The rebuild path then emits a Create with the same local
id, dedupe collapses the two ops, and POST /users carries a single
subscription as it does on the un-fixed path.
… on PATCH 404 Extend the PATCH-404 recovery regression test to assert that the cached SubscriptionModel id and ConfigModel.pushSubscriptionId are rewritten to the same fresh local id used on the recovery CreateSubscriptionOperation. Without this realignment the rebuild-user path would emit a duplicate Create with the stale non-local id and LoginUserOperationExecutor's POST /users would create two subscription rows on the backend for the same device.
…n start
The SDK-4388 fix prevents new installs from getting stuck (POST -> PATCH
flip in SubscriptionOperationExecutor), but already-stuck users on older
SDK builds will not self-heal on a plain in-place upgrade: the buggy POST
already returned 200 {} and the local op queue is empty by the time the
fixed SDK runs.
Add a divergence check to RefreshUserOperationExecutor.getUser. The SDK
already calls GET /users/by/onesignal_id/{osid} on every session start,
and the response carries the server view of `enabled` /
`notification_types`. The "device is source of truth" policy for push
subscriptions is preserved (we still don't hydrate the server's push
subscription into the local model), but when the cached local model
resolves to enabled-and-opted-in while the server's record of the same
subscription id reports disabled, we now emit a follow-up
UpdateSubscriptionOperation. That op runs through the standard executor
path and re-asserts local truth via PATCH /subscriptions/{id} on the
next session start - one extra request per stuck device, no consumer
code changes, no migration script, no toggle.
The existing happy-path test in RefreshUserOperationExecutorTests is
updated to set notificationTypes = 1 on the cached push subscription so
it stays on the new healthy path. Dedicated test coverage for the four
divergence-detection boundary cases is added in a follow-up commit.
Refs: SDK-4388
Add four test cases to RefreshUserOperationExecutorTests for the self-heal divergence detection added in the previous commit: 1. divergence: server reports the cached push subscription as disabled while the local model is opted-in and SUBSCRIBED -> exactly one UpdateSubscriptionOperation is returned with the correct ids, enabled=true, status=SUBSCRIBED, and the device's local address. 2. healthy: server matches local (enabled=true, notificationTypes=1) -> no follow-up op (existing healthy path stays unchanged). 3. opted-out: local optedIn=false even though server is also disabled -> no follow-up op (UNSUBSCRIBE is the user's intent, not drift). 4. NO_PERMISSION: local status=NO_PERMISSION with server disabled -> no follow-up op (OS-level denial is the real source of truth). Includes a buildSelfHealHarness helper that wires up the mocks for the four cases with minimal duplication. The test that triggers the WARN log line temporarily lowers Logging.logLevel to NONE to avoid the unmocked android.util.Log.w call in JVM unit tests. Refs: SDK-4388
…ption` Co-authored-by: abdulraqeeb33 <abdulraqeeb33@gmail.com>
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:
1. early `return null` for the cache miss / non-push guard
2. final `return if (divergent) { ... } else { null }`
No behavior change. Addresses review feedback on #2636.
Refs: SDK-4474
Strip "SDK-4388" / "SDK-4474" prefixes from comments and test descriptions in RefreshUserOperationExecutor and its test class. The descriptive context that followed the ticket id already explains the why; the ticket id itself is noise for future maintainers and is preserved in git history / commit messages.
Rename the local variable in RefreshUserOperationExecutor.getUser from pushSelfHealOperation -> pushSelfHealOperationForStuckSubscription so it matches the builder method buildPushSelfHealOperationForStuckSubscription that produces it. Pure rename, no behavior change.
5792403 to
58c51d1
Compare
Description
One Line Summary
On session start, detect when the server's record of the device's push subscription is disabled while the local model says enabled-and-opted-in, and re-assert local truth via
PATCH /subscriptions/{id}so users already stuck by SDK-4388 self-heal on next launch.Details
Motivation
The fix in #2627 (parent PR) prevents new installs from getting stuck by flipping
POST /subscriptionstoPATCH /subscriptions/{id}inSubscriptionOperationExecutorwhen the subscription id is server-assigned. That covers the forward path, but it does not rescue the cohort that already triggered the bug on an older SDK build:POST /subscriptionsreturned200 {}(server-side no-op for an existing id).create-subscription + update-subscription(SUBSCRIBED)group was therefore acked and dropped from the local op queue.enabled:false, notification_types:0permanently.This is confirmed with the SDK-4388 demo-app reproduction:
PATCH /subscriptions/{id}in the logs.Scope
RefreshUserOperationExecutor.getUser. On every session start the SDK already issuesGET /users/by/onesignal_id/{osid}, whose response carriesenabled/notification_typesper subscription. We now inspect the entry whoseidmatches the device's cached push subscription id, and iflocal.enabled && !server.enabled, we emit a follow-upUpdateSubscriptionOperationreturned viaExecutionResponse.operations. That op runs through the standard executor path and produces onePATCH /subscriptions/{id}carrying the device's true state.SubscriptionModelStore(the existingif (subscriptionModel.type != SubscriptionType.PUSH)filter is preserved). The new code only enforces that policy across the wire when divergence is detected.optedIn=false, statusUNSUBSCRIBE) or one withNO_PERMISSIONdoes not trigger a reconcile — those server-side disables are correct, not stuck.Cost
One extra
PATCH /subscriptions/{id}per stuck device on its first foreground session under the new SDK build. Healthy devices pay nothing. Once the PATCH succeeds, server matches local and no further reconciles fire.Testing
Unit testing
Added 4 cases to
RefreshUserOperationExecutorTests:enabled:false, notification_types:0+ local opted-in + SUBSCRIBED + non-empty token → returns exactly oneUpdateSubscriptionOperationwithenabled=true, status=SUBSCRIBED, correct ids.enabled:true, notification_types:1+ local opted-in → no follow-up op.enabled:false, notification_types:-2+ localoptedIn=false→ no follow-up op (UNSUBSCRIBE is intentional).enabled:false, notification_types:0+ localstatus=NO_PERMISSION→ no follow-up op (OS denial is real).The existing happy-path test
refresh user is successful and models are hydrated properlywas updated to setnotificationTypes = 1on the cached push subscription so it stays on the new healthy path.Test gate:
./gradlew :OneSignal:core:testReleaseUnitTest --tests \"...RefreshUserOperationExecutorTests\"→ 10/10 PASS. Two unrelatedSDKInitTestsfailures exist on the parent branch and are not introduced by this PR.Detekt + production assemble (
:OneSignal:core:detekt,:OneSignal:core:assembleRelease) clean.Manual testing
End-to-end repro and validation are intended to run on the demo app shipped with the parent SDK-4388 branch (which has the "Reproduce SDK-4388" + "Self-heal test" sections):
enabled:false, notification_types:0on dashboard).GET /users/by/onesignal_id/{osid}followed by a singlePATCH /subscriptions/{id}withenabled:true, notification_types:1.GET(no further PATCH) — divergence check is a no-op once reconciled.Affected code checklist
PATCH /subscriptions/{id}per stuck device on first session post-upgradeChecklist
Overview
RefreshUserOperationExecutor.getUserTesting
SDKInitTestsfailures are unrelated to this change)Final pass
buildPushSelfHealOperationOrNullwith explanatory comments referencing SDK-4388 / SDK-4474Branching: This PR targets the SDK-4388 fix branch (#2627 parent) so the two changes can land together as a complete "prevent new + heal old" pair. If the team prefers to ship the SDK-4388 fix first and follow up with this independently, retarget to
main.Refs: SDK-4474, SDK-4388