Skip to content

fix: [SDK-4474] self-heal SDK-4388-stuck push subscriptions on session start#2636

Merged
sherwinski merged 15 commits into
mainfrom
sherwin/sdk-4474-self-heal-stuck-push-subscriptions
May 15, 2026
Merged

fix: [SDK-4474] self-heal SDK-4388-stuck push subscriptions on session start#2636
sherwinski merged 15 commits into
mainfrom
sherwin/sdk-4474-self-heal-stuck-push-subscriptions

Conversation

@sherwinski
Copy link
Copy Markdown
Contributor

@sherwinski sherwinski commented May 6, 2026

Description

One Line Summary

On session start, detect when the server's record of the device's push subscription is disabled while the local model says enabled-and-opted-in, and re-assert local truth via PATCH /subscriptions/{id} so users already stuck by SDK-4388 self-heal on next launch.

Details

Motivation

The fix in #2627 (parent PR) prevents new installs from getting stuck by flipping POST /subscriptions to PATCH /subscriptions/{id} in SubscriptionOperationExecutor when the subscription id is server-assigned. That covers the forward path, but it does not rescue the cohort that already triggered the bug on an older SDK build:

  • The buggy POST /subscriptions returned 200 {} (server-side no-op for an existing id).
  • The merged create-subscription + update-subscription(SUBSCRIBED) group was therefore acked and dropped from the local op queue.
  • On in-place upgrade to the fixed SDK there is nothing left in the queue to retry, and the server view stays at enabled:false, notification_types:0 permanently.

This is confirmed with the SDK-4388 demo-app reproduction:

  • force-stop on the buggy build (server stuck)
  • upgrade to the fixed build
  • relaunch — no PATCH /subscriptions/{id} in the logs.

Scope

  • Affected: RefreshUserOperationExecutor.getUser. On every session start the SDK already issues GET /users/by/onesignal_id/{osid}, whose response carries enabled / notification_types per subscription. We now inspect the entry whose id matches the device's cached push subscription id, and if local.enabled && !server.enabled, we emit a follow-up UpdateSubscriptionOperation returned via ExecutionResponse.operations. That op runs through the standard executor path and produces one PATCH /subscriptions/{id} carrying the device's true state.
  • Unchanged: "Device is the source of truth for push" policy. We still do not hydrate the server's push subscription into SubscriptionModelStore (the existing if (subscriptionModel.type != SubscriptionType.PUSH) filter is preserved). The new code only enforces that policy across the wire when divergence is detected.
  • No-op for healthy users. Server-matches-local triggers no PATCH.
  • Respects user intent. A locally-opted-out subscription (optedIn=false, status UNSUBSCRIBE) or one with NO_PERMISSION does not trigger a reconcile — those server-side disables are correct, not stuck.

Cost

One extra PATCH /subscriptions/{id} per stuck device on its first foreground session under the new SDK build. Healthy devices pay nothing. Once the PATCH succeeds, server matches local and no further reconciles fire.

Testing

Unit testing

Added 4 cases to RefreshUserOperationExecutorTests:

  1. divergence — server enabled:false, notification_types:0 + local opted-in + SUBSCRIBED + non-empty token → returns exactly one UpdateSubscriptionOperation with enabled=true, status=SUBSCRIBED, correct ids.
  2. healthy — server enabled:true, notification_types:1 + local opted-in → no follow-up op.
  3. opted-out — server enabled:false, notification_types:-2 + local optedIn=false → no follow-up op (UNSUBSCRIBE is intentional).
  4. NO_PERMISSION — server enabled:false, notification_types:0 + local status=NO_PERMISSION → no follow-up op (OS denial is real).

The existing happy-path test refresh user is successful and models are hydrated properly was updated to set notificationTypes = 1 on the cached push subscription so it stays on the new healthy path.

Test gate: ./gradlew :OneSignal:core:testReleaseUnitTest --tests \"...RefreshUserOperationExecutorTests\" → 10/10 PASS. Two unrelated SDKInitTests failures exist on the parent branch and are not introduced by this PR.

Detekt + production assemble (:OneSignal:core:detekt, :OneSignal:core:assembleRelease) clean.

Manual testing

End-to-end repro and validation are intended to run on the demo app shipped with the parent SDK-4388 branch (which has the "Reproduce SDK-4388" + "Self-heal test" sections):

  1. Force-stop demo on the broken SDK build with cached stuck state (enabled:false, notification_types:0 on dashboard).
  2. Upgrade in place to this branch, relaunch.
  3. Expect on first session: GET /users/by/onesignal_id/{osid} followed by a single PATCH /subscriptions/{id} with enabled:true, notification_types:1.
  4. Dashboard subscription flips from "Never Subscribed" to "Subscribed".
  5. On subsequent sessions: only GET (no further PATCH) — divergence check is a no-op once reconciled.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests — adds at most one PATCH /subscriptions/{id} per stuck device on first session post-upgrade
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing — adds a single divergence detection branch in RefreshUserOperationExecutor.getUser
  • No public API changes

Testing

  • Unit test coverage added for all four boundary cases
  • All automated tests pass (pre-existing SDKInitTests failures are unrelated to this change)
  • Personally tested on device — pending end-to-end verification with the demo-app repro flow described above

Final pass

  • Code is as readable as possible — divergence logic isolated in a single helper buildPushSelfHealOperationOrNull with explanatory comments referencing SDK-4388 / SDK-4474
  • Self-reviewed

Branching: This PR targets the SDK-4388 fix branch (#2627 parent) so the two changes can land together as a complete "prevent new + heal old" pair. If the team prefers to ship the SDK-4388 fix first and follow up with this independently, retarget to main.

Refs: SDK-4474, SDK-4388

@sherwinski sherwinski force-pushed the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch from 2716a28 to 2130fd8 Compare May 6, 2026 18:33
Copy link
Copy Markdown
Contributor

@abdulraqeeb33 abdulraqeeb33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few nits that i think we should change

sherwinski added a commit that referenced this pull request May 6, 2026
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:

  1. early `return null` for the cache miss / non-push guard
  2. final `return if (divergent) { ... } else { null }`

No behavior change. Addresses review feedback on #2636.

Refs: SDK-4474
@sherwinski sherwinski force-pushed the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch from 7a1c3ce to 4dffa22 Compare May 6, 2026 19:49
Copy link
Copy Markdown
Contributor

@abdulraqeeb33 abdulraqeeb33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with just one nit

fadi-george and others added 5 commits May 11, 2026 16:30
Adds three tests covering branches in `updateExistingSubscriptionFromCreate`
that the SDK-4388 fix introduced but didn't fully exercise:

1. Create + Delete with non-local id is a successful no-op (locks in the
   delete short-circuit so a future refactor can't accidentally route to
   PATCH and resurrect a deleted subscription).
2. Network-retryable backend errors on the PATCH path propagate as
   FAIL_RETRY with retryAfterSeconds (matches the local-id Create behavior
   and ensures errors aren't swallowed).
3. Multiple batched updates with a non-local-id Create collapse into a
   single PATCH carrying the *last* update's values (locks in last-wins
   semantics; mirrors the existing local-id POST test).

These complement the two existing non-local-id tests and bring coverage
for the new path to parity with the local-id Create path.

Made-with: Cursor
When updateSubscription gets a 404 outside the missing-retry window we
re-enqueue a CreateSubscriptionOperation so OperationRepo can rebuild
the subscription. Reusing the original non-local subscriptionId on that
recovery op routes it back through execute() ->
updateExistingSubscriptionFromCreate -> PATCH, which immediately 404s
again and re-enqueues another non-local Create, looping indefinitely
without backoff (observed at ~4-6 PATCH/sec until the backend rate-limits
with 429s).

Generate a fresh local id via IDManager.createLocalId() for the recovery
Create. A local id forces execute() down the POST/rebuild-user path in
createSubscription, which is the one that can actually re-create the
subscription on the backend.
Locks in the recovery shape after an update-subscription 404 outside the
missing-retry window: one CreateSubscriptionOperation in the response
operations, with a fresh local subscriptionId (not the original
non-local one) so the next execute() dispatches it through the POST /
rebuild-user path instead of looping back into PATCH -> 404. All other
fields (appId, onesignalId, externalId, type, enabled, address, status)
must propagate from the failed update so the recovered subscription
matches the desired state.
@sherwinski sherwinski force-pushed the ar/sdk-4388-android-sdk-subscription-state-permanently-stuck-at-never branch from 24a3623 to 37a793c Compare May 11, 2026 23:40
sherwinski added a commit that referenced this pull request May 12, 2026
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:

  1. early `return null` for the cache miss / non-push guard
  2. final `return if (divergent) { ... } else { null }`

No behavior change. Addresses review feedback on #2636.

Refs: SDK-4474
@sherwinski sherwinski force-pushed the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch from 4dffa22 to bda262f Compare May 12, 2026 00:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 12, 2026

📊 Diff Coverage Report

Diff Coverage Report (Changed Lines Only)

Gate: aggregate coverage on changed executable lines must be ≥ 80% (JaCoCo line data for lines touched in the diff).

Changed Files Coverage

  • RefreshUserOperationExecutor.kt: 26/27 touched executable lines (96.3%) (70 touched lines in diff)

Overall (aggregate gate)

26/27 touched executable lines covered (96.3% — requires ≥ 80%)

📥 View workflow run

sherwinski added a commit that referenced this pull request May 12, 2026
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:

  1. early `return null` for the cache miss / non-push guard
  2. final `return if (divergent) { ... } else { null }`

No behavior change. Addresses review feedback on #2636.

Refs: SDK-4474
@sherwinski sherwinski force-pushed the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch from bda262f to cd1c0a7 Compare May 12, 2026 00:46
Strip "SDK-4388" prefixes from comments in SubscriptionOperationExecutor
and its test class. The descriptive context that followed the ticket id
already explains the why; the ticket id itself is noise for future
maintainers and is preserved in git history / commit messages.
@sherwinski sherwinski force-pushed the ar/sdk-4388-android-sdk-subscription-state-permanently-stuck-at-never branch from 720e7ea to 3117820 Compare May 12, 2026 16:57
sherwinski added a commit that referenced this pull request May 12, 2026
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:

  1. early `return null` for the cache miss / non-push guard
  2. final `return if (divergent) { ... } else { null }`

No behavior change. Addresses review feedback on #2636.

Refs: SDK-4474
@sherwinski sherwinski force-pushed the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch from cd1c0a7 to 5792403 Compare May 12, 2026 16:57
@sherwinski
Copy link
Copy Markdown
Contributor Author

@claude[agent] review

@Claude
Copy link
Copy Markdown

Claude AI commented May 12, 2026

@claude[agent] review

Code review complete. All feedback from previous review rounds has been properly addressed:

✅ Function renamed to buildPushSelfHealOperationForStuckSubscription (5792403)
✅ Log level changed from warn to info (1358cd9)
@Suppress("ReturnCount") removed via 2-return refactor (78f3903)
✅ Ticket references cleaned up from comments/tests (06306d0)

The implementation looks solid:

  • Divergence detection logic is clean and well-isolated in the helper function
  • The 2-return form (return if (divergent) { ... } else null) is more idiomatic than the previous suppressed multi-return version
  • All boundary cases are covered in the test suite (divergence, healthy, opted-out, NO_PERMISSION)
  • "Device as source of truth" policy is correctly preserved—only PUSH subscriptions that are locally enabled but server-disabled trigger reconciliation
  • No unintended side effects; the change is surgical and focused

The code is ready to merge from a quality perspective.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • developer.huawei.com
    • Triggering command: /usr/lib/jvm/temurin-17-jdk-amd64/bin/java /usr/lib/jvm/temurin-17-jdk-amd64/bin/java -XX:MaxMetaspaceSize=1g -XX:+HeapDumpOnOutOfMemoryError --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.prefs/java.util.prefs=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=java.base/java.nio.charset=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED -Xmx4096m -Dfile.encoding=UTF-8 -Duser.country -Duser.language=en -Duser.variant -cp /home/REDACTED/.gradle/wrapper/dists/gradle-8.10.2-all/7iv73wktx1xtkvlq19urqw1wm/gradle-8.10.2/lib/gradle-daemon-main-8.10.2.jar -javaagent:/home/REDACTED/.gradle/wrapper/dists/gradle-8.10.2-all/7iv73wktx1xtkvlq19urqw1wm/gradle-8.10.2/lib/agents/gradle-instrumentation-agent-8.10.2.jar (dns block)
  • dl.google.com
    • Triggering command: /usr/lib/jvm/temurin-17-jdk-amd64/bin/java /usr/lib/jvm/temurin-17-jdk-amd64/bin/java -XX:MaxMetaspaceSize=1g -XX:+HeapDumpOnOutOfMemoryError --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.prefs/java.util.prefs=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=java.base/java.nio.charset=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED -Xmx4096m -Dfile.encoding=UTF-8 -Duser.country -Duser.language=en -Duser.variant -cp /home/REDACTED/.gradle/wrapper/dists/gradle-8.10.2-all/7iv73wktx1xtkvlq19urqw1wm/gradle-8.10.2/lib/gradle-daemon-main-8.10.2.jar -javaagent:/home/REDACTED/.gradle/wrapper/dists/gradle-8.10.2-all/7iv73wktx1xtkvlq19urqw1wm/gradle-8.10.2/lib/agents/gradle-instrumentation-agent-8.10.2.jar (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

…l id

When PATCH /subscriptions/{id} returns 404 outside the missing-retry
window, updateSubscription() already mints a fresh local id for the
recovery CreateSubscriptionOperation to break the PATCH -> 404 loop.
However, the cached SubscriptionModel and ConfigModel.pushSubscriptionId
still hold the stale non-local id, so the rebuild-user path
(LoginUserOperationExecutor.getRebuildOperationsIfCurrentUser) emits a
second CreateSubscriptionOperation carrying that stale id. Both Creates
end up in the POST /users payload and the backend creates two
subscription rows for the same device.

Rewrite the cached SubscriptionModel id and pushSubscriptionId to the
new local id (HYDRATE so no listener reacts) before returning the
recovery op. The rebuild path then emits a Create with the same local
id, dedupe collapses the two ops, and POST /users carries a single
subscription as it does on the un-fixed path.
sherwinski and others added 8 commits May 12, 2026 16:29
… on PATCH 404

Extend the PATCH-404 recovery regression test to assert that the cached
SubscriptionModel id and ConfigModel.pushSubscriptionId are rewritten
to the same fresh local id used on the recovery
CreateSubscriptionOperation. Without this realignment the rebuild-user
path would emit a duplicate Create with the stale non-local id and
LoginUserOperationExecutor's POST /users would create two subscription
rows on the backend for the same device.
…n start

The SDK-4388 fix prevents new installs from getting stuck (POST -> PATCH
flip in SubscriptionOperationExecutor), but already-stuck users on older
SDK builds will not self-heal on a plain in-place upgrade: the buggy POST
already returned 200 {} and the local op queue is empty by the time the
fixed SDK runs.

Add a divergence check to RefreshUserOperationExecutor.getUser. The SDK
already calls GET /users/by/onesignal_id/{osid} on every session start,
and the response carries the server view of `enabled` /
`notification_types`. The "device is source of truth" policy for push
subscriptions is preserved (we still don't hydrate the server's push
subscription into the local model), but when the cached local model
resolves to enabled-and-opted-in while the server's record of the same
subscription id reports disabled, we now emit a follow-up
UpdateSubscriptionOperation. That op runs through the standard executor
path and re-asserts local truth via PATCH /subscriptions/{id} on the
next session start - one extra request per stuck device, no consumer
code changes, no migration script, no toggle.

The existing happy-path test in RefreshUserOperationExecutorTests is
updated to set notificationTypes = 1 on the cached push subscription so
it stays on the new healthy path. Dedicated test coverage for the four
divergence-detection boundary cases is added in a follow-up commit.

Refs: SDK-4388
Add four test cases to RefreshUserOperationExecutorTests for the
self-heal divergence detection added in the previous commit:

1. divergence: server reports the cached push subscription as disabled
   while the local model is opted-in and SUBSCRIBED -> exactly one
   UpdateSubscriptionOperation is returned with the correct ids,
   enabled=true, status=SUBSCRIBED, and the device's local address.

2. healthy: server matches local (enabled=true, notificationTypes=1)
   -> no follow-up op (existing healthy path stays unchanged).

3. opted-out: local optedIn=false even though server is also disabled
   -> no follow-up op (UNSUBSCRIBE is the user's intent, not drift).

4. NO_PERMISSION: local status=NO_PERMISSION with server disabled
   -> no follow-up op (OS-level denial is the real source of truth).

Includes a buildSelfHealHarness helper that wires up the mocks for the
four cases with minimal duplication. The test that triggers the WARN
log line temporarily lowers Logging.logLevel to NONE to avoid the
unmocked android.util.Log.w call in JVM unit tests.

Refs: SDK-4388
…ption`

Co-authored-by: abdulraqeeb33 <abdulraqeeb33@gmail.com>
Detekt's ReturnCount rule defaults to a max of 2; the previous body had
3 (early null/type guard, divergence guard, final result) which forced
the suppression. Collapse the divergence branch into a trailing
`return if (divergent) ... else null` expression so the function uses
exactly two `return` keywords:

  1. early `return null` for the cache miss / non-push guard
  2. final `return if (divergent) { ... } else { null }`

No behavior change. Addresses review feedback on #2636.

Refs: SDK-4474
Strip "SDK-4388" / "SDK-4474" prefixes from comments and test descriptions
in RefreshUserOperationExecutor and its test class. The descriptive context
that followed the ticket id already explains the why; the ticket id itself
is noise for future maintainers and is preserved in git history / commit
messages.
Rename the local variable in RefreshUserOperationExecutor.getUser from
pushSelfHealOperation -> pushSelfHealOperationForStuckSubscription so it
matches the builder method buildPushSelfHealOperationForStuckSubscription
that produces it. Pure rename, no behavior change.
@sherwinski sherwinski force-pushed the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch from 5792403 to 58c51d1 Compare May 12, 2026 23:32
Base automatically changed from ar/sdk-4388-android-sdk-subscription-state-permanently-stuck-at-never to main May 15, 2026 21:54
@sherwinski sherwinski merged commit 67cd220 into main May 15, 2026
3 checks passed
@sherwinski sherwinski deleted the sherwin/sdk-4474-self-heal-stuck-push-subscriptions branch May 15, 2026 21:54
@github-actions github-actions Bot mentioned this pull request May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants