
chore(deps): bump go-openaudio ETL to halt-on-block-error (#323)#883

Merged: raymondjacobson merged 1 commit into main from api/bump-etl-halt-on-error on May 29, 2026
@raymondjacobson

Summary

Picks up OpenAudio/go-openaudio#323.

Before that change, a processBlock failure in pkg/etl was swallowed with continue, and the indexer advanced past the failed block. The indexed tip is read as MAX(height) over core_indexed_blocks, so the api's health_check?max_core_indexer_block_diff=... probes look healthy while the skipped block is silently lost forever. The prefetcher works forward from the indexed tip, so it never re-hands a skipped block. That's the same pattern that produced the ray52726 dropped-signup we tracked down earlier this week (blocks_number_key collision on the per-block tx → continue → silent gap).
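To make the masking effect concrete, here is a minimal sketch of why a tip-based probe stays green over a gap. The helper names `tipLag` and `missingHeights` are hypothetical and not part of the api or pkg/etl; the heights are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// tipLag mimics a MAX(height)-style probe: it compares only the highest
// indexed height to the chain tip. Hypothetical helper, not the api's code.
func tipLag(indexed []int64, chainTip int64) int64 {
	var max int64
	for _, h := range indexed {
		if h > max {
			max = h
		}
	}
	return chainTip - max
}

// missingHeights returns every height between the lowest and highest
// indexed height that is absent from indexed.
func missingHeights(indexed []int64) []int64 {
	if len(indexed) == 0 {
		return nil
	}
	sorted := append([]int64(nil), indexed...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	var gaps []int64
	for i := 1; i < len(sorted); i++ {
		for h := sorted[i-1] + 1; h < sorted[i]; h++ {
			gaps = append(gaps, h)
		}
	}
	return gaps
}

func main() {
	indexed := []int64{100, 101, 103, 104} // height 102 was skipped
	fmt.Println(tipLag(indexed, 104))      // 0: the probe sees no lag
	fmt.Println(missingHeights(indexed))   // [102]: the gap is still there
}
```

A contiguity check along these lines is one way to catch gaps that a tip-only probe cannot see; whether it is worth running in production is a separate question.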

After this bump, processBlock failure causes indexBlocks() to return the error. The pod crash-restart path picks up from MAX(core_indexed_blocks.height), the prefetcher re-hands the previously-failed block, and we retry it. Transient failures self-heal on one restart; persistent corruption crashloops loudly.

Bump details

| | from | to |
|---|---|---|
| github.com/OpenAudio/go-openaudio | v1.3.1-0.20260529221831-4d1c9dfdfb52 | v1.3.1-0.20260529230137-819100b28c94 |
| github.com/OpenAudio/go-openaudio/pkg/etl | v1.3.1-0.20260529221831-4d1c9dfdfb52 | v1.3.1-0.20260529230137-819100b28c94 |

gh api .../compare/4d1c9dfdfb52...819100b2 confirms this is one commit only: #323's merge, no drive-bys.

Operational note

This trades "silent data loss" for "loud crashloop on persistent corruption." Worth confirming that pod-restart / unhealthy-pod alerting on core-indexer is wired so the loud failure doesn't also go unnoticed. If it isn't, that's a quick follow-up — but even without it this PR is strictly better than today.

Test plan

  • go build ./..., go vet ./... clean.
  • go mod tidy clean (no transitive surprises).
  • ./indexer/... tests pass (the existing user_pubkey + user_events hook tests exercise the pkg/etl integration path).
  • After deploy, confirm blocks_behind stays at 0 (no behavior change on the happy path).
  • Confirm operator alerts fire on a synthetic crashloop, if not already verified.

🤖 Generated with Claude Code

Picks up OpenAudio/go-openaudio#323. Before this, a processBlock failure in
pkg/etl was swallowed with `continue`, advancing past the failed block. That
leaves a hidden core_indexed_blocks gap because MAX(height) jumps ahead, so
the api/health_check block_diff probes look healthy while the skipped block
is silently lost forever — the same pattern that produced the ray52726
dropped-signup earlier this week.

After this bump, processBlock failure halts indexBlocks() and returns the
error. The pod crash-restart path picks up from MAX(core_indexed_blocks.height),
the prefetcher re-hands us the failed block, and we retry. Transient failures
self-heal on one restart; persistent corruption crashloops loudly (worth
verifying pod-restart alerting is wired — covered in the #323 review notes).

Bump is one-commit-clean (4d1c9dfdfb52..819100b28c94).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raymondjacobson raymondjacobson merged commit 8254c70 into main May 29, 2026
5 checks passed
@raymondjacobson raymondjacobson deleted the api/bump-etl-halt-on-error branch May 29, 2026 23:17
raymondjacobson added a commit that referenced this pull request May 30, 2026
…)" (#885)

## Summary

Reverts #883, pinning go-openaudio back to
`v1.3.1-0.20260529221831-4d1c9dfdfb52`.

The halt-on-error behavior from upstream go-openaudio#323 is correct in
isolation, but is **incompatible with the current dual-run state**:
Python and api-side ETL both write to overlapping tables, and the
on-chain plays bridge from #881 doesn't ON CONFLICT-protect against rows
Python has already written. So:

- Pre-#883: the failure was silently swallowed by `continue` — ETL was
effectively a no-op on essentially every block since #881 deployed, but
block_diff stayed green because Python's writes kept
`MAX(blocks.height)` moving. Block-level data loss masked by Python
carrying the load.
- Post-#883: the same failure crashes the indexing loop. We saw it
tonight at `processBlock failed` on block 25415514, reproducibly across
pod restarts because Python writes the same plays in the same block
before the ETL gets to it. Once #884 (the api-wrapper fix that makes
that halt actually exit the process) ships, every pod would crashloop
the moment it tries to index any recent block.

So shipping #883 + #884 without first handling the cross-writer
collision points would convert today's silent wedge into a continuous
outage that takes the parity jobs (`IndexChallengesJob`,
`UserListeningHistory`, `HourlyPlayCounts`, etc.) down with the ETL.
Strictly worse.

## Plan

1. **This PR**: pin upstream back to the pre-halt version. Today's
silent wedge stays in place — bad, but bounded — and the parity jobs
keep ticking.
2. Close #884 (already done). The diagnosis there is correct, but it
amplifies #883's bad sequencing, so we re-land it after #883 is safe to
re-ship.
3. Revert OpenAudio/go-openaudio#323 upstream too, so no future bump
trips this accidentally.
4. **Audit + fix the cross-writer collision points in pkg/etl** — start
with the plays bridge (#881), apply the same ON CONFLICT pattern #319
used for the `blocks` table. Then sweep anywhere else ETL and Python
touch the same row.
5. Re-land go-openaudio#323, then api#883, then api#884 (in that order).
At that point the halt-on-error guarantee is honest.

## Bump details (revert direction)

| | from | to |
|---|---|---|
| `github.com/OpenAudio/go-openaudio` | `v1.3.1-0.20260529230137-819100b28c94` | `v1.3.1-0.20260529221831-4d1c9dfdfb52` |
| `github.com/OpenAudio/go-openaudio/pkg/etl` | `v1.3.1-0.20260529230137-819100b28c94` | `v1.3.1-0.20260529221831-4d1c9dfdfb52` |

## Test plan

- [x] `go build ./...` clean.
- [ ] After deploy: confirm new pod boots, no `processBlock failed` halt
log on block 25415514 (it'll go back to silent `continue`).
- [ ] Verify parity jobs still tick and block_diff stays at 0 (no
functional change vs. pre-#883 prod).

🤖 Generated with [Claude Code](https://claude.com/claude-code)