diff --git a/pgcopydb-helpers/AGENTS.md b/pgcopydb-helpers/AGENTS.md index 14d5f88..148976a 100644 --- a/pgcopydb-helpers/AGENTS.md +++ b/pgcopydb-helpers/AGENTS.md @@ -256,6 +256,35 @@ Displays a full migration progress dashboard: phase completion status, table/ind --- +#### `check-copy-stall.sh` + +Diagnoses a stalled or slow COPY by inspecting live session, lock, and wait state on the **target** database. Use it when `check-migration-status.sh` shows a COPY that does not seem to be progressing, to determine *why*. + +```bash +~/check-copy-stall.sh +~/check-copy-stall.sh --no-sample # skip the throughput sample +~/check-copy-stall.sh --sample-secs 15 # longer throughput window +``` + +**Sections:** +1. **Blocking tree** — sessions blocked on a lock and the session holding it (`pg_blocking_pids`). Empty = nothing is lock-blocked. +2. **Ungranted locks** — anything waiting to acquire a lock. +3. **Wait-event summary** — histogram of backends by `wait_event_type`/`wait_event`, to classify the limiter (source feed vs disk IO vs WAL vs lock vs replication). +4. **Active COPY operations** — `pg_stat_progress_copy` with wait state and age, oldest first. +5. **Running vacuums** — `pg_stat_progress_vacuum` with phase and % scanned. +6. **Idle-in-transaction** — parked worker connections that pin the xmin horizon. +7. **Ingest throughput** — WAL written and DB growth sampled over a few seconds (proves data is landing even if one stream looks idle). + +**Interpreting it:** The most common cause of an apparent stall is *not* a lock. A COPY backend in `wait_event = ClientRead` is waiting on the source feed (pgcopydb / source / network), not on the target. Autovacuum cannot lock-block a COPY (`ShareUpdateExclusiveLock` vs `RowExclusiveLock` don't conflict) but a full-table vacuum on a table being loaded steals IO/WAL bandwidth and throttles throughput — mitigate with per-table `ALTER TABLE ... SET (autovacuum_enabled = false)` during the load, re-enabled + `ANALYZE` after. pgcopydb cycles parts through a fixed worker pool, so COPY pids appear and disappear normally; judge progress by section 7 and changing row counts, not a single per-pid snapshot. + +**When to use:** Whenever a migration looks stalled or unusually slow during the COPY phase. Run it a couple of times to compare. + +**Requires:** `PGCOPYDB_TARGET_PGURI` + +**Read-only** — connects with `default_transaction_read_only=on`, a statement timeout, and a lock timeout; makes no modifications to the target. + +--- + #### `check-cdc-status.sh` Displays CDC-specific replication progress: apply and streaming LSN positions, backlog gap, apply rate, ETA to catch-up, and source replication slot health. diff --git a/pgcopydb-helpers/README.md b/pgcopydb-helpers/README.md index dff4286..9bcd019 100644 --- a/pgcopydb-helpers/README.md +++ b/pgcopydb-helpers/README.md @@ -178,6 +178,14 @@ Check overall migration progress (copy, indexes, constraints, vacuum) and see ac ~/check-migration-status.sh ``` +If a COPY looks stalled or unusually slow during the initial copy, diagnose why on the target — blocking locks, wait events, competing vacuums, and whether data is still landing: + +```bash +~/check-copy-stall.sh +``` + +It is read-only and reports whether anything is actually lock-blocked (usually nothing is — a COPY waiting in `ClientRead` is waiting on the source feed, not the target) and whether autovacuum is throttling throughput. + Once the initial copy completes and CDC is streaming, check replication progress: ```bash @@ -412,6 +420,7 @@ sqlite3 ~/migration_*/schema/filter.db "SELECT COUNT(*) FROM s_depend;" | `run-migration.sh` | Migrate | Start a pgcopydb clone --follow migration | | `start-migration-screen.sh` | Migrate | Run the migration in a detached screen session. | | `check-migration-status.sh` | Monitor | Migration progress dashboard | +| `check-copy-stall.sh` | Monitor | Diagnose a stalled/slow COPY on the target (locks, waits, vacuums, throughput) | | `check-cdc-status.sh` | Monitor | CDC replication progress and health | | `slack-migration-alerts.sh` | Monitor | Slack alerts | | `resume-migration.sh` | Recovery | Resume an interrupted migration (full clone + CDC) | diff --git a/pgcopydb-helpers/check-copy-stall.sh b/pgcopydb-helpers/check-copy-stall.sh new file mode 100755 index 0000000..7f2bc1f --- /dev/null +++ b/pgcopydb-helpers/check-copy-stall.sh @@ -0,0 +1,265 @@ +#!/bin/bash +# +# Usage: ~/check-copy-stall.sh [--no-sample] [--sample-secs N] +# +# Diagnoses a stalled or slow pgcopydb COPY by inspecting live session, lock, +# and wait state on the TARGET (PlanetScale) database. Answers the two questions +# that matter when a migration looks stuck: +# +# 1. Is anything actually blocked on a lock (and who holds it)? +# 2. Where is time being spent — waiting on the client/source feed, on disk +# IO, on WAL, or on a lock — and is data still landing at all? +# +# Read-only: connects with default_transaction_read_only=on and a statement +# timeout. Makes no modifications to the target. Run it from the migration +# instance, the same place the migration runs. +# +# Requires: PGCOPYDB_TARGET_PGURI (from ~/.env) +# +set -eo pipefail + +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +# --- Options --- +SAMPLE=true +SAMPLE_SECS=5 +while [ $# -gt 0 ]; do + case "$1" in + --no-sample) SAMPLE=false; shift ;; + --sample-secs) + SAMPLE_SECS="${2:-5}" + [ "$SAMPLE_SECS" -gt 0 ] 2>/dev/null || { echo "--sample-secs must be a positive integer" >&2; exit 1; } + shift 2 ;; + -h|--help) + sed -n '2,/^set -eo/p' "$0" | sed 's/^# \{0,1\}//; /^set -eo/d' + exit 0 ;; + *) echo "Unknown option: $1" >&2; exit 1 ;; + esac +done + +# --- Load environment --- +set +u +set -a +# shellcheck disable=SC1090 +source ~/.env +set +a +set -u + +if [ -z "${PGCOPYDB_TARGET_PGURI:-}" ]; then + echo -e "${RED}✗ PGCOPYDB_TARGET_PGURI is not set (check ~/.env)${NC}" + exit 1 +fi + +# Read-only, time-bounded psql. default_transaction_read_only guarantees the +# session cannot write even if a query were changed to do so; statement_timeout +# and lock_timeout keep a diagnostic from ever hanging on a busy target. +PGOPTS='-c default_transaction_read_only=on -c statement_timeout=30000 -c lock_timeout=5000' +run_table() { PGOPTIONS="$PGOPTS" psql "$PGCOPYDB_TARGET_PGURI" -X -q -P pager=off -P footer=off -c "$1" || true; } +run_scalar() { PGOPTIONS="$PGOPTS" psql "$PGCOPYDB_TARGET_PGURI" -X -q -t -A -c "$1" 2>/dev/null || echo ""; } + +echo -e "${BLUE}╔════════════════════════════════════════════════════════════════╗${NC}" +echo -e "${BLUE}║ PlanetScale Migration — COPY Stall Diagnostics (target side) ║${NC}" +echo -e "${BLUE}╚════════════════════════════════════════════════════════════════╝${NC}" +echo "" + +# --- Connectivity --- +TARGET_DB=$(run_scalar "SELECT current_database()") +if [ -z "$TARGET_DB" ]; then + echo -e "${RED}✗ Could not connect to the target database via PGCOPYDB_TARGET_PGURI${NC}" + echo " Check the URI in ~/.env and network access from this instance." + exit 1 +fi +echo "Target database: $TARGET_DB" +echo "Generated: $(date '+%Y-%m-%d %H:%M:%S %Z')" +echo "" + +# ── 1. Is anything lock-blocked? (the headline question) ── +echo "────────────────────────────────────────────────────────────────" +echo "1. BLOCKING TREE (who is waiting on a lock, and who holds it)" +echo "────────────────────────────────────────────────────────────────" +BLOCKED_COUNT=$(run_scalar "SELECT count(*) FROM pg_stat_activity b WHERE b.pid <> pg_backend_pid() AND cardinality(pg_blocking_pids(b.pid)) > 0") +BLOCKED_COUNT=$(echo "${BLOCKED_COUNT:-0}" | tr -d '[:space:]') +if [ "${BLOCKED_COUNT:-0}" -gt 0 ] 2>/dev/null; then + echo -e "${RED}✗ $BLOCKED_COUNT session(s) blocked on a lock:${NC}" + run_table " + SELECT blocked.pid AS blocked_pid, + coalesce(blocked.wait_event_type||':'||blocked.wait_event,'-') AS waiting_on, + blocking.pid AS blocking_pid, + blocking.state AS blocker_state, + (now()-blocking.xact_start)::interval(0) AS blocker_xact_age, + left(regexp_replace(blocked.query, E'[\\n\\r]+',' ','g'),45) AS blocked_query, + left(regexp_replace(blocking.query,E'[\\n\\r]+',' ','g'),45) AS blocker_query + FROM pg_stat_activity blocked + JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid)) + WHERE blocked.pid <> pg_backend_pid() + ORDER BY blocked.pid;" + echo "" + echo " → A COPY here is genuinely lock-blocked. Look at blocking_pid:" + echo " if it is 'idle in transaction', a stuck client transaction is the cause." +else + echo -e "${GREEN}✓ Nothing is lock-blocked.${NC} No session is waiting on a lock held by another." +fi +echo "" + +# ── 2. Locks that haven't been granted yet ── +echo "────────────────────────────────────────────────────────────────" +echo "2. UNGRANTED LOCKS (sessions waiting to acquire a lock)" +echo "────────────────────────────────────────────────────────────────" +UNGRANTED=$(run_scalar "SELECT count(*) FROM pg_locks WHERE NOT granted") +UNGRANTED=$(echo "${UNGRANTED:-0}" | tr -d '[:space:]') +if [ "${UNGRANTED:-0}" -gt 0 ] 2>/dev/null; then + echo -e "${YELLOW}$UNGRANTED ungranted lock request(s):${NC}" + run_table " + SELECT a.pid, + coalesce(a.wait_event_type||':'||a.wait_event,'-') AS wait, + l.locktype, l.mode, + coalesce(l.relation::regclass::text,'-') AS relation, + pg_blocking_pids(a.pid) AS blocked_by, + left(regexp_replace(a.query,E'[\\n\\r]+',' ','g'),40) AS query + FROM pg_stat_activity a + JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted + ORDER BY a.pid;" +else + echo -e "${GREEN}✓ No ungranted locks.${NC} Nothing is queued waiting to acquire a lock." +fi +echo "" + +# ── 3. Where is everyone spending time? ── +echo "────────────────────────────────────────────────────────────────" +echo "3. WAIT-EVENT SUMMARY (what the backends are doing right now)" +echo "────────────────────────────────────────────────────────────────" +run_table " + SELECT backend_type, + coalesce(state,'-') AS state, + coalesce(wait_event_type,'(running)') AS wait_type, + coalesce(wait_event,'-') AS wait_event, + count(*) AS sessions + FROM pg_stat_activity + WHERE backend_type IS NOT NULL + GROUP BY 1,2,3,4 + ORDER BY sessions DESC;" +echo "" +echo " Reading it: Client/ClientRead on COPYs = waiting on the source feed" +echo " (pgcopydb / source / network — NOT a target-side problem). IO/DataFileRead" +echo " or IO/DataFileWrite = disk-bound. IO/WALWrite = WAL-bound. Lock/LWLock =" +echo " contention — chase it in sections 1-2. IPC/SyncRep = waiting on a replica ack." +echo "" + +# ── 4. Active COPY operations and their progress ── +echo "────────────────────────────────────────────────────────────────" +echo "4. ACTIVE COPY OPERATIONS (oldest first = stall suspects)" +echo "────────────────────────────────────────────────────────────────" +COPY_COUNT=$(run_scalar "SELECT count(*) FROM pg_stat_progress_copy") +COPY_COUNT=$(echo "${COPY_COUNT:-0}" | tr -d '[:space:]') +if [ "${COPY_COUNT:-0}" -gt 0 ] 2>/dev/null; then + run_table " + SELECT p.pid, + p.relid::regclass AS target_table, + pg_size_pretty(p.bytes_processed) AS copied, + p.tuples_processed AS rows, + coalesce(a.wait_event_type||':'||a.wait_event,'(running)') AS wait, + (now()-a.query_start)::interval(0) AS copy_age + FROM pg_stat_progress_copy p + JOIN pg_stat_activity a USING (pid) + ORDER BY a.query_start;" + echo "" + echo " bytes_total is 0 for COPY FROM STDIN (streamed), so judge progress by" + echo " rows/copied changing between runs — not by a single snapshot. pgcopydb" + echo " cycles parts through its worker pool, so pids come and go normally." +else + echo " No COPY operations currently running on the target." +fi +echo "" + +# ── 5. Vacuums competing with the load ── +echo "────────────────────────────────────────────────────────────────" +echo "5. RUNNING VACUUMS (autovacuum contends for IO/WAL during a load)" +echo "────────────────────────────────────────────────────────────────" +VAC_COUNT=$(run_scalar "SELECT count(*) FROM pg_stat_progress_vacuum") +VAC_COUNT=$(echo "${VAC_COUNT:-0}" | tr -d '[:space:]') +if [ "${VAC_COUNT:-0}" -gt 0 ] 2>/dev/null; then + run_table " + SELECT a.pid, + p.relid::regclass AS table, + p.phase, + coalesce(a.wait_event,'(running)') AS wait, + CASE WHEN p.heap_blks_total > 0 + THEN round(100.0*p.heap_blks_scanned/p.heap_blks_total,1) END AS pct_scanned, + (now()-a.xact_start)::interval(0) AS age + FROM pg_stat_progress_vacuum p + JOIN pg_stat_activity a USING (pid) + ORDER BY (now()-a.xact_start) DESC;" + echo "" + echo " Autovacuum does NOT lock-block a COPY (lock modes don't conflict), but a" + echo " full-table vacuum on a table being loaded steals IO/WAL bandwidth and can" + echo " throttle throughput. To reduce it during the load, on the target:" + echo " ALTER TABLE