From 34a211f24e750b2fbe6887aef34eb3e1fe7d300e Mon Sep 17 00:00:00 2001 From: Craig Soules Date: Sun, 10 May 2026 14:14:48 -0700 Subject: [PATCH 1/4] Docs for the coordinator --- python/coordinator/README.md | 301 +++++++++++++++++++++++++++++++++++ 1 file changed, 301 insertions(+) create mode 100644 python/coordinator/README.md diff --git a/python/coordinator/README.md b/python/coordinator/README.md new file mode 100644 index 000000000..c52f7f675 --- /dev/null +++ b/python/coordinator/README.md @@ -0,0 +1,301 @@ +# Springtail Coordinator + +The coordinator (`coordinator.py`) is the supervisor process that manages the +lifecycle of every springtail service running on a host. One coordinator runs +per host and is responsible for a single service type: `ingestion`, `fdw`, or +`proxy`. + +The coordinator: + +1. Downloads and installs binaries from S3 (production only). +2. Registers the components for its service type with the `Scheduler`. +3. Starts the components in dependency order. +4. Monitors components via PID checks, Redis liveness timeouts, and a Redis + pub/sub channel, restarting on failure. +5. Reacts to state transitions written to Redis (`coordinator_state` and, for + FDW, `fdw_state`) so that operators can drive lifecycle changes from + outside the host. + +State and orchestration are coordinated through Redis. The coordinator +reads/writes two state values that drive the four operations described in +this document: + +| Key | Field | Type | Values | +| --- | --- | --- | --- | +| `:coordinator_state` | `:` | `CoordinatorState` | `startup`, `reloading`, `running`, `reload`, `shutdown`, `dead` | +| `:fdw` | `` (JSON `state` field) | string | `initialize`, `running`, `draining`, `stopped` | + +--- + +## 1. Starting Springtail + +A springtail database instance consists of three service tiers, each launched +on its own host(s) by an independent coordinator process: + +| Service | Components started (in order) | +| --- | --- | +| `ingestion` | `pg_log_mgr_daemon` | +| `fdw` | `postgres` → `pg_xid_subscriber_daemon` → `pg_ddl_daemon` | +| `proxy` | `proxy` | + +### Launch command + +The coordinator is launched as: + +```bash +python coordinator.py -c config.yaml -s [--debug] [--manual] [-f /path/to/postmaster.pid] +``` + +In production it is normally started by `systemd` as the +`springtail-coordinator` unit; the same command runs interactively from the +shell for development. + +Flags: + +- `-c, --config-file` — path to the local `config.yaml` (defaults to + `config.yaml` in the working directory). The YAML supplies + `system_json_path`, `install_dir`, the `production` flag, and log rotation + settings. +- `-s, --service` — one of `ingestion`, `fdw`, or `proxy`. May be supplied via + the `SERVICE_NAME` environment variable instead. +- `--debug` — verbose logging. +- `--manual` — in production mode, skip the loader self-restart so the + coordinator can be driven from a shell. +- `-f, --postgres-pid-file` — explicit path to the postgres PID file (FDW + service only). + +### Startup sequence + +`Coordinator.startup()` (in `coordinator.py`) executes: + +1. **Read coordinator state from Redis.** The state field is keyed by + `:` so each coordinator on the host has its own + record. +2. **Install binaries (production only).** If state is `STARTUP`, the + `Production` helper downloads the latest `springtail-*.tgz` from + `s3:///packages/`, validates the package's `Config Hash` in + `INFO.txt` against the value Redis has for the deployment, and rsyncs it + to `install_dir`. 
The state is then advanced to `RELOADING` and (unless + `--manual` was passed) `loader.startup` is invoked to relaunch the + coordinator from the freshly installed code, after which the current + process exits. The loader will bring the coordinator back up under the + new binary set. +3. **Sanity-check paths.** The mount path, log path, and `/bin/system` + directory are verified to exist. +4. **Stop stragglers.** `stop_daemons` kills any orphaned daemons listed in + `ALL_DAEMONS` so the coordinator starts from a clean slate. +5. **Build the scheduler and register components.** A `ComponentFactory` + produces the `Component` objects for the selected service. For `fdw`, the + coordinator first calls `Production.install_pgfdw` to install the + `springtail_fdw` extension into the local PostgreSQL, then waits for the + ingestion service to be reachable (pings `XidMgrClient` and + `SysTblMgrClient` until both respond) before starting postgres and the + FDW daemons. +6. **Start components in order.** `Scheduler.start_all` launches each + component by ascending `startup_order` and waits for each to be running + before moving on. Once everything is up, the coordinator state is set to + `RUNNING`. +7. **Enter the monitor loop.** `Scheduler.monitor_timeouts` loops every + ~5 seconds: + - Reacts to coordinator-state changes (see "Reload" and "Stopping" below). + - Tracks per-database state changes and emits SNS notifications. + - Reads the liveness hash in Redis and flags components whose + heartbeat is older than `allowed_timeout` (40s by default). + - Drains the `pubsub:liveness_notify` channel for failure messages from + the daemons. + - Calls `Component.is_alive()` on every registered component. + - On any failure, calls `Scheduler.restart_all` (subject to the + `MAX_FAILURES` / `FAILURE_WINDOW_THRESHOLD` rate limiter — 5 failures + within 5 seconds aborts the loop and waits 5 minutes before retrying). + - For the FDW service, also runs `_process_fdw_state_change` to honor the + external "remove replica" flow described below. + +Once `monitor_timeouts` exits, `shutdown_all` is invoked, the coordinator +state is set to `DEAD`, and the process exits. + +### Reloading binaries on a running coordinator + +To pick up a new build without manually restarting, set the coordinator state +to `RELOAD`: + +```python +props.set_coordinator_state(CoordinatorState.RELOAD) +``` + +The monitor loop's `_check_coordinator_state` will: + +1. Call `shutdown_all` on every registered component. +2. Re-run `install_binaries` (and `install_pgfdw` for the FDW service). +3. Call `restart_all` to bring the components back under the new binaries. +4. Set the state back to `RUNNING`. + +--- + +## 2. Adding Replica Nodes (FDW Hosts) + +Springtail "replica nodes" are FDW hosts: instances running the `fdw` service +that expose a PostgreSQL endpoint backed by the springtail foreign data +wrapper. Each FDW host is identified by a unique `FDW_ID` and has its own +config record under the Redis hash `:fdw`. + +The coordinator does not provision new hosts itself — that is the +responsibility of the controller / infrastructure layer. From the +coordinator's perspective, adding a replica is simply "stand up a new host +and start an FDW coordinator on it." The flow is: + +1. **Provision an FDW config record.** Add the new FDW to + `system.json` under `fdws` (or insert it directly into Redis at + `:fdw` and `:fdw_ids`). Newly added FDW + configs are written with `state = "initialize"` (see + `Properties._load_redis`). +2. 
**Provision the host.** Bring up an EC2 instance (or equivalent) that + has PostgreSQL installed locally and has the springtail environment + variables set. The coordinator requires: + - `SERVICE_NAME=fdw` + - `FDW_ID=` + - `INSTANCE_KEY` (used as part of the coordinator-state record key) + - The standard Redis / AWS / mount-path env vars + (`REDIS_HOSTNAME`, `REDIS_PORT`, `REDIS_USER`, `REDIS_PASSWORD`, + `REDIS_USER_DATABASE_ID`, `REDIS_CONFIG_DATABASE_ID`, + `ORGANIZATION_ID`, `ACCOUNT_ID`, `DATABASE_INSTANCE_ID`, + `LUSTRE_DNS_NAME`, `LUSTRE_MOUNT_NAME`, `MOUNT_POINT`, + `SNS_TOPIC_ARN`, `AWS_REGION`). +3. **Start the coordinator on the new host.** Either via `systemctl start + springtail-coordinator` (production) or directly: + + ```bash + python coordinator.py -c config.yaml -s fdw + ``` + + The coordinator will run the standard FDW startup sequence: + + - Download the binaries from S3 and install them locally + (`Production.install_binaries`). + - Install the `springtail_fdw` extension into the host's PostgreSQL + and rewrite `pg_hba.conf` / the postgres environment file + (`Production.install_pgfdw`). + - Wait for the ingestion service to be reachable. + - Start `postgres`, `pg_xid_subscriber_daemon`, and `pg_ddl_daemon`. + - Set the coordinator state to `RUNNING`. + + The FDW daemons themselves transition the FDW config's `state` field + away from `initialize` (toward `running`) once they have synchronized + their state. +4. **Route traffic.** Once the new FDW is healthy, the proxy can begin + routing connections to it. The proxy reads its target list from Redis, + so no coordinator-side action is needed beyond the new FDW reaching + `running`. + +There is no "add replica" RPC inside the coordinator: each FDW is added by +spinning up a new host and pointing a new coordinator at the right +`FDW_ID`. + +--- + +## 3. Removing Replica Nodes (Draining an FDW Host) + +Removal is the case the coordinator itself implements explicitly, in +`Scheduler._process_fdw_state_change` (called every monitor iteration when +the service type is `fdw`). The flow is operator-driven via the FDW state +field: + +1. **Operator marks the FDW as draining.** Externally (controller, admin + tool, or `Properties.set_fdw_state('draining')`), set the FDW config's + `state` to `draining` in Redis at `:fdw[]`. +2. **Coordinator detects the draining state.** On its next monitor tick the + FDW coordinator reads the FDW config (`get_fdw_config(nocache=True)`), + sees `state == 'draining'`, and enters the drain path. +3. **Wait for connections to drain.** The coordinator polls + `PostgresComponent.get_connection_count()` (which runs + `select count(client_port) from pg_stat_activity where client_port != -1`) + every 5 seconds and only proceeds once it returns 0. While waiting the + proxy is responsible for steering new connections away from this FDW. +4. **Shut down all components.** `Scheduler.shutdown_all` stops `pg_ddl_daemon`, + `pg_xid_subscriber_daemon`, and `postgres` in reverse startup order. +5. **Wait for `stopped`.** The coordinator calls + `Properties.wait_for_fdw_state('stopped', 30)`. The expectation is that + the operator (or controller) advances the FDW state to `stopped` once + they have confirmed the host is quiesced. If the wait times out, the + coordinator forces the state to `stopped` itself. +6. **Clean up Redis state.** The coordinator removes everything in Redis + that this FDW owned: + - `DELETE :queue:ddl:fdw:` (DDL work queue). 
+ - `HDEL` matching members of `:hash:ddl:fdw` whose key + ends in `:`. + - `HDEL` matching members of `:fdw_min_xids` whose key + starts with `:`. + - `SREM` matching members of `:fdw_pids` whose value + starts with `:`. +7. **Stay idle.** The coordinator sets `services_stopped = True`. The + monitor loop continues running but skips its component checks + (`IDLE_SLEEP_INTERVAL = 5s`), so the host can be safely terminated by + the controller. To fully stop the coordinator, follow the "Stopping + Springtail" flow below. + +The FDW config record itself is left in Redis — the controller is +responsible for removing it from `system.json`/`:fdw` once +the host is gone if the FDW should not be re-used. + +--- + +## 4. Stopping Springtail + +There are two ways to stop a coordinator and its components, depending on +whether the stop should be initiated locally (process signal) or remotely +(Redis state). + +### A. Local stop (signals) + +The coordinator installs handlers for `SIGINT` and `SIGTERM` +(`make_signal_handler` in `coordinator.py`). Either signal: + +1. Sets `coordinator.shutdown_event`, which causes the startup wait loops + (e.g. `_wait_for_ingestion`) to exit early. +2. Calls `Scheduler.shutdown()`, which sets the scheduler's + `shutdown_event` and breaks `monitor_timeouts` out of its loop. +3. Once the loop exits, `Scheduler.shutdown_all` shuts down components in + reverse startup order (`shutdown` first, then `kill` if the graceful + shutdown times out). +4. The Redis pub/sub subscription and connection are closed, and the + coordinator state is set to `DEAD`. +5. After `Coordinator.startup()` returns, `coordinator.py`'s `__main__` + calls `stop_daemons` for `ALL_DAEMONS` to be sure no orphans are left, + then sends an SNS `shutdown` notification (production only). + +In production, this is normally done with: + +```bash +sudo systemctl stop springtail-coordinator +``` + +### B. Remote stop (coordinator state) + +To stop a running coordinator from outside the host, write `SHUTDOWN` to +Redis: + +```python +props.set_coordinator_state(CoordinatorState.SHUTDOWN) +``` + +On the next monitor tick `_check_coordinator_state` matches `SHUTDOWN` and +calls `Scheduler.shutdown()`, which then follows the same teardown path as +the signal-driven case (reverse-order component shutdown, pub/sub close, +state set to `DEAD`). + +### C. Stopping the whole instance + +To stop springtail across all hosts: + +1. **Drain proxies first** by stopping their coordinators (signal or + Redis-state). With proxies down, no new client traffic can reach FDW + hosts. +2. **Drain each FDW host** using the "Removing Replica Nodes" flow + (`set_fdw_state('draining')`), which gracefully waits for in-flight + connections, shuts down `pg_ddl_daemon`, `pg_xid_subscriber_daemon`, and + `postgres`, and cleans up the per-FDW Redis state. Then stop each FDW + coordinator (signal or `coordinator_state = SHUTDOWN`). +3. **Stop the ingestion coordinator** (signal or + `coordinator_state = SHUTDOWN`); this brings down `pg_log_mgr_daemon`. + +When every coordinator has reached the `DEAD` state in +`:coordinator_state`, the instance is fully stopped. 
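
As a concrete illustration of the flow above, the sketch below drives a whole-instance stop purely through Redis state from a controller host. It is a minimal example, not the controller implementation: it assumes one `Properties` handle per coordinator has already been constructed (construction is not covered in this README), that the import path and a `get_coordinator_state()` read accessor exist as named here, and that the `wait_for_fdw_state` timeout is in seconds — adjust all of these to the real `Properties` API.

```python
import time

# Assumed import path -- substitute the real module that defines these.
from properties import CoordinatorState


def stop_instance(proxy_props, fdw_props, ingestion_props, poll_secs=5):
    """Remote stop of a whole springtail instance via coordinator/FDW state."""
    # 1. Stop the proxies first so no new client traffic reaches the FDW hosts.
    for props in proxy_props:
        props.set_coordinator_state(CoordinatorState.SHUTDOWN)

    # 2. Drain each FDW host, then stop its coordinator.
    for props in fdw_props:
        props.set_fdw_state('draining')
        # The FDW coordinator waits for connections to reach zero, shuts its
        # components down, and forces the state to 'stopped' if nobody else
        # advances it; waiting here simply confirms the drain has finished.
        # (600 is a placeholder drain timeout, assumed to be in seconds.)
        props.wait_for_fdw_state('stopped', 600)
        props.set_coordinator_state(CoordinatorState.SHUTDOWN)

    # 3. Stop ingestion last; this brings down pg_log_mgr_daemon.
    ingestion_props.set_coordinator_state(CoordinatorState.SHUTDOWN)

    # 4. Wait until every coordinator has written DEAD to its state record.
    #    get_coordinator_state() is an assumed read accessor paired with
    #    set_coordinator_state(); substitute however state is actually read.
    everyone = list(proxy_props) + list(fdw_props) + [ingestion_props]
    while not all(p.get_coordinator_state() == CoordinatorState.DEAD for p in everyone):
        time.sleep(poll_secs)
```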
From fe77f6d6fb736487746c511275f5ba5ecf989029 Mon Sep 17 00:00:00 2001 From: Craig Soules Date: Sun, 10 May 2026 14:20:25 -0700 Subject: [PATCH 2/4] Add reference to the coordinator README at the top level --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index b9a235082..e790ba31d 100644 --- a/README.md +++ b/README.md @@ -255,6 +255,12 @@ psql -h localhost -p 55432 -U postgres ./cluster down all ``` +## Quick Start: Production Deployment + +In a production deployment, each Springtail host runs a `coordinator` process that supervises the daemons for one service tier (`ingestion`, `fdw`, or `proxy`), installs binaries from S3, monitors component liveness via Redis, and reacts to lifecycle state changes (startup, reload, drain, shutdown) driven by the controller. + +For details on starting Springtail, adding replica (FDW) nodes, removing replica nodes, and stopping Springtail, see [`python/coordinator/README.md`](python/coordinator/README.md). + ## License This project is licensed under the Elastic License 2.0 (ELv2). From 3ecbb95b63e9b9e75f843a026af516a468663d15 Mon Sep 17 00:00:00 2001 From: Craig Soules Date: Mon, 11 May 2026 18:58:56 -0700 Subject: [PATCH 3/4] Fixes #903 (#1202) Fix for XidMgr_Test.ThreadedTest --- include/sys_tbl_mgr/table.hh | 13 +++++++-- src/storage/test/test_table.cc | 45 +++++++++++++++++++++++++++++++ src/xid_mgr/test/threaded_test.cc | 20 +++++++------- 3 files changed, 67 insertions(+), 11 deletions(-) diff --git a/include/sys_tbl_mgr/table.hh b/include/sys_tbl_mgr/table.hh index 1a87dba7d..fc72b80b2 100644 --- a/include/sys_tbl_mgr/table.hh +++ b/include/sys_tbl_mgr/table.hh @@ -184,6 +184,10 @@ namespace indexer_helpers { const Tracker& tb = b; if (ta == tb) { + // Vacant table: both trackers carry null btrees; nothing more to compare. + if (a._btree == nullptr) { + return true; + } return (a._btree_i == a._btree->end() || a._page_i == b._page_i); } return false; @@ -496,8 +500,8 @@ namespace indexer_helpers { */ Iterator end(uint64_t index_id = constant::INDEX_PRIMARY, bool index_only = false) { - // check for vacant table - if (index_id == constant::INDEX_PRIMARY && _primary_index == nullptr) { + // check for vacant table - return a vacant iterator regardless of index type + if (_primary_index == nullptr) { return Iterator(this); } return Iterator(this, index_id, index_only); @@ -512,6 +516,11 @@ namespace indexer_helpers { if (idx == 0) { return _primary_index; } + // Vacant table: _secondary_indexes is empty; return null instead of throwing, + // matching the primary-index behavior on a vacant table. + if (_primary_index == nullptr) { + return nullptr; + } return _secondary_indexes.at(idx).first; } diff --git a/src/storage/test/test_table.cc b/src/storage/test/test_table.cc index 99bd36dc6..9396978e2 100644 --- a/src/storage/test/test_table.cc +++ b/src/storage/test/test_table.cc @@ -260,6 +260,51 @@ namespace { uint64_t Table_Test::_db_id = 1; std::filesystem::path Table_Test::_base_dir; + TEST_P(Table_Test, CreateVacant) { + auto client = XidMgrClient::get_instance(); + auto server = xid_mgr::XidMgrServer::get_instance(); + uint64_t access_xid = client->get_committed_xid(1, 0); + uint64_t target_xid = access_xid + 1; + + // Register the table in sys_tbl_mgr but do NOT create any data files + // (skip _create_mtable / finalize). The table directory will not exist. 
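+ // With no data files on disk the table comes up "vacant": _primary_index
+ // stays null, _secondary_indexes is empty, and index() returns nullptr.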
+ _init_sys_tbls(target_xid, 9999, "test_create_vacant"); + server->commit_xid(1, 1, target_xid, false, 0); + + // Roots all UNKNOWN_EXTENT; primary (0), one secondary (1), look-aside (max). + std::vector roots = { + {0, constant::UNKNOWN_EXTENT}, + {1, constant::UNKNOWN_EXTENT}, + {std::numeric_limits::max(), constant::UNKNOWN_EXTENT} + }; + + auto table = _create_table(9999, target_xid, roots); + + // Sanity: confirm the table is truly vacant (directory does not exist). + ASSERT_FALSE(std::filesystem::exists(table->get_dir_path())); + + auto key = _create_key("aaaa"); + + // Primary index: search ops + iterator comparison must not crash. + ASSERT_TRUE(table->has_primary()); + ASSERT_TRUE(table->empty()); + ASSERT_TRUE(table->primary_lookup(key) == constant::UNKNOWN_EXTENT); + ASSERT_TRUE(table->lower_bound(key) == table->end()); + ASSERT_TRUE(table->upper_bound(key) == table->end()); + ASSERT_TRUE(table->inverse_lower_bound(key) == table->end()); + ASSERT_TRUE(table->begin() == table->end()); + + // Secondary index (id 1): used to throw because _secondary_indexes was empty. + ASSERT_TRUE(table->lower_bound(key, 1) == table->end(1)); + ASSERT_TRUE(table->upper_bound(key, 1) == table->end(1)); + ASSERT_TRUE(table->inverse_lower_bound(key, 1) == table->end(1)); + ASSERT_TRUE(table->begin(1) == table->end(1)); + + // index() should return nullptr for both primary and secondary on a vacant table. + ASSERT_TRUE(table->index(0) == nullptr); + ASSERT_TRUE(table->index(1) == nullptr); + } + TEST_P(Table_Test, CreateEmpty) { auto client = XidMgrClient::get_instance(); auto server = xid_mgr::XidMgrServer::get_instance(); diff --git a/src/xid_mgr/test/threaded_test.cc b/src/xid_mgr/test/threaded_test.cc index bb8ed236d..73071d18c 100644 --- a/src/xid_mgr/test/threaded_test.cc +++ b/src/xid_mgr/test/threaded_test.cc @@ -110,9 +110,11 @@ namespace { for (int i = 0; i < iterations; i++) { std::unique_lock lock(_mx); uint64_t xid = client->get_committed_xid(1, 0); - auto now = std::chrono::system_clock::now(); - auto timestamp = std::chrono::duration_cast(now.time_since_epoch()); - server->commit_xid(1, 1, xid + 1, false, timestamp.count()); + // Monotonic by construction. system_clock can step backward + // (NTP, container clock skew) which would trip the server's + // monotonic-timestamp invariant. 
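+ // _next_timestamp is shared by all worker threads; the _mx lock held
+ // above makes the increment and the commit_xid call an atomic pair.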
+ uint64_t timestamp = ++_next_timestamp; + server->commit_xid(1, 1, xid + 1, false, timestamp); } LOG_INFO("Thread: {}, finished", thread_id); @@ -121,6 +123,9 @@ namespace { std::vector _threads; std::unique_ptr _subscriber; std::mutex _mx; + uint64_t _next_timestamp = static_cast( + std::chrono::duration_cast( + std::chrono::system_clock::now().time_since_epoch()).count()); }; TEST_F(XidMgr_Test, ThreadedTest) @@ -150,12 +155,9 @@ namespace { xid_mgr::XidMgrServer *server = xid_mgr::XidMgrServer::get_instance(); uint64_t xid = client->get_committed_xid(1, 0); - auto now = std::chrono::system_clock::now(); - auto timestamp = std::chrono::duration_cast(now.time_since_epoch()); - - server->commit_xid(1, 1, xid + 1, false, timestamp.count()); - server->commit_xid(1, 1, xid + 2, false, timestamp.count()); - server->commit_xid(1, 1, xid + 3, false, timestamp.count()); + server->commit_xid(1, 1, xid + 1, false, ++_next_timestamp); + server->commit_xid(1, 1, xid + 2, false, ++_next_timestamp); + server->commit_xid(1, 1, xid + 3, false, ++_next_timestamp); sleep(1); From 3ed85bf930befc3ba9a8d7686df67762e274908c Mon Sep 17 00:00:00 2001 From: Craig Soules Date: Mon, 11 May 2026 22:45:19 -0400 Subject: [PATCH 4/4] more updates to the README --- python/coordinator/README.md | 94 +++++++++++++++++++++++++++++++++--- 1 file changed, 86 insertions(+), 8 deletions(-) diff --git a/python/coordinator/README.md b/python/coordinator/README.md index c52f7f675..b9df1bc76 100644 --- a/python/coordinator/README.md +++ b/python/coordinator/README.md @@ -5,6 +5,13 @@ lifecycle of every springtail service running on a host. One coordinator runs per host and is responsible for a single service type: `ingestion`, `fdw`, or `proxy`. +> **Note:** the coordinator is intended as a *reference example* of how to +> deploy Springtail end-to-end (binary install from S3, systemd-managed +> postgres, Redis-driven lifecycle, SNS notifications, etc.). It encodes the +> assumptions of one particular deployment — paths, AWS services, secret +> layout, systemd unit naming, and so on — and will likely need to be +> adjusted (or replaced) to match a different target environment. + The coordinator: 1. Downloads and installs binaries from S3 (production only). @@ -87,10 +94,11 @@ Flags: 5. **Build the scheduler and register components.** A `ComponentFactory` produces the `Component` objects for the selected service. For `fdw`, the coordinator first calls `Production.install_pgfdw` to install the - `springtail_fdw` extension into the local PostgreSQL, then waits for the - ingestion service to be reachable (pings `XidMgrClient` and - `SysTblMgrClient` until both respond) before starting postgres and the - FDW daemons. + `springtail_fdw` extension into the local PostgreSQL (see + [Host PostgreSQL prerequisites](#host-postgresql-prerequisites) for what + this assumes about the cluster), then waits for the ingestion service to + be reachable (pings `XidMgrClient` and `SysTblMgrClient` until both + respond) before starting postgres and the FDW daemons. 6. **Start components in order.** `Scheduler.start_all` launches each component by ascending `startup_order` and waits for each to be running before moving on. Once everything is up, the coordinator state is set to @@ -141,16 +149,86 @@ config record under the Redis hash `:fdw`. The coordinator does not provision new hosts itself — that is the responsibility of the controller / infrastructure layer. 
From the coordinator's perspective, adding a replica is simply "stand up a new host -and start an FDW coordinator on it." The flow is: +and start an FDW coordinator on it." See +[Host PostgreSQL prerequisites](#host-postgresql-prerequisites) first for +what each FDW host must have in place before the coordinator can run; the +flow below assumes those prerequisites are met. + +### Host PostgreSQL prerequisites + +The FDW coordinator does not install or initialize PostgreSQL — it expects +a specific cluster to already be running on the host. The local-cluster +ansible at `local-cluster/opt/springtail/helpers/customize-pg.yml` is the +canonical reference for how that cluster is laid down; the requirements +below summarize what the coordinator itself actually depends on. + +**1. The custom (patched) PostgreSQL 16 build.** Springtail does not run on +stock PostgreSQL. The FDW host must run the patched PG 16 fork (currently +`postgresql-16.9`, distributed as `springtail-pg.tar.gz`), which adds RLS +support for foreign tables that the `springtail_fdw` extension relies on. +The reference build is in `docker/ansible/roles/custom-pg/tasks/main.yml`: +it downloads the tarball from `custom_pg_package_url`, configures with +`--prefix=/usr/lib/postgresql/16` (with `--bindir` / `--datadir` / +`--libdir` under that prefix), and runs `make install-world`. + +`install_pgfdw` and `PostgresComponent` discover the install layout at +runtime via `pg_config --sharedir`, `--pkglibdir`, `--bindir`, and +`--version`, so **the `pg_config` first on the coordinator's PATH must +resolve to the custom build**. If a stock `pg_config` shadows it, the FDW +extension files will be copied into the wrong cluster. + +**2. Cluster naming driven by the FDW superuser.** The data directory, +systemd unit, OS owner, and environment file are all named after the +`fdw_superuser` username pulled from AWS Secrets Manager (secret +`sk/{org_id}/{account_id}/aws/dbi/{db_instance_id}/primary_db_password`, +read via `Properties.get_role(DB_USER_ROLE_FDW)`). Throughout this section +`{fdw_user}` is that username and `{pg_version}` is the PG major version +returned by `pg_config --version` (currently `16`): + +| What | Path / name | +| --- | --- | +| Data directory | `/var/lib/postgresql/{pg_version}/{fdw_user}` | +| `pg_hba.conf` | `/var/lib/postgresql/{pg_version}/{fdw_user}/pg_hba.conf` | +| `postmaster.pid` | `/var/lib/postgresql/{pg_version}/{fdw_user}/postmaster.pid` (overridable via the `-f` flag) | +| Environment file | `/etc/postgresql/{pg_version}/{fdw_user}/environment` | +| systemd unit | `postgresql-{fdw_user}.service` | + +The OS user `{fdw_user}` must exist and have passwordless sudo — the +coordinator runs `sudo cp`, `sudo systemctl start/stop ...`, and +`sudo -u {fdw_user} psql ...` for liveness and connection-count probes. + +The maintenance database that `is_alive` / `get_connection_count` connect +to comes from `PGDATABASE` (defaults to `__springtail` in the env-driven +path, `postgres` when properties are loaded from a `config.yaml`) and must +already exist in the cluster. + +**3. What `install_pgfdw` mutates on every coordinator start/reload.** It +does not create the cluster, but it does modify it on each invocation (see +`python/coordinator/production.py:201`): + +- Copies the FDW extension into the cluster's `sharedir/extension` + (`springtail_fdw--1.0.sql`, `springtail_fdw.control`) and the shared + library into `pkglibdir` as `springtail_fdw.so`. 
+- Removes any stale `libspringtail_pgext.so` from the springtail install's + `shared-lib` directory. +- Rewrites the local-auth line in `pg_hba.conf` to `scram-sha-256`. +- Writes the springtail runtime env vars (the `ENV_VARS` list in + `production.py` — Redis, AWS, mount, FDW, `LD_LIBRARY_PATH`) into the + postgres environment file so the systemd unit picks them up. +- Stops postgres, so the scheduler can start it cleanly under the refreshed + environment in the next step of the startup sequence. + +### Flow 1. **Provision an FDW config record.** Add the new FDW to `system.json` under `fdws` (or insert it directly into Redis at `:fdw` and `:fdw_ids`). Newly added FDW configs are written with `state = "initialize"` (see `Properties._load_redis`). -2. **Provision the host.** Bring up an EC2 instance (or equivalent) that - has PostgreSQL installed locally and has the springtail environment - variables set. The coordinator requires: +2. **Provision the host.** Bring up an EC2 instance (or equivalent) with + the custom PostgreSQL cluster configured per + [Host PostgreSQL prerequisites](#host-postgresql-prerequisites) and the + springtail environment variables set. The coordinator requires: - `SERVICE_NAME=fdw` - `FDW_ID=` - `INSTANCE_KEY` (used as part of the coordinator-state record key)