diff --git a/docs/plans/memory-analysis-results.md b/docs/plans/memory-analysis-results.md new file mode 100644 index 0000000000..5cb9cf5553 --- /dev/null +++ b/docs/plans/memory-analysis-results.md @@ -0,0 +1,100 @@ +## Memory Diagnostics Results + +### Data Collection + +Server: smp19.simplex.im, PostgreSQL backend, `useCache = False` +RTS flags: `+RTS -N -A16m -I0.01 -Iw15 -s -RTS` (16 cores) + +### Mar 20 Data (1 hour, 07:19-08:19) + +``` +Time rts_live rts_heap rts_large rts_frag clients non-large +07:19 7.5 GB 8.2 GB 5.5 GB 0.03 GB 14,000 2.0 GB +07:24 6.4 GB 10.8 GB 5.2 GB 3.6 GB 14,806 1.2 GB +07:29 8.2 GB 10.8 GB 6.5 GB 1.8 GB 15,667 1.7 GB +07:34 10.0 GB 12.3 GB 7.9 GB 1.4 GB 15,845 2.1 GB +07:39 6.7 GB 13.0 GB 5.3 GB 5.6 GB 16,589 1.4 GB +07:44 8.5 GB 13.0 GB 6.7 GB 3.7 GB 16,283 1.8 GB +07:49 6.5 GB 13.0 GB 5.2 GB 5.8 GB 16,532 1.3 GB +07:54 6.0 GB 13.0 GB 4.8 GB 6.3 GB 16,636 1.2 GB +07:59 6.4 GB 13.0 GB 5.1 GB 5.9 GB 16,769 1.3 GB +08:04 8.3 GB 13.0 GB 6.5 GB 3.9 GB 17,352 1.8 GB +08:09 10.2 GB 13.0 GB 8.0 GB 1.9 GB 17,053 2.2 GB +08:14 5.6 GB 13.0 GB 4.5 GB 6.8 GB 17,147 1.1 GB +08:19 7.6 GB 13.0 GB 6.1 GB 4.6 GB 17,496 1.5 GB +``` + +non-large = rts_live - rts_large (normal Haskell heap objects: Maps, TVars, closures) + +### Mar 19 Data (5.5 hours, 07:49-13:19) + +rts_heap grew from 10.1 GB to 20.7 GB over 5.5 hours. +Post-GC rts_live floor rose from 5.5 GB to 9.1 GB. + +### Findings + +**1. Large/pinned objects dominate live data (60-80%)** + +`rts_large` = 4.5-8.0 GB out of 5.6-10.2 GB live. These are allocations > ~3KB that go on GHC's large object heap. They oscillate (not growing monotonically), meaning they are being allocated and freed constantly — transient, not leaked. + +**2. Fragmentation is the heap growth mechanism** + +`rts_heap ≈ rts_live + rts_frag`. The heap grows because pinned/large objects fragment GHC's block allocator. Once GHC expands the heap, it never shrinks. 
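The `rts_*` columns in the tables above map directly onto fields of `GCDetails` in GHC's `GHC.Stats` module. A minimal self-contained reader showing that mapping (a sketch — it assumes the process was started with `+RTS -T` or `-s`, otherwise RTS stats are disabled):

```haskell
import GHC.Stats

-- Column-to-field mapping used in the tables above:
--   rts_live  -> gcdetails_live_bytes
--   rts_heap  -> gcdetails_mem_in_use_bytes
--   rts_large -> gcdetails_large_objects_bytes
--   rts_frag  -> gcdetails_block_fragmentation_bytes
main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats disabled: run with +RTS -T (or -s)"
    else do
      d <- gc <$> getRTSStats  -- GCDetails of the most recent GC
      putStrLn $
        "live=" <> show (gcdetails_live_bytes d)
          <> " heap=" <> show (gcdetails_mem_in_use_bytes d)
          <> " large=" <> show (gcdetails_large_objects_bytes d)
          <> " frag=" <> show (gcdetails_block_fragmentation_bytes d)
```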
Growth pattern: +- Large objects allocated → occupy blocks +- Large objects freed → blocks can't be reused if ANY other object shares the block +- New allocations need fresh blocks → heap expands +- Heap never returns memory to OS + +**3. Non-large heap data is stable (~1.0-2.2 GB)** + +Normal Haskell objects (Maps, TVars, closures, client structures) account for only 1-2 GB. This scales with client count at ~100-130 KB/client and does NOT grow over time. + +**4. All tracked data structures are NOT the cause** + +- `clientSndQ=0, clientMsgQ=0` — TBQueues empty, no message accumulation +- `smpQSubs` oscillates ~1.0-1.4M — entries are cleaned up, not leaking +- `ntfStore` < 2K entries — negligible +- All proxy agent maps near 0 +- `loadedQ=0` — useCache=False confirmed working + +**5. Source of large objects is unclear without heap profiling** + +The 4.5-8.0 GB of large objects could come from: +- PostgreSQL driver (`postgresql-simple`/`libpq`) — pinned ByteStrings for query results +- TLS library (`tls`) — pinned buffers per connection +- Network socket I/O — pinned ByteStrings for recv/send +- SMP protocol message blocks + +Cannot distinguish between these without `-hT` heap profiling (which is too expensive for this server). + +### Root Cause + +**GHC heap fragmentation from constant churn of large/pinned ByteString allocations.** + +Not a data structure leak. The live data itself is reasonable (5-10 GB for 15-17K clients). The problem is that GHC's copying GC cannot compact around pinned objects, so the heap grows with fragmentation and never shrinks. + +### Mitigation Options + +All are RTS flag changes — no rebuild needed, reversible by restart. + +**1. `-F1.2`** (reduce GC trigger factor from default 2.0) +- Triggers major GC when heap reaches 1.2x live data instead of 2x +- Reclaims fragmented blocks sooner +- Trade-off: more frequent GC, slightly higher CPU +- Risk: low — just makes GC run more often + +**2. 
Reduce `-A16m` to `-A4m`** (smaller nursery) +- More frequent minor GC → short-lived pinned objects freed faster +- Trade-off: more GC cycles, but each is smaller +- Risk: low — may actually improve latency by reducing GC pause times + +**3. `+RTS -xn`** (nonmoving GC) +- Designed for pinned-heavy workloads — avoids copying entirely +- Available since GHC 8.10, improved in 9.x +- Trade-off: different GC characteristics, less battle-tested +- Risk: medium — different GC algorithm, should test first + +**4. Limit concurrent connections** (application-level) +- Since large objects scale per-client, fewer clients = less fragmentation +- Trade-off: reduced capacity +- Risk: low but impacts users diff --git a/docs/plans/memory-analysis.md b/docs/plans/memory-analysis.md new file mode 100644 index 0000000000..9c45005fb2 --- /dev/null +++ b/docs/plans/memory-analysis.md @@ -0,0 +1,225 @@ +## Root Cause Analysis: SMP Server Memory Growth (23.5GB) + +### Environment + +- **Server**: smp19.simplex.im, ~21,927 connected clients +- **Storage**: PostgreSQL backend with `useCache = False` +- **RTS flags**: `+RTS -N -A16m -I0.01 -Iw15 -s -RTS` (16 cores) +- **Memory**: 23.5GB RES / 1031GB VIRT (75% of available RAM) + +### Log Summary + +- **Duration**: ~22 hours (Mar 16 12:12 → Mar 17 10:20) +- **92,277 proxy connection errors** out of 92,656 total log lines (99.6%) +- **292 unique failing destination servers**, top offender: `nowhere.moe` (12,875 errors) +- Only **145 successful proxy connections** + +--- + +### Known Factor: GHC Heap Sizing + +With 16 cores and `-A16m`: +- **Nursery**: 16 × 16MB = **256MB baseline** +- GHC default major GC threshold = **2× live data** — if live data is 10GB, heap grows to ~20GB before major GC +- The server is rarely idle with 22K clients, so major GC is deferred despite `-I0.01` +- This is an amplifier — whatever the actual live data size is, GHC roughly doubles it + +--- + +### Candidate Structures That Could Grow Unboundedly + +Analysis of 
the full codebase identified these structures that either grow without bound or have uncertain cleanup: + +#### 1. `SubscribedClients` maps — `Env/STM.hs:378` + +Both `subscribers.queueSubscribers` and `ntfSubscribers.queueSubscribers` (and their `serviceSubscribers`) use `SubscribedClients (TMap EntityId (TVar (Maybe (Client s))))`. + +Comment at line 376: *"The subscriptions that were made at any point are not removed"* + +`deleteSubcribedClient` IS called on disconnect (Server.hs:1112) and DOES call `TM.delete`. But it only deletes if the current stored client matches — if another client already re-subscribed, the old client's disconnect won't remove the entry. This is by design for mobile client continuity, but the net effect on map size over time is unclear without measurement. + +#### 2. ProxyAgent's subscription TMaps — `Client/Agent.hs:145-151` + +The `SMPClientAgent` has 4 TMaps that accumulate one top-level entry per unique destination server and **never remove** them: + +- `activeServiceSubs :: TMap SMPServer (TVar ...)` (line 145) +- `activeQueueSubs :: TMap SMPServer (TMap QueueId ...)` (line 146) +- `pendingServiceSubs :: TMap SMPServer (TVar ...)` (line 149) +- `pendingQueueSubs :: TMap SMPServer (TMap QueueId ...)` (line 150) + +Comment at line 262: *"these vars are never removed, they are only added"* + +These are only used for the proxy agent (SParty 'Sender), so they grow with each unique destination SMP server proxied to. With 292 unique servers in this log period, these are likely small — but long-running servers may accumulate thousands. + +`closeSMPClientAgent` (line 369) does NOT clear these 4 maps. + +#### 3. `NtfStore` — `NtfStore.hs:26` + +`NtfStore (TMap NotifierId (TVar [MsgNtf]))` — one entry per NotifierId. + +`deleteExpiredNtfs` (line 47) filters expired notifications from lists but does **not remove entries with empty lists** from the TMap. Over time, NotifierIds that no longer receive notifications leave zombie `TVar []` entries. 
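One possible shape for that cleanup — a hedged sketch, not the project's actual code: a `TMap` is modeled as a `Map` inside a `TVar`, and `MsgNtf` is reduced to its expiry timestamp. The point is the step the current `deleteExpiredNtfs` lacks: deleting outer entries whose lists become empty.

```haskell
import Control.Concurrent.STM
import Control.Monad (forM, unless)
import qualified Data.Map.Strict as M

type NotifierId = Int
type Timestamp = Int
type MsgNtf = Timestamp  -- stand-in: a notification reduced to its expiry time

-- Drop expired notifications, then remove outer entries whose lists
-- became empty, so inactive NotifierIds leave no zombie `TVar []`s.
expireNtfs :: TVar (M.Map NotifierId (TVar [MsgNtf])) -> Timestamp -> STM ()
expireNtfs store old = do
  m <- readTVar store
  emptied <- forM (M.toList m) $ \(nId, v) -> do
    kept <- filter (> old) <$> readTVar v  -- keep not-yet-expired entries
    writeTVar v kept
    pure (nId, null kept)
  let zombies = [nId | (nId, True) <- emptied]
  unless (null zombies) $
    modifyTVar' store $ \mm -> foldr M.delete mm zombies
```

Doing this in a single transaction keeps the check-then-delete atomic with respect to concurrent inserts; the real store may prefer per-entry transactions to bound STM retry cost on a large map.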
+ +`deleteNtfs` (line 44) does remove the full entry via `TM.lookupDelete` — but only called when a notifier is explicitly deleted. + +#### 4. `serviceLocks` in PostgresQueueStore — `Postgres.hs:112,469` + +`serviceLocks :: TMap CertFingerprint Lock` — one Lock per unique certificate fingerprint. + +`getCreateService` (line 469) calls `withLockMap (serviceLocks st) fp` which calls `getMapLock` (Agent/Client.hs:1029-1032) — this **unconditionally inserts** a Lock into the TMap. There is **no cleanup code** for serviceLocks anywhere. This is NOT guarded by `useCache`. + +#### 5. `sentCommands` per proxy client connection — `Client.hs:580` + +Each `PClient` has `sentCommands :: TMap CorrId (Request err msg)`. Entries are added per command sent (line 1369) and only removed when a response arrives (line 698). If a connection drops before all responses arrive, entries remain until the `PClient` is GC'd. Since `PClient` is captured by the connection thread which terminates on error, the `PClient` should become GC-eligible — but GC timing depends on heap pressure. + +#### 6. `subQ :: TQueue (ClientSub, ClientId)` — `Env/STM.hs:363` + +Unbounded `TQueue` for subscription changes. If the subscriber thread (`serverThread`) can't process changes fast enough, this queue grows without backpressure. With 22K clients subscribing/unsubscribing, sustained bursts could cause this queue to bloat. + +--- + +### Ruled Out + +1. **PostgreSQL queue cache**: `useCache = False` — `queues`, `senders`, `links`, `notifiers` TMaps are empty. +2. **`notifierLocks`**: Guarded by `useCache` (Postgres.hs:377,405) — not used with `useCache = False`. +3. **Client structures**: 22K × ~3KB = ~66MB — negligible. +4. **TBQueues**: Bounded (`tbqSize = 128`). +5. **Thread management**: `forkClient` uses weak refs + `finally` blocks. `endThreads` cleared on disconnect. +6. **Proxy `smpClients`/`smpSessions`**: Properly cleaned on disconnect/expiry. +7. 
**`smpSubWorkers`**: Properly cleaned on worker completion; also cleared in `closeSMPClientAgent`. +8. **`pendingEvents`**: Atomically swapped empty every `pendingENDInterval`. +9. **Stats IORef counters**: Fixed number, bounded. +10. **DB connection pool**: Bounded `TBQueue` with bracket-based return. + +--- + +### Insufficient Data to Determine Root Cause + +Without measuring the actual sizes of these structures at runtime, we cannot determine which (if any) is the primary contributor. The following exact logging changes will identify the root cause. + +--- + +### EXACT LOGS TO ADD + +Add a new periodic logging thread in `src/Simplex/Messaging/Server.hs`. + +Insert at `Server.hs:197` (after `prometheusMetricsThread_`): + +```haskell + <> memoryDiagThread_ cfg +``` + +Then define: + +```haskell + memoryDiagThread_ :: ServerConfig s -> [M s ()] + memoryDiagThread_ ServerConfig {prometheusInterval = Just _} = + [memoryDiagThread] + memoryDiagThread_ _ = [] + + memoryDiagThread :: M s () + memoryDiagThread = do + labelMyThread "memoryDiag" + Env { ntfStore = NtfStore ntfMap + , server = srv@Server {subscribers, ntfSubscribers} + , proxyAgent = ProxyAgent {smpAgent = pa} + , msgStore_ = ms + } <- ask + let interval = 300_000_000 -- 5 minutes + liftIO $ forever $ do + threadDelay interval + -- GHC RTS stats + rts <- getRTSStats + let liveBytes = gcdetails_live_bytes $ gc rts + heapSize = gcdetails_mem_in_use_bytes $ gc rts + gcCount = gcs rts + -- Server structures + clientCount <- IM.size <$> getServerClients srv + -- SubscribedClients (queue and service subscribers for both SMP and NTF) + smpQSubs <- M.size <$> getSubscribedClients (queueSubscribers subscribers) + smpSSubs <- M.size <$> getSubscribedClients (serviceSubscribers subscribers) + ntfQSubs <- M.size <$> getSubscribedClients (queueSubscribers ntfSubscribers) + ntfSSubs <- M.size <$> getSubscribedClients (serviceSubscribers ntfSubscribers) + -- Pending events + smpPending <- IM.size <$> readTVarIO 
(pendingEvents subscribers) + ntfPending <- IM.size <$> readTVarIO (pendingEvents ntfSubscribers) + -- NtfStore + ntfStoreSize <- M.size <$> readTVarIO ntfMap + -- ProxyAgent maps + let SMPClientAgent {smpClients, smpSessions, activeServiceSubs, activeQueueSubs, pendingServiceSubs, pendingQueueSubs, smpSubWorkers} = pa + paClients <- M.size <$> readTVarIO smpClients + paSessions <- M.size <$> readTVarIO smpSessions + paActSvc <- M.size <$> readTVarIO activeServiceSubs + paActQ <- M.size <$> readTVarIO activeQueueSubs + paPndSvc <- M.size <$> readTVarIO pendingServiceSubs + paPndQ <- M.size <$> readTVarIO pendingQueueSubs + paWorkers <- M.size <$> readTVarIO smpSubWorkers + -- Loaded queue counts + lc <- loadedQueueCounts $ fromMsgStore ms + -- Log everything + logInfo $ + "MEMORY " + <> "rts_live=" <> tshow liveBytes + <> " rts_heap=" <> tshow heapSize + <> " rts_gc=" <> tshow gcCount + <> " clients=" <> tshow clientCount + <> " smpQSubs=" <> tshow smpQSubs + <> " smpSSubs=" <> tshow smpSSubs + <> " ntfQSubs=" <> tshow ntfQSubs + <> " ntfSSubs=" <> tshow ntfSSubs + <> " smpPending=" <> tshow smpPending + <> " ntfPending=" <> tshow ntfPending + <> " ntfStore=" <> tshow ntfStoreSize + <> " paClients=" <> tshow paClients + <> " paSessions=" <> tshow paSessions + <> " paActSvc=" <> tshow paActSvc + <> " paActQ=" <> tshow paActQ + <> " paPndSvc=" <> tshow paPndSvc + <> " paPndQ=" <> tshow paPndQ + <> " paWorkers=" <> tshow paWorkers + <> " loadedQ=" <> tshow (loadedQueueCount lc) + <> " loadedNtf=" <> tshow (loadedNotifierCount lc) + <> " ntfLocks=" <> tshow (notifierLockCount lc) +``` + +Note: `smpSubs.subsCount` (queueSubscribers size) and `smpSubs.subServicesCount` (serviceSubscribers size) are **already logged** in Prometheus (lines 475-496). The log above adds all other candidate structures plus GHC RTS memory stats. 
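Offline, the collected lines reduce to a time series with a few lines of Haskell. A sketch (field names assumed to match the `MEMORY` format above; `nonLarge` is the derived metric used in the results doc):

```haskell
import qualified Data.Map.Strict as M

-- Parse one "MEMORY k=v k=v ..." log line into a map of counters.
parseMemLine :: String -> M.Map String Integer
parseMemLine s =
  M.fromList
    [ (k, read (drop 1 v))
    | w <- words s
    , let (k, v) = break (== '=') w
    , not (null v)  -- skips bare words like "[INFO]" and "MEMORY"
    ]

-- Derived metric: live heap minus large objects (normal Haskell heap data).
nonLarge :: M.Map String Integer -> Maybe Integer
nonLarge m = (-) <$> M.lookup "rts_live" m <*> M.lookup "rts_large" m
```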
+ +This produces a single log line every 5 minutes: + +``` +[INFO] MEMORY rts_live=10737418240 rts_heap=23488102400 rts_gc=4521 clients=21927 smpQSubs=1847233 smpSSubs=42 ntfQSubs=982112 ntfSSubs=31 smpPending=0 ntfPending=0 ntfStore=512844 paClients=12 paSessions=12 paActSvc=0 paActQ=0 paPndSvc=0 paPndQ=0 paWorkers=3 loadedQ=0 loadedNtf=0 ntfLocks=0 +``` + +### What Each Metric Tells Us + +| Metric | What it reveals | If growing = suspect | +|--------|----------------|---------------------| +| `rts_live` | Actual live data after last major GC | Baseline — everything else should add up to this | +| `rts_heap` | Total heap (should be ~2× rts_live) | If >> 2× live, fragmentation issue | +| `clients` | Connected client count | Known: ~22K | +| `smpQSubs` | SubscribedClients map size (queue subs) | If >> clients × avg_subs, entries not cleaned | +| `smpSSubs` | SubscribedClients map size (service subs) | Should be small | +| `ntfQSubs` | NTF SubscribedClients map (queue subs) | Same concern as smpQSubs | +| `ntfSSubs` | NTF SubscribedClients map (service subs) | Should be small | +| `smpPending` / `ntfPending` | Pending END/DELD events per client | If large, subscriber thread lagging | +| `ntfStore` | NotifierId count in NtfStore | If growing monotonically, zombie entries | +| `paClients` | Proxy connections to other servers | Should be <= unique dest servers | +| `paSessions` | Active proxy sessions | Should match paClients | +| `paActSvc` / `paActQ` | Proxy active subscriptions | If growing, entries never removed | +| `paPndSvc` / `paPndQ` | Proxy pending subscriptions | If growing, resubscription stuck | +| `paWorkers` | Active reconnect workers | If growing, workers stuck in retry | +| `loadedQ` | Cached queues in store (0 with useCache=False) | Should be 0 | +| `ntfLocks` | Notifier locks in store | Should be 0 with useCache=False | + +### Interpretation Guide + +**If `smpQSubs` is in the millions**: SubscribedClients is the primary leak. 
Entries accumulate for every queue ever subscribed to. + +**If `ntfStore` grows monotonically**: Zombie notification entries (empty lists after expiration). Fix: `deleteExpiredNtfs` should remove entries with empty lists. + +**If `paActSvc` + `paActQ` grow**: Proxy agent subscription maps are the leak. Fix: add cleanup when no active/pending subs exist for a server. + +**If `rts_live` is much smaller than `rts_heap`**: GHC heap fragmentation. Fix: tune `-F` flag (GC trigger factor) or use `-c` (compacting GC). + +**If `rts_live` ~ 10-12GB**: The live data is genuinely large. Look at which metric is the largest contributor. + +**If nothing above is large but `rts_live` is large**: The leak is in a structure not measured here — likely TLS connection buffers, ByteString retention from Postgres queries, or GHC runtime overhead. Next step would be heap profiling with `-hT`. diff --git a/src/Simplex/Messaging/Server.hs b/src/Simplex/Messaging/Server.hs index 3d977dc8c4..3403563218 100644 --- a/src/Simplex/Messaging/Server.hs +++ b/src/Simplex/Messaging/Server.hs @@ -91,7 +91,7 @@ import qualified Data.X509 as X import qualified Data.X509.Validation as XV import GHC.Conc.Signal import GHC.IORef (atomicSwapIORef) -import GHC.Stats (getRTSStats) +import GHC.Stats (RTSStats (..), GCDetails (..), getRTSStats) import GHC.TypeLits (KnownNat) import Network.Socket (ServiceName, Socket, socketToHandle) import qualified Network.TLS as TLS @@ -198,6 +198,7 @@ smpServer started cfg@ServerConfig {transports, transportConfig = tCfg, startOpt <> serverStatsThread_ cfg <> prometheusMetricsThread_ cfg <> controlPortThread_ cfg + <> [memoryDiagThread] ) `finally` stopServer s where @@ -719,6 +720,75 @@ smpServer started cfg@ServerConfig {transports, transportConfig = tCfg, startOpt Nothing -> acc Just (_, ts) -> (cnt + 1, updateTimeBuckets ts ts' times) + memoryDiagThread :: M s () + memoryDiagThread = do + labelMyThread "memoryDiag" + Env + { ntfStore = NtfStore ntfMap, + server = 
srv@Server {subscribers, ntfSubscribers}, + proxyAgent = ProxyAgent {smpAgent = pa}, + msgStore_ = ms + } <- ask + let SMPClientAgent {smpClients, smpSessions, activeServiceSubs, activeQueueSubs, pendingServiceSubs, pendingQueueSubs, smpSubWorkers} = pa + liftIO $ forever $ do + threadDelay 300_000_000 -- 5 minutes + rts <- getRTSStats + let GCDetails {gcdetails_live_bytes, gcdetails_mem_in_use_bytes, gcdetails_large_objects_bytes, gcdetails_compact_bytes, gcdetails_block_fragmentation_bytes} = gc rts + clientCount <- IM.size <$> getServerClients srv + smpQSubs <- M.size <$> getSubscribedClients (queueSubscribers subscribers) + smpSSubs <- M.size <$> getSubscribedClients (serviceSubscribers subscribers) + ntfQSubs <- M.size <$> getSubscribedClients (queueSubscribers ntfSubscribers) + ntfSSubs <- M.size <$> getSubscribedClients (serviceSubscribers ntfSubscribers) + smpPending <- IM.size <$> readTVarIO (pendingEvents subscribers) + ntfPending <- IM.size <$> readTVarIO (pendingEvents ntfSubscribers) + ntfStoreSize <- M.size <$> readTVarIO ntfMap + paClients' <- M.size <$> readTVarIO smpClients + paSessions' <- M.size <$> readTVarIO smpSessions + paActSvc <- M.size <$> readTVarIO activeServiceSubs + paActQ <- M.size <$> readTVarIO activeQueueSubs + paPndSvc <- M.size <$> readTVarIO pendingServiceSubs + paPndQ <- M.size <$> readTVarIO pendingQueueSubs + paWorkers <- M.size <$> readTVarIO smpSubWorkers + lc <- loadedQueueCounts $ fromMsgStore ms + -- per-client metrics: total subscriptions and queue fill + clients <- getServerClients srv + let clientsList = IM.elems clients + totalSubs <- sum <$> mapM (\Client {subscriptions} -> M.size <$> readTVarIO subscriptions) clientsList + totalSndQ <- sum <$> mapM (\Client {sndQ} -> fromIntegral <$> atomically (lengthTBQueue sndQ)) clientsList + totalMsgQ <- sum <$> mapM (\Client {msgQ} -> fromIntegral <$> atomically (lengthTBQueue msgQ)) clientsList + totalEndThreads <- sum <$> mapM (\Client {endThreads} -> IM.size <$> readTVarIO 
endThreads) clientsList + logInfo $ + "MEMORY" + <> " rts_live=" <> tshow gcdetails_live_bytes + <> " rts_heap=" <> tshow gcdetails_mem_in_use_bytes + <> " rts_max_live=" <> tshow (max_live_bytes rts) + <> " rts_large=" <> tshow gcdetails_large_objects_bytes + <> " rts_compact=" <> tshow gcdetails_compact_bytes + <> " rts_frag=" <> tshow gcdetails_block_fragmentation_bytes + <> " rts_gc=" <> tshow (gcs rts) + <> " clients=" <> tshow clientCount + <> " clientSubs=" <> tshow totalSubs + <> " clientSndQ=" <> tshow (totalSndQ :: Int) + <> " clientMsgQ=" <> tshow (totalMsgQ :: Int) + <> " clientThreads=" <> tshow totalEndThreads + <> " smpQSubs=" <> tshow smpQSubs + <> " smpSSubs=" <> tshow smpSSubs + <> " ntfQSubs=" <> tshow ntfQSubs + <> " ntfSSubs=" <> tshow ntfSSubs + <> " smpPending=" <> tshow smpPending + <> " ntfPending=" <> tshow ntfPending + <> " ntfStore=" <> tshow ntfStoreSize + <> " paClients=" <> tshow paClients' + <> " paSessions=" <> tshow paSessions' + <> " paActSvc=" <> tshow paActSvc + <> " paActQ=" <> tshow paActQ + <> " paPndSvc=" <> tshow paPndSvc + <> " paPndQ=" <> tshow paPndQ + <> " paWorkers=" <> tshow paWorkers + <> " loadedQ=" <> tshow (loadedQueueCount lc) + <> " loadedNtf=" <> tshow (loadedNotifierCount lc) + <> " ntfLocks=" <> tshow (notifierLockCount lc) + runClient :: Transport c => X.CertificateChain -> C.APrivateSignKey -> TProxy c 'TServer -> c 'TServer -> M s () runClient srvCert srvSignKey tp h = do ms <- asks msgStore