
Delete task (DeleteAndMerge) on a large split causes unbounded memory (RSS) growth → OOM crashloop #6534

Description

@zchataiwala-abstract

Version: 0.8.0 (x86_64-unknown-linux-gnu)
Backend: GCS object storage, PostgreSQL metastore, janitor running only the delete-task service.

Summary

A single delete-by-query that matches only a handful of documents triggers a DeleteAndMerge on every split containing a match. For large splits (~11M docs / ~17 GB on disk, ~29 fast fields), the merge completes successfully, but RSS then grows without bound after merge-operation-success is logged, until the process is OOM-killed. This reproduces at every memory limit we tried (1 GiB, 14 GiB, and 40 GiB), so it is not a sizing problem: the working set grows past whatever ceiling is set.

Smaller splits in the same delete task (≤4.6M docs) finalize fine; only the large ones (≥6.7M docs) blow up.

Repro

  1. Index with many fast fields (~29) and large merged splits (split_num_docs_target = 10_000_000; observed splits up to ~11.3M docs / ~17 GB).
  2. Submit a delete-by-query matching a few docs spread across several splits.
  3. Janitor delete-task service plans one DeleteAndMerge per affected split.
  4. On the large splits: merge logs merge-operation-success, then RSS climbs steadily to the cgroup limit and the pod is OOM-killed. On restart it re-plans the same splits → crashloop. Each cycle also leaves orphaned Staged splits behind (uploaded/staged but never published).
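
For anyone reproducing step 2, here is a minimal sketch of building the delete-by-query request. The REST address, index ID, and query are hypothetical placeholders, and the `delete-tasks` endpoint path reflects our reading of the Quickwit delete API, so please check it against the 0.8 docs before relying on it. The script only builds the request; sending it is left commented out.

```python
import json
import urllib.request

# Assumptions (not from the report): the REST address, index ID, and query
# below are hypothetical placeholders. The endpoint path follows the Quickwit
# delete API as we understand it and should be verified for your version.
QUICKWIT_URL = "http://127.0.0.1:7280"
INDEX_ID = "my-index"


def build_delete_task_request(
    base_url: str, index_id: str, query: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a delete task."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/{index_id}/delete-tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_delete_task_request(QUICKWIT_URL, INDEX_ID, "user_id:12345")
print(request.get_method(), request.full_url)

# To actually submit it against a test cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

A query that matches only a few documents spread over several splits is enough; per the steps above, each affected split gets its own DeleteAndMerge.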

Observed (instrumented)

  • RSS (anonymous heap), not page cache, is what grows — confirmed via cgroup/cAdvisor (container_memory_rss climbs from ~4 GB to ~38 GB under a 40 GB limit, then OOM). ~2 CPU cores pegged throughout (compute-bound, not I/O).
  • The growth happens after merge-operation-success is logged, i.e. in the post-merge packaging/finalize phase, and no publish-new-splits is ever logged for the large splits.
  • We ruled out the obvious bounded allocations by reading the 0.8.0 source: split upload streams from disk in bounded chunks; hotcache is sparse; tag extraction bails above MAX_VALUES_PER_TAG_FIELD. None of these scales unboundedly with split size, so the RSS growth appears to be an accumulation or leak in the delete-merge finalize path.
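
The anonymous-versus-page-cache distinction above can be checked from inside the container without cAdvisor. The sketch below assumes a cgroup v2 host, where `memory.stat` reports `anon` and `file` in bytes; the path is the usual in-container location and may differ on your setup. The helper names are ours, not part of any tool.

```python
from pathlib import Path

# Assumption: cgroup v2 layout, where memory.stat has "anon" and "file"
# entries in bytes. The path may differ depending on the container runtime.
MEMORY_STAT_PATH = Path("/sys/fs/cgroup/memory.stat")


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines from memory.stat into a dict of integers."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def anon_fraction(stats: dict[str, int]) -> float:
    """Share of (anon + file) memory that is anonymous, e.g. heap."""
    anon = stats.get("anon", 0)
    file_backed = stats.get("file", 0)
    total = anon + file_backed
    return anon / total if total else 0.0


def report_current(path: Path = MEMORY_STAT_PATH) -> str:
    """Read the live memory.stat file and format a one-line summary."""
    stats = parse_memory_stat(path.read_text())
    return (
        f"anon={stats.get('anon', 0)} file={stats.get('file', 0)} "
        f"anon_fraction={anon_fraction(stats):.2f}"
    )


# Usage inside the janitor container, polled while the merge finalizes:
# print(report_current())
```

If `anon` keeps climbing while `file` stays roughly flat, the growth is heap rather than page cache, which matches what we saw.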

Impact

  • A trivial delete (a few docs) can wedge the janitor indefinitely on any index that has large splits, and orphans Staged splits in object storage on every crash cycle.
  • Workaround: disable the delete-task service and keep split_num_docs_target small so delete-merges only ever operate on small splits. This does not resolve already-large splits, which only clear via retention or reindexing.

Questions for maintainers

  • Is this a known issue / is there a fix in a later release?
  • Is the post-merge memory for a delete-merge expected to scale with split size (and if so, roughly how)? A bound proportional to split size would at least make sizing predictable; the unbounded growth we see suggests a leak.
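
To make the "bounded versus leak" question concrete, here is a rough sketch of how RSS samples taken after merge-operation-success could be classified. It fits a least-squares slope to the most recent samples and, if memory is still growing, estimates the time left before the limit. The function names and the sample format (`(seconds, rss_bytes)` pairs) are hypothetical; this is an illustration, not something from the Quickwit codebase.

```python
from typing import Optional


def tail_slope(samples: list[tuple[float, float]], tail: int = 5) -> float:
    """Least-squares slope (bytes per second) over the last `tail` samples."""
    window = samples[-tail:]
    n = len(window)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in window) / n
    mean_r = sum(r for _, r in window) / n
    numerator = sum((t - mean_t) * (r - mean_r) for t, r in window)
    denominator = sum((t - mean_t) ** 2 for t, _ in window)
    if denominator == 0:
        raise ValueError("timestamps must differ")
    return numerator / denominator


def seconds_until_limit(
    samples: list[tuple[float, float]], limit_bytes: float, tail: int = 5
) -> Optional[float]:
    """Estimated seconds until RSS hits the limit, or None if not growing."""
    slope = tail_slope(samples, tail)
    if slope <= 0:
        return None  # plateaued or shrinking: consistent with a bounded phase
    _, last_rss = samples[-1]
    return max(0.0, (limit_bytes - last_rss) / slope)
```

A bounded finalize phase should eventually return `None` (a plateau) at some level related to split size, whereas in our runs the slope would stay positive all the way to the cgroup limit.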
