perf: bound trigram search index size with LEFT() truncation #2897

Open
dkindlund wants to merge 1 commit into teableio:develop from dkindlund:feat/search-index-truncation

@dkindlund
Contributor

Summary

Wraps GIN trigram index expressions with LEFT(expression, N) to bound index size. This addresses severe index bloat and write amplification on tables with many fields.

  • Adds SEARCH_INDEX_TRUNCATE_LENGTH env var (default: 1000, set to 0 to disable)
  • Wraps index expressions in FieldFormatter.getIndexExpression() with LEFT((expr)::text, N)
  • Existing indexes auto-rebuild via getAbnormalIndex() detection on next reconciliation cycle
  • Search queries unaffected (WHERE clause uses full column value via getSearchableExpression())
  • SQLite backend unaffected (all index methods return NO_OPERATION_SQL)
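
A minimal sketch of the first two bullets, for illustration only. The names `parseTruncateLength` and `wrapIndexExpression` are hypothetical and are not taken from the PR, and the fallback to the default on invalid input is an assumption rather than confirmed behavior.

```typescript
// Hypothetical sketch; names and fallback behavior are assumptions, not the PR's code.
const DEFAULT_TRUNCATE_LENGTH = 1000;

// Parse SEARCH_INDEX_TRUNCATE_LENGTH from an env-like map; 0 disables truncation.
function parseTruncateLength(env: Record<string, string | undefined>): number {
  const raw = env['SEARCH_INDEX_TRUNCATE_LENGTH'];
  if (raw === undefined || raw.trim() === '') return DEFAULT_TRUNCATE_LENGTH;
  const n = Number(raw);
  // Assumption: non-integer or negative values fall back to the default.
  return Number.isInteger(n) && n >= 0 ? n : DEFAULT_TRUNCATE_LENGTH;
}

// Wrap an index expression with LEFT((expr)::text, N) unless truncation is disabled.
function wrapIndexExpression(expr: string, truncateLength: number): string {
  return truncateLength > 0 ? `LEFT((${expr})::text, ${truncateLength})` : expr;
}
```

Passing the environment as a plain map keeps the parsing logic testable without touching the real process environment.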

Motivation

Production data from a 34K-row table with 70+ fields:

| Metric | Value |
| --- | --- |
| Data size | 117 MB |
| Total table size | 5.8 GB |
| Trigram indexes | 3.6 GB (70 indexes) |
| Index:data ratio | 31:1 |
| Largest unused index | html_content at 731 MB (0 search scans) |
| Indexes with 0 scans | 13 indexes, 1.9 GB total |

This caused:

  • Formula field creation stalled for 90+ minutes (34K rows × 70 index writes per row)
  • 3 container crashes during the stalled operation
  • Write amplification on every INSERT/UPDATE affecting ingest throughput

With LEFT(expr, 1000), the html_content index shrinks from ~731 MB to ~20-30 MB (only indexing the first 1K chars of each 20KB+ HTML blob). Short fields like title and status are unaffected since their values are already under 1000 characters.

Changes

| File | Change |
| --- | --- |
| threshold.config.ts | Add searchIndexTruncateLength config from env var |
| search-index-builder.postgres.ts | Add truncateLength to constructor and getIndexExpression() |
| db.provider.interface.ts | Update searchIndex() signature |
| postgres.provider.ts | Pass truncateLength to IndexBuilderPostgres constructor |
| sqlite.provider.ts | Update signature (parameter ignored) |
| table-index.service.ts | Add getSearchIndexBuilder() helper, pass config to all call sites |
| search-index-builder.postgres.spec.ts | Unit tests for truncation behavior |

How It Works

Index creation:

-- Before (unbounded)
CREATE INDEX idx_trgm_... USING gin (("html_content") gin_trgm_ops)

-- After (bounded to first 1000 chars)
CREATE INDEX idx_trgm_... USING gin ((LEFT(("html_content")::text, 1000)) gin_trgm_ops)

Search queries (unchanged):

-- WHERE clause still uses full column value
WHERE ("html_content") ILIKE '%search_term%'

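The intent is that the truncated index narrows the candidate rows and the full-column WHERE clause then filters to exact matches. This should be confirmed with EXPLAIN before relying on it: PostgreSQL's planner only considers an expression index when the query contains the same expression, so a predicate on the bare column ("html_content") does not match an index defined on LEFT(("html_content")::text, 1000). Search results stay correct either way, because the WHERE clause always evaluates the full value.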
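
The index-creation statements shown in this section could be generated roughly as follows. This is a sketch under stated assumptions, not the PR's builder: `quoteIdent` and `buildTrigramIndexSql` are hypothetical helpers, and the `ON <table>` clause (elided in the statements above) is filled with a caller-supplied table name.

```typescript
// Hypothetical sketch of generating the CREATE INDEX statement; not the PR's builder.

// Quote an identifier, doubling embedded double quotes as PostgreSQL requires.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build a GIN trigram index statement, bounded by LEFT() when truncateLength > 0.
function buildTrigramIndexSql(
  table: string,
  column: string,
  indexName: string,
  truncateLength: number
): string {
  const col = `(${quoteIdent(column)})`;
  const expr = truncateLength > 0 ? `(LEFT(${col}::text, ${truncateLength}))` : col;
  return `CREATE INDEX ${quoteIdent(indexName)} ON ${quoteIdent(table)} USING gin (${expr} gin_trgm_ops)`;
}
```

With a truncate length of 0 the helper emits the unbounded form, which mirrors the "Before" statement.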

Deployment

  1. Set SEARCH_INDEX_TRUNCATE_LENGTH=1000 (or use default)
  2. On next index reconciliation, getAbnormalIndex() detects all existing indexes as abnormal (definition mismatch)
  3. repairIndex() drops and recreates all indexes with the LEFT() expression
  4. One-time rebuild; subsequent checks find matching definitions

To revert: set SEARCH_INDEX_TRUNCATE_LENGTH=0 (triggers rebuild without LEFT())
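
The detection step can be pictured as a comparison between expected and existing index definitions. The sketch below is hypothetical and is not Teable's getAbnormalIndex(): the `IndexDef` shape and `findAbnormalIndexes` are illustrative names, and it assumes both sides hold definitions in the same textual form.

```typescript
// Hypothetical sketch of definition-mismatch detection; not Teable's getAbnormalIndex().
interface IndexDef {
  name: string;
  definition: string; // assumed to be in the same textual form on both sides
}

// Collapse whitespace so purely cosmetic differences do not trigger a rebuild.
function normalizeDefinition(definition: string): string {
  return definition.replace(/\s+/g, ' ').trim();
}

// Return names of expected indexes that are missing or whose definition differs.
function findAbnormalIndexes(existing: IndexDef[], expected: IndexDef[]): string[] {
  const actual = new Map<string, string>(
    existing.map((i): [string, string] => [i.name, normalizeDefinition(i.definition)])
  );
  return expected
    .filter((e) => actual.get(e.name) !== normalizeDefinition(e.definition))
    .map((e) => e.name);
}
```

Under this model, changing the truncate length changes every expected definition, so every existing index is reported once and then matches on later checks.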

Follow-on Enhancement

Per-field configurable truncate length via field metadata, allowing admins to set different thresholds for different fields (e.g., title gets full indexing, html_content gets 500 characters). This PR uses a global threshold as the initial implementation.
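
One way the per-field idea might resolve a length is sketched below. Nothing here exists in the PR: the `FieldSearchOptions` shape, the `searchIndexTruncateLength` property, and `resolveTruncateLength` are all hypothetical, and ignoring invalid overrides is an assumption.

```typescript
// Hypothetical sketch of the follow-on idea; the field metadata shape is an assumption.
interface FieldSearchOptions {
  // Per-field override; undefined means "use the global threshold", 0 means "index in full".
  searchIndexTruncateLength?: number;
}

// Prefer a valid per-field override, otherwise fall back to the global length.
function resolveTruncateLength(field: FieldSearchOptions, globalLength: number): number {
  const override = field.searchIndexTruncateLength;
  if (override === undefined) return globalLength;
  // Assumption: invalid overrides are ignored rather than rejected.
  return Number.isInteger(override) && override >= 0 ? override : globalLength;
}
```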

Test Plan

  • Unit tests for getIndexExpression() with/without truncation
  • Unit tests for createSingleIndexSql() output verification
  • Unit tests for getAbnormalIndex() detecting old-format indexes
  • Integration test: create field, verify index uses LEFT() expression
  • Manual test: deploy, verify search still works in UI
  • Manual test: compare table size before/after rebuild

🤖 Generated with Claude Code

Teable auto-creates GIN trigram indexes (idx_trgm_*) on every field for
search. On large tables with many fields, this causes massive index bloat
(e.g., 3.6 GB of indexes on 117 MB of data) and severe write amplification
on every INSERT/UPDATE.

This commit wraps index expressions with LEFT(expression, N) to bound
index size to the first N characters per field value. N is configurable
via SEARCH_INDEX_TRUNCATE_LENGTH env var (default: 1000). Setting it to
0 disables truncation (preserving current behavior).

For short fields (<= N chars), LEFT() is a no-op. For large JSON/HTML
fields, it dramatically reduces index size. Search results are unchanged
because the WHERE clause still evaluates the full column; whether the
planner can use the truncated index for that predicate should be
verified with EXPLAIN, since PostgreSQL matches expression indexes only
against identical expressions.

Existing indexes are automatically rebuilt on the next index
reconciliation cycle when getAbnormalIndex() detects the definition
mismatch. If the env var changes between reboots, the same mechanism
triggers a rebuild with the new threshold.

Production data that motivated this change:
- Articles table: 117 MB data, 5.8 GB total (3.6 GB indexes)
- 70 trigram indexes, 13 largest with zero search scans (1.9 GB)
- html_content index: 731 MB, 0 scans
- Formula field backfill (34K rows): 90+ min, 3 container crashes

Follow-on enhancement: per-field configurable truncate length via field
metadata, rather than a single global threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>