Skip to content

feat: Add setting to force enable/disable text index support for a source#2277

Open
pulpdrew wants to merge 1 commit into
mainfrom
drew/force-enable-text-index
Open

feat: Add setting to force enable/disable text index support for a source#2277
pulpdrew wants to merge 1 commit into
mainfrom
drew/force-enable-text-index

Conversation

@pulpdrew
Copy link
Copy Markdown
Contributor

@pulpdrew pulpdrew commented May 14, 2026

Summary

This PR adds a new source setting which allows for force-enabling or force-disabling text index support (generation of hasAllToken conditions for lucene implicit column search) for the source.

This is useful when

  1. A text index exists on the implicit column but HyperDX can't detect it due to a non-standard schema setup (typically one that includes the merge table engine). Force-enable handles this case.
  2. A text index is detected but shouldn't be used yet (maybe it isn't yet populated or is yielding undesirable results for some queries). Force-disable handles this case.

The default and recommended behaviour is still auto - where the index is detected at query time.

Screenshots or video

Merge engine based schema
CREATE TABLE otel_logs_local
(
    `Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `TraceId` String CODEC(ZSTD(1)),
    `SpanId` String CODEC(ZSTD(1)),
    `TraceFlags` UInt32 CODEC(ZSTD(1)),
    `SeverityText` LowCardinality(String) CODEC(ZSTD(1)),
    `SeverityNumber` Int32 CODEC(ZSTD(1)),
    `ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
    `Body` String CODEC(ZSTD(1)),
    `ResourceSchemaUrl` String CODEC(ZSTD(1)),
    `ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `ScopeSchemaUrl` String CODEC(ZSTD(1)),
    `ScopeName` String CODEC(ZSTD(1)),
    `ScopeVersion` String CODEC(ZSTD(1)),
    `ScopeAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `LogAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_key mapKeys(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_scope_attr_value mapValues(ScopeAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_lower_body lower(Body) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 8,
    INDEX idx_ngram_body lower(Body) TYPE ngrambf_v1(3, 32768, 3, 0) GRANULARITY 8,
    INDEX fts_body Body TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lowerUTF8(Body)) GRANULARITY 100000000
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SeverityText, toUnixTimestamp(Timestamp), TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(30)
SETTINGS ttl_only_drop_parts = 1, index_granularity = 8192;

CREATE TABLE otel_logs_merged (
  `Timestamp` DateTime64(9),
  `TraceId` String,
  `SpanId` String,
  `TraceFlags` UInt32,
  `SeverityText` LowCardinality(String),
  `SeverityNumber` Int32,
  `ServiceName` LowCardinality(String),
  `Body` String,
  `ResourceSchemaUrl` String,
  `ResourceAttributes` Map(LowCardinality(String), String),
  `ScopeSchemaUrl` String,
  `ScopeName` String,
  `ScopeVersion` String,
  `ScopeAttributes` Map(LowCardinality(String), String),
  `LogAttributes` Map(LowCardinality(String), String)
) ENGINE = Merge ('default', '^otel_logs_local(_old)?$');

CREATE TABLE otel_logs_merged_distributed (
  `Timestamp` DateTime64(9) CODEC (Delta (8), ZSTD (1)),
  `TraceId` String CODEC (ZSTD (1)),
  `SpanId` String CODEC (ZSTD (1)),
  `TraceFlags` UInt32 CODEC (ZSTD (1)),
  `SeverityText` LowCardinality(String) CODEC (ZSTD (1)),
  `SeverityNumber` Int32 CODEC (ZSTD (1)),
  `ServiceName` LowCardinality(String) CODEC (ZSTD (1)),
  `Body` String CODEC (ZSTD (1)),
  `ResourceSchemaUrl` String CODEC (ZSTD (1)),
  `ResourceAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1)),
  `ScopeSchemaUrl` String CODEC (ZSTD (1)),
  `ScopeName` String CODEC (ZSTD (1)),
  `ScopeVersion` String CODEC (ZSTD (1)),
  `ScopeAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1)),
  `LogAttributes` Map(LowCardinality(String), String) CODEC (ZSTD (1))
) ENGINE = Distributed (
  'hdx_cluster',
  'default',
  'otel_logs_merged',
  rand()
);

With this schema, using the default (auto) setting will result in hasToken(lower(Body), token) conditions which don't use the text index effectively due to the lower() function:

Screenshot 2026-05-14 at 10 43 05 AM

When force-enabled, the same query will now use the correct hasAllToken syntax:

Screenshot 2026-05-14 at 10 43 50 AM Screenshot 2026-05-14 at 10 42 41 AM

How to test on Vercel preview

This can be tested locally by creating a schema similar to the one provided above.

References

  • Linear Issue: Closes HDX-4132 Closes HDX-2178
  • Related PRs:

@vercel
Copy link
Copy Markdown

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 14, 2026 3:01pm

Request Review

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 14, 2026

🦋 Changeset detected

Latest commit: 4d1ece1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages
Name Type
@hyperdx/common-utils Patch
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/cli Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (3):
    • packages/api/src/routers/external-api/v2/sources.ts
    • packages/api/src/tasks/checkAlerts/index.ts
    • packages/api/src/tasks/checkAlerts/template.ts
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 22
  • Production lines changed: 296 (+ 289 in test files, excluded from tier calculation)
  • Branch: drew/force-enable-text-index
  • Author: pulpdrew

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Review

✅ No critical issues found.

A few minor observations (non-blocking):

  • ℹ️ Enabled mode intentionally bypasses both the tokenizer-type check (splitByNonAlpha) and the enable_full_text_index ClickHouse setting check that Auto performs. This is documented behavior and explicitly tested, but users who force-enable on a server with enable_full_text_index = 0 or a non-splitByNonAlpha tokenizer may hit query-time SQL errors. Worth considering a clarifying note in the form helpText ("only valid for text indices using splitByNonAlpha tokenizer").
  • ℹ️ The pattern 'useTextIndexForImplicitColumn' in source ? ... : undefined is duplicated across ~6 call sites (helpers.ts, eventPatterns.ts, ChartEditor/utils.ts, useDashboardFilterValues.tsx, ServicesDashboardPage.tsx, etc.). Could be consolidated into a small helper alongside the existing pickSampleWeightExpressionProps pattern, but consistent with current patterns.
  • ℹ️ OpenAPI spec is manually duplicated for log/trace source schemas — matches the existing approach in the file.

Test coverage looks solid: source-router CRUD/validation tests plus 7 queryParser tests covering auto-with-index, auto-without-index, enabled-without-index, enabled-with-disabled-setting, disabled-with-index, disabled-with-bloom-filter, and enabled-with-separator-fallback.

@github-actions
Copy link
Copy Markdown
Contributor

Deep Review

✅ No critical issues found. The diff is additive, backward-compatible (omitted field defaults to Auto, preserving prior behavior), and well-tested at the serializer level. Recommendations below address one missed propagation path and several testing/maintainability concerns.

🟡 P2 — recommended

  • packages/api/src/mcp/tools/query/eventPatterns.ts:156sampleConfig and countConfig forward implicitColumnExpression from a Log/Trace source but drop useTextIndexForImplicitColumn, so a user's Force enable/Force disable preference is silently ignored on the MCP runEventPatterns path.
    • Fix: Extract useTextIndexForImplicitColumn from the source the same way implicitColumn is extracted on line 138, and include it on both sampleConfig and countConfig.
  • packages/common-utils/src/types.ts:1424isSearchableSource returns boolean instead of being a type predicate, forcing the isLogSource(source) || isTraceSource(source) ? source.useTextIndexForImplicitColumn : undefined pattern to be repeated at ~10 new call sites; a missed site would silently fall back to Auto.
    • Fix: Promote isSearchableSource to source is TLogSource | TTraceSource and add a pickImplicitColumnProps(source) helper next to pickSampleWeightExpressionProps to collapse the duplicated triples.
    • maintainability, kieran-typescript
  • packages/api/src/tasks/checkAlerts/index.ts:137computeAliasWithClauses and getChartConfigFromAlert now propagate the new field but neither path has any test, so a typo or a missed branch for a new SourceKind would silently regress alert evaluation to Auto without test failure.
    • Fix: Add a test in packages/api/src/tasks/checkAlerts/__tests__/checkAlerts.test.ts that drives getChartConfigFromAlert with a Log source carrying useTextIndexForImplicitColumn: Enabled and asserts the produced ChartConfig carries the value through.
  • packages/common-utils/src/__tests__/queryParser.test.ts:1179 — new describe block covers single-token happy paths but does not exercise negation, prefix/suffix wildcards, empty term, or a non-splitByNonAlpha tokenizer paired with Enabled, all of which are observable behaviors of the new branch.
    • Fix: Add assertions for Enabled+negated single/multi-token, Disabled+negated, Enabled+wildcard fallback to LIKE, Enabled+empty term (1=1), and Enabled against a text index with tokenizer=ngram (verify hasAllTokens still emitted).
🔵 P3 nitpicks (8)
  • packages/common-utils/src/types.ts:18 — JSDoc on UseTextIndex.Disabled says "never use a text index; fall back to LIKE/hasToken", but the new test at queryParser.test.ts:1287 shows bloom_filter tokens() paths still run; SourceForm.tsx:1163 help text mirrors the same overstatement.
    • Fix: Reword both copies to "never emit hasAllTokens(); bloom_filter tokens() optimizations may still apply when available".
  • packages/common-utils/src/queryParser.ts:1199Enabled mode skips the enable_full_text_index = 1 ClickHouse setting check, so on older ClickHouse instances every query against the source fails with "Unknown function hasAllTokens"; the form offers no guard or warning.
    • Fix: Either gracefully degrade to LIKE/hasToken when enableTextIndexPromise resolves to false even in Enabled mode, or expand the form help text to call out the ClickHouse version requirement.
  • packages/common-utils/src/core/renderChartConfig.ts:544 — the pair implicitColumnExpression / useTextIndexForImplicitColumn is now threaded verbatim through 7 distinct renderWhereExpressionStr call sites in this single file; any future implicit-column option will repeat 7 times and any missed site silently reverts to Auto.
    • Fix: Extract a pickImplicitColumnFields(chartConfig) helper and spread it at each call (or accept the slice as a single arg), so additions are one-line and drift risk drops.
  • packages/api/src/routers/external-api/v2/sources.ts:357 — JSDoc description uses lowercase "lucene" while the adjacent implicitColumnExpression field on the same router uses "Lucene".
    • Fix: Capitalize "Lucene" in both new JSDoc blocks (lines 357 and 530) and regenerate openapi.json.
  • packages/app/src/components/Sources/SourceForm.tsx:1170SelectControlled shows the Auto label as a placeholder while the form value is undefined, so a user who never touches the control persists no field but sees "Auto" selected; symmetric to the persisted-auto case but the stored shape differs.
    • Fix: Either set defaultValues.useTextIndexForImplicitColumn = UseTextIndex.Auto so the displayed and persisted values match, or strip the placeholder so the empty state is visibly empty.
  • packages/common-utils/src/core/renderChartConfig.ts:822 — parameter typed as useTextIndexForImplicitColumn?: ChartConfigWithOptDateRangeEx['useTextIndexForImplicitColumn'] couples the helper signature to the chart-config type unnecessarily.
    • Fix: Type it directly as useTextIndexForImplicitColumn?: UseTextIndex.
  • packages/api/src/routers/api/__tests__/sources.test.ts:807 — invalid-value POST test only asserts expect(400) and sources.length === 0, so a regression that drops the enum check while another field fails validation would still pass.
    • Fix: Assert the response body references the rejected field (e.g., useTextIndexForImplicitColumn appears in the Zod error path).
  • packages/common-utils/src/types.ts:1311 — Zod field uses .optional() while OpenAPI/JSDoc mark it nullable: true, mirroring the pre-existing inconsistency on sibling optional source fields; clients sending null will hit a 400.
    • Fix: Either switch to .nullish() and normalize null → undefined before persistence, or drop nullable: true from the JSDoc/openapi.json for this field.

Reviewers (7): correctness, testing, maintainability, project-standards, kieran-typescript, api-contract, data-integrity

Testing gaps:

  • No end-to-end test that useTextIndexForImplicitColumn flows source → chart config → rendered SQL through buildSearchChartConfig, convertFormStateToChartConfig, or any of the 7 renderChartConfig propagation sites.
  • No coverage for Auto/Enabled/Disabled × negation, wildcards, empty term, or non-splitByNonAlpha tokenizer.
  • checkAlerts (computeAliasWithClauses, getChartConfigFromAlert, alert template) and MCP runConfigTile propagation is untested.
  • No test exercises the constructor's ?? UseTextIndex.Auto default explicitly (back-compat for legacy documents).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

E2E Test Results

All tests passed • 178 passed • 3 skipped • 1201s

Status Count
✅ Passed 178
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@pulpdrew pulpdrew force-pushed the drew/force-enable-text-index branch from 583106b to 4d1ece1 Compare May 14, 2026 14:58
@pulpdrew pulpdrew requested review from a team and fleon and removed request for a team May 14, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant