Skip to content

Add skipSecondaryIndexes option to skip building secondary indexes#18761

Draft
deepthi912 wants to merge 9 commits into
apache:masterfrom
deepthi912:forward-index-and-dictionary-only
Draft

Add skipSecondaryIndexes option to skip building secondary indexes#18761
deepthi912 wants to merge 9 commits into
apache:masterfrom
deepthi912:forward-index-and-dictionary-only

Conversation

@deepthi912

@deepthi912 deepthi912 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a new table-level config tableIndexConfig.skipSecondaryIndexes (default false). When enabled:

  1. New segments are built with only the forward index. BaseSegmentCreator skips every non-forward index creator during the per-row build loop and short-circuits updatePostSegmentCreationIndexes (so star-tree is also skipped). Dictionary is still written for dict-encoded columns because the forward index needs it to resolve dict-ids.

  2. Existing segments load as-is and preprocess does not add/remove secondary indexes. SegmentPreProcessor.process() runs only the ForwardIndexHandler (still allows forward-encoding upgrades) and default-column materialization; it skips the loop over all other index handlers, plus star-tree and multi-col text. Min/max generation is also skipped — it is purely a query-time pruning hint. needProcess() mirrors the same skips so preprocess isn't triggered just to add a secondary index. The on-disk loader (PhysicalColumnIndexContainer) is intentionally untouched, so segments with pre-existing secondary indexes keep serving them.

Motivation

Useful for workloads that don't need secondary indexes during initial ingest so the ingestion and segment building is faster(or for an async "build indexes later" flow where the flag is later flipped off and a reload runs the missing handlers).

Config

{
  "tableIndexConfig": {
    "skipSecondaryIndexes": true
  }
}

Reload / minion override — setOverrideSkipSecondaryIndexes

skipSecondaryIndexes is meant to govern initial server-side load and fresh ingestion only. Reload paths and minion-driven segment rebuilds need to materialize secondary indexes synchronously against the latest table config, even when the persisted flag is true. To avoid forcing users to flip the table config back and forth, both IndexLoadingConfig and SegmentGeneratorConfig carry a transient setOverrideSkipSecondaryIndexes(true) flag that suppresses the skip just for the in-flight operation. The override lives on the per-call config instance, is never persisted, and is automatically applied at these entry points:

  • Server-side reload paths in BaseTableDataManagerreloadSegment, reloadSegments, reloadAllSegments, and doReplaceSegment (segment-refresh from SegmentRefreshMessage). So POST /segments/{table}/reload, any controller-triggered reload, and minion SegmentRefreshMessage all build secondary indexes synchronously.
  • Minion segment-rebuild executorsSegmentProcessorFramework (used by MergeRollupTask, RealtimeToOfflineSegmentsTask, UpsertCompactMergeTask), UpsertCompactionTaskExecutor, RefreshSegmentTaskExecutor, and MaterializedViewTaskExecutor flip the override on the SegmentGeneratorConfig they hand to the build driver, so the rebuilt segment carries all secondary indexes.

Net effect: the only path that actually honors skipSecondaryIndexes=true end-to-end is the Helix state-transition (boot / segment first-load) and the initial offline/realtime build. Any subsequent reload or minion rebuild produces a fully-indexed segment.

@deepthi912 deepthi912 added the index Related to indexing (general) label Jun 15, 2026
@deepthi912 deepthi912 force-pushed the forward-index-and-dictionary-only branch from 6a95f14 to 43ec9f5 Compare June 15, 2026 07:10
@codecov-commenter

codecov-commenter commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.77%. Comparing base (5617ee7) to head (1a3d761).
⚠️ Report is 27 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18761      +/-   ##
============================================
+ Coverage     64.76%   64.77%   +0.01%     
  Complexity     1309     1309              
============================================
  Files          3380     3381       +1     
  Lines        209573   209955     +382     
  Branches      32805    32888      +83     
============================================
+ Hits         135735   136008     +273     
- Misses        62914    62985      +71     
- Partials      10924    10962      +38     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.77% <100.00%> (+0.01%) ⬆️
temurin 64.77% <100.00%> (+0.01%) ⬆️
unittests 64.77% <100.00%> (+0.01%) ⬆️
unittests1 56.97% <100.00%> (+0.01%) ⬆️
unittests2 37.25% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@deepthi912 deepthi912 marked this pull request as ready for review June 16, 2026 01:47
Drops the new IndexingConfig.skipSecondaryIndexes flag in favor of the existing
tableIndexConfig.skipSegmentPreprocess. The reload override mechanism that was
on the new flag (setOverrideSkipSecondaryIndexes) moves onto the existing flag
(setOverrideSkipSegmentPreprocess). All four reload entry points in
BaseTableDataManager and the four minion-build sites (SegmentProcessorFramework,
UpsertCompactionTaskExecutor, RefreshSegmentTaskExecutor, MaterializedViewTaskExecutor)
use the renamed override.

Net effect end-to-end is unchanged from the prior commit set: server-boot load
honors skipSegmentPreprocess and ships forward-only; reload paths (manual /reload,
AsyncIndexReloader on the STP side, segment-refresh from minion) flip the override
on the transient IndexLoadingConfig and run full preprocess including secondary
indexes; minion segment rebuilds set the override on SegmentGeneratorConfig so
the freshly-built segment carries all secondary indexes regardless of the flag.

Reverts the per-handler gating in SegmentPreProcessor and the new test methods
in SegmentPreProcessorTest, since skipSegmentPreprocess already short-circuits
the entire preprocess via ImmutableSegmentLoader.needPreprocess().
Drops the four explicit setOverrideSkipSegmentPreprocess(true) calls from the
public reload entry points and replaces them with one setter at the top of
each of the two shared protected workers:

- BaseTableDataManager.reloadSegment(SegmentDataManager, IndexLoadingConfig,
  boolean) — funnels reloadSegment / reloadAllSegments / reloadSegments
- BaseTableDataManager.replaceSegmentIfCrcMismatch(SegmentDataManager,
  SegmentZKMetadata, IndexLoadingConfig) — funnels doReplaceSegment / minion
  SegmentRefreshMessage

The reload code path itself now implicitly ignores
tableIndexConfig.skipSegmentPreprocess — callers don't need to remember to set
anything, and a stray future reload entry point can't accidentally honor the
skip flag. Boot / state-transition loads continue to use plain
fetchIndexLoadingConfig() and respect the persisted flag.
…ocess method

Drops the centralization that made every reload silently ignore
skipSegmentPreprocess. Existing reload methods (reloadSegment, reloadAllSegments,
reloadSegments, doReplaceSegment) restore upstream behavior and honor the flag.

Adds a new explicit method instead:
  TableDataManager.reloadSegmentBypassingSkipPreprocess(String, boolean, String)
  InstanceDataManager.reloadSegmentBypassingSkipPreprocess(String, String, boolean, String)

Implementations:
  BaseTableDataManager — fetches IndexLoadingConfig, sets the per-call override,
    then delegates to the existing protected reloadSegment worker.
  HelixInstanceDataManager — looks up the table data manager and delegates.

Default interface methods delegate to the upstream reloadSegment so existing
custom implementations keep working without recompile. Callers that need to
materialize secondary indexes despite the persisted flag (AsyncIndexReloader's
TierSegmentPublishResource) call the new method explicitly.
Drops the build-time changes that made new segments forward-only at creation:
  - BaseSegmentCreator (per-column / post-creation gating)
  - SegmentGeneratorConfig (build-time override field + accessors)
  - SegmentProcessorFramework + 3 minion executors (which all set the build-time override)

The PR now contains only the two essential files for the runtime reload override:
  - IndexLoadingConfig: transient override field + setter + override-aware
    isSkipSegmentPreprocess()
  - BaseTableDataManager: two inline setter calls — one in the protected
    reloadSegment worker (funneling reloadSegment/reloadAllSegments/reloadSegments)
    and one in replaceSegmentIfCrcMismatch (funneling SegmentRefreshMessage)

skipSegmentPreprocess remains a load-time-only knob in stock Pinot. New segments
get built with whatever indexes the table config asks for. Server boot honors
the flag. Any reload (manual /reload, AsyncIndexReloader, minion refresh)
ignores the flag for the duration of that single reload's IndexLoadingConfig
instance — letting secondary indexes get materialized when the user actually
asks for a reload, without persisting any config change.

@xiangfu0 xiangfu0 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal issue; see inline comment.

// is discarded when the reload finishes. Centralizing it here means every public reload entry point
// (reloadSegment / reloadAllSegments / reloadSegments) gets the reload semantics for free; boot /
// state-transition loads continue to use a plain fetchIndexLoadingConfig() and honor skipSegmentPreprocess.
indexLoadingConfig.setOverrideSkipSegmentPreprocess(true);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the contract of the existing tableIndexConfig.skipSegmentPreprocess flag for every reload path, not just the async index-materialization flow. Today a table that sets this flag can rely on reloads honoring the persisted forward-only load policy; after this override, any manual/controller reload will silently build all secondary indexes instead. That is a behavior change for an existing config with real CPU/disk blast radius for current deployments. Please gate the bypass behind a new explicit reload API/flag instead of unconditionally overriding the persisted setting here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will going to do that! Ack

@deepthi912 deepthi912 marked this pull request as draft June 18, 2026 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

index Related to indexing (general)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants