Skip to content

feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time#19081

Open
wangxianghu wants to merge 1 commit into
apache:masterfrom
wangxianghu:event_time_ttl
Open

feat(partition-ttl): Introduce KeepByEventTimeStrategy to expire partitions by event time#19081
wangxianghu wants to merge 1 commit into
apache:masterfrom
wangxianghu:event_time_ttl

Conversation

@wangxianghu

@wangxianghu wangxianghu commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

Today Hudi ships two partition TTL strategies, both keyed off processing time:

  • KeepByCreationTimeStrategy uses the partition creation commit time recorded in .hoodie_partition_metadata. Backfilled historical partitions look "newly created" and never expire.
  • KeepByTimeStrategy uses the last commit time that touched the partition. Any late-arriving write extends the partition's lifetime, silently disabling TTL.

In practice, what most users actually want is event-time semantics: the date encoded in the partition path (e.g. dt=2026-04-24) is the business event date, and anything older than N days should be dropped —
regardless of when/how writes land.

This PR introduces a new strategy KeepByEventTimeStrategy that derives the partition's reference time from the partition path itself, decoupling TTL from write behavior.

Summary and Changelog

For users: a new KEEP_BY_EVENT_TIME partition TTL strategy that expires partitions based on the date encoded in the partition path. TTL is no longer affected by backfill, late-arriving data, or commit timing.

Supported partition path shapes (v1 scope):

Day, format=yyyy-MM-dd
time only: 2026-06-27, dt=2026-06-27 — startIndex 0
prefix + time: region=us/2026-06-27, region=us/dt=2026-06-27 — startIndex 1
time + suffix: 2026-06-27/source=app, dt=2026-06-27/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-06-27/source=app — startIndex 1

Day, format=yyyyMMdd
time only: 20260627, dt=20260627 — startIndex 0
prefix + time: region=us/20260627, region=us/dt=20260627 — startIndex 1
time + suffix: 20260627/source=app, dt=20260627/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260627/source=app — startIndex 1

Hour, format=yyyy-MM-dd/HH
time only: 2026-06-27/12, dt=2026-04-05/hh=12 — startIndex 0
prefix + time: region=us/2026-06-27/12, region=us/dt=2026-04-05/hh=12 — startIndex 1
time + suffix: 2026-06-27/12/source=app, dt=2026-04-05/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=2026-04-05/hh=12/source=app — startIndex 1

Hour, format=yyyyMMdd/HH
time only: 20260627/12, dt=20260405/hh=12 — startIndex 0
prefix + time: region=us/20260627/12, region=us/dt=20260405/hh=12 — startIndex 1
time + suffix: 20260627/12/source=app, dt=20260405/hh=12/source=app — startIndex 0
prefix + time + suffix: region=us/dt=20260405/hh=12/source=app — startIndex 1

Hive-style key names are not constrained: dt=, day=, event_date=, hh=, hour= all work; only the value after = is parsed.

Hive-style is honored from the table's hoodie.datasource.write.hive_style_partitioning flag (authoritative table-level setting, not guessed per-segment).

Impact

  • Public API: none broken. New enum value, three new configs (all with defaults), one new class — strictly additive.
  • User-facing: a new opt-in TTL strategy. Existing strategies and configs are untouched. Users adopt it by setting hoodie.partition.ttl.management.strategy.type=KEEP_BY_EVENT_TIME.
  • Performance: negligible — partition path parsing is in-memory string ops, runs only inside the existing TTL replace commit, no extra metadata reads.

Risk Level

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a KEEP_BY_EVENT_TIME partition TTL strategy that derives the partition's reference time from the date encoded in the partition path, decoupling TTL from write/commit timing. The change is cleanly additive (new enum value, three defaulted configs, one new class) and comes with good unit coverage of the parsing logic. A couple of edge cases worth double-checking in the inline comments — primarily the timezone handling and the fail-fast behavior across the partition set. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

@wangxianghu wangxianghu force-pushed the event_time_ttl branch 2 times, most recently from c8df5d8 to d64fbc2 Compare June 27, 2026 04:20
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 27, 2026
@wangxianghu

Copy link
Copy Markdown
Contributor Author

hi @stream2000 do you have time to take a look ?

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR adds a new KEEP_BY_EVENT_TIME partition TTL strategy that derives a partition's expiry from the date encoded in its path rather than from commit metadata, which is a useful complement to the existing commit-time strategies. The path-parsing logic is well-covered by the parameterized tests. A couple of correctness edge cases are worth double-checking in the inline comments — particularly how the inherited retention-to-millis math behaves at larger day values, and the multi-segment hive-default handling. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One minor style inconsistency in the new strategy class — the rest of the implementation is clean and well-documented.


@Override
protected List<String> getExpiredPartitionsForTimeStrategy(List<String> partitionPathsForTTL) {
long cutoffMillis = resolveCutoffMillis(instantTime, ttlInMilis);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 resolveCutoffMillis consumes the inherited ttlInMilis, which KeepByTimeStrategy computes as getPartitionTTLStrategyDaysRetain() * 1000 * 3600 * 24 in 32-bit int arithmetic. That overflows once retention >= 25 days: e.g. 30d wraps negative (so the parent's ttlInMilis <= 0 guard silently skips all TTL), while 50d wraps to ~7h and 365d to ~17d — which would delete almost everything. The new test only uses 10 days so it doesn't surface this. Could you widen the multiply to 1000L * 3600 * 24 (and maybe add a test at a realistic retention like 30/90 days) so the event-time path is safe for common configs?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

} else {
timeSegs[i] = seg;
}
if (PartitionPathEncodeUtils.DEFAULT_PARTITION_PATH.equals(timeSegs[i])) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This default-marker check runs per time segment inside the loop, so with deleteHiveDefaultPartition=true and a multi-segment time block it fires if any single segment is the default. For example dt=2026-06-28/hh=__HIVE_DEFAULT_PARTITION__ (date recent, only the hour column null) would be marked expired and deleted regardless of the recent date. Is dropping an otherwise-recent partition intended, or should the default-marker path only kick in when the whole time block is unresolvable?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

return null;
}
try {
return java.time.Instant.from(parsed).toEpochMilli();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: java.time.Instant, java.time.LocalDate, and java.time.DateTimeException appear as fully-qualified names here while the closely related DateTimeParseException is already imported — could you add the three missing imports to the import block so the style is consistent throughout the file?

⚠️ AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants