Skip to content

HDDS-15208. OM should learn to finalize from SCM#10236

Open
sodonnel wants to merge 36 commits into
apache:HDDS-14496-zdufrom
sodonnel:HDDS-15208
Open

HDDS-15208. OM should learn to finalize from SCM#10236
sodonnel wants to merge 36 commits into
apache:HDDS-14496-zdufrom
sodonnel:HDDS-15208

Conversation

@sodonnel
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

When OM is started, it checks if finalization is needed. If so, it spawns a background service which polls SCM to see if it is time to finalize or not. If SCM indicates it should finalize, then it will finalize the OMs via Ratis and then shutdown the background service to avoid any further polling.

The poll interval defaults to 60 seconds, but its configurable via ozone.om.upgrade.finalization.check.interval.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15208

How was this patch tested?

Existing integration tests were modified to remove the calls to the old OM finalize command, and instead trigger finalize via SCM. New unit tests have been added for the new polling service.

@github-actions github-actions Bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label May 11, 2026
@errose28 errose28 self-requested a review May 11, 2026 15:44
Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this. Overall looks good, just left some minor comments inline.

Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. My understanding is that this is being done in two PRs, where this one adds the OM server side handling of finalization and a follow-up PR will add the client command that starts the background service and writes the finalizing key. Can you update the PR description?

appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{DEFAULT} | %-5level | %c{1} | %msg | %throwable{3} %n
appender.console.layout.pattern = %-5level | %c{1} | %msg%n
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the pattern for both audit loggers changed? Can we use the original pattern for both?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removing the exception and timestamp from being logged to the console. IMO we should restore this to what it was before since it is easer to trace log messages in a test env from the console than opening dedicated files. additivity was also set to false for the audit logs which means they won't go to the console anymore. I think we should remove that so all log messages are searchable in one place.

Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two comments left. Looks like there's some new test failures to address as well.

appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{DEFAULT} | %-5level | %c{1} | %msg | %throwable{3} %n
appender.console.layout.pattern = %-5level | %c{1} | %msg%n
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removing the exception and timestamp from being logged to the console. IMO we should restore this to what it was before since it is easer to trace log messages in a test env from the console than opening dedicated files. additivity was also set to false for the audit logs which means they won't go to the console anymore. I think we should remove that so all log messages are searchable in one place.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since request processing is mocked in this test, we should have a unit test for OMFinalizeUpgradeRequest/Response classes as well, similar to other OM requests. Looks like we didn't have those previously but they would be good to add.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants