Obsolete resource handling for read-cache-after-write #3207

Open
csviri wants to merge 33 commits into operator-framework:next from csviri:obsolete-resource-cleanup

Conversation

@csviri
Collaborator

@csviri csviri commented Mar 8, 2026

  • Handles obsolete resources
  • Adds a new scheduled executor service to share between those (later we can use this also for TimerEventSource)

Signed-off-by: Attila Mészáros a_meszaros@apple.com

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 8, 2026
@csviri csviri changed the title Obsolete resource handling for read-cache-after-write [WIP] Obsolete resource handling for read-cache-after-write Mar 8, 2026
@csviri csviri linked an issue Mar 8, 2026 that may be closed by this pull request
@shawkins
Collaborator

shawkins commented Mar 8, 2026

edit: I started my initial review after seeing this had been associated with #3208. Based upon that description, this is for when a resource is added to the temporary cache, then deleted by another actor, and a relist happens before either the new or the delete event is received.

Those stale entries won't actually get used because of the logic in ManagedInformerEventSource.get, but if this happens a sufficient number of times for resources that never again receive an event it can be a memory leak.

In addition to something like this, ManagedInformerEventSource.get could be refined.

| in temporary | in cache | outcome |
|---|---|---|
| yes | yes | temporary used if later, but the stale entry is not proactively removed otherwise |
| yes | no | nothing returned - but could reason that this is a creation in temporary, not yet in the cache (using the cache's last resource version) |
| no | yes | cache is used - but could reason over deletion in temporary |
| no | no | nothing returned |

@csviri
Collaborator Author

csviri commented Mar 8, 2026

Those stale entries won't actually get used because of the logic in ManagedInformerEventSource.get, but if this happens a sufficient number of times for resources that never again receive an event it can be a memory leak.

I don't see why those would not be used:

Optional<R> resource = temporaryResourceCache.getResourceFromCache(resourceID);
var res = cache.get(resourceID);
if (comparableResourceVersions
    && resource.isPresent()
    && res.filter(
            r -> ReconcilerUtilsInternal.compareResourceVersions(r, resource.orElseThrow()) > 0)
        .isEmpty()) {
  log.debug("Latest resource found in temporary cache for Resource ID: {}", resourceID);
  return resource;
}

If there is a resource in TemporaryResourceCache (TRC), it will return that value.
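
To make the `filter(...).isEmpty()` semantics concrete, here is a minimal, self-contained sketch of the same shape of condition. It is not the framework code: `Res`, its numeric `version`, and the fallback to the cached value after the `if` are assumptions made for illustration only.

```java
import java.util.Optional;

final class TemporaryCacheLookupSketch {
  // Hypothetical stand-in for a resource; only a numeric version is modeled.
  record Res(String name, long version) {}

  // Same shape as the quoted condition: the temporary entry wins unless the
  // informer cache holds a strictly newer version. An empty cache entry also
  // yields an empty filter result, so the temporary entry is returned.
  static Optional<Res> get(Optional<Res> temporary, Optional<Res> cached) {
    if (temporary.isPresent()
        && cached.filter(c -> c.version() > temporary.orElseThrow().version()).isEmpty()) {
      return temporary;
    }
    // Assumption: fall back to whatever the informer cache has.
    return cached;
  }

  public static void main(String[] args) {
    Optional<Res> temp = Optional.of(new Res("cm", 5));
    System.out.println(get(temp, Optional.empty()));            // cache entry missing
    System.out.println(get(temp, Optional.of(new Res("cm", 7)))); // cache is newer
    System.out.println(get(temp, Optional.of(new Res("cm", 3)))); // cache is older
  }
}
```

The first call is the case under discussion: with no entry in the informer cache, the temporary entry is still returned.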

This PR ensures that, in the case you mentioned where we miss the delete event because of a re-list, we eventually remove those resources.

The related PR in the fabric8 client will allow us to ensure that we always get an up-to-date resource version (via bookmark), even if there are no further resource changes related to this informer. So we can clear these old resources from the TRC.

temporary used if later, but the stale entry is not proactively removed otherwise.

Stale entries are removed as a result of an event. This PR ensures that those for which we miss the DELETE event are removed too. Not sure why we would not proactively remove those?

@shawkins
Collaborator

shawkins commented Mar 8, 2026

If there is a resource in TemporaryResourceCache (TRC), it will return that value.

edit: I was initially reading that effectively as isPresent, not isEmpty. So if the entry doesn't exist, we'll still use the temporary version. Yes, that logic should change to not even look for the cache entry, but to test the latest resource version known to the informer.

Stale entries are removed as a result of an event. This PR ensures that those for which we miss the DELETE event are removed too. Not sure why we would not proactively remove those?

That goes back to the original form of the comment, if you were trying to keep the temporary cache as small as possible - if so, then anytime we determine an entry is stale, there's no reason not to remove it.

@csviri
Collaborator Author

csviri commented Mar 8, 2026

Yes, that logic should change to not even look for the cache entry, but to test the latest resource version known to the informer.

You mean the lastSyncVersion or the latestResourceVersion from the TRC? Hmm, but I guess that does not matter much in this case.

@shawkins
Collaborator

shawkins commented Mar 8, 2026

You mean the lastSyncVersion or the latestResourceVersion from the TRC? Hmm, but I guess that does not matter much in this case.

I mean the cache lastSyncResourceVersion - that would allow any relist (regardless of whether list watches are used, and any other client changes) to make the TRC resource seem obsolete.
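
One way to read this suggestion, as a minimal sketch: the lookup compares the temporary entry against the informer's last synced resource version instead of against the cached entry. The names (`Res`, `get`, `lastSyncVersion`) are hypothetical, and resource versions are modeled as plain `long` values, which is only meaningful when versions are comparable.

```java
import java.util.Optional;

final class LastSyncLookupSketch {
  // Hypothetical stand-in for a resource; only a numeric version is modeled.
  record Res(String name, long version) {}

  // The temporary entry is trusted only while it is newer than everything the
  // informer has synced. Once a relist moves lastSyncVersion past it, the
  // informer cache is authoritative, even if it has no entry (resource deleted).
  static Optional<Res> get(Optional<Res> temporary, Optional<Res> cached, long lastSyncVersion) {
    if (temporary.isPresent() && temporary.orElseThrow().version() > lastSyncVersion) {
      return temporary;
    }
    return cached;
  }

  public static void main(String[] args) {
    Optional<Res> temp = Optional.of(new Res("cm", 5));
    // Informer has not caught up yet: temporary entry is returned.
    System.out.println(get(temp, Optional.empty(), 4));
    // A relist has moved past version 5 and the cache has no entry: nothing is returned.
    System.out.println(get(temp, Optional.empty(), 9));
  }
}
```

With this shape, a relist alone is enough to make a leftover temporary entry invisible to readers, independent of whether a DELETE event was ever delivered.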

@csviri
Collaborator Author

csviri commented Mar 8, 2026

Yep, will change it also in this PR. Thank you.

We should still clear the cache as in this PR imo.

Also, if we can get the notification on the list from the client, that will cover the last corner case where no further events come in. But frankly I think we can release without that after this change, since the probability that all these circumstances happen is really close to zero.

@shawkins
Collaborator

shawkins commented Mar 9, 2026

Also, if we can get the notification on the list from the client, that will cover the last corner case where no further events come in. But frankly I think we can release without that after this change, since the probability that all these circumstances happen is really close to zero.

No it shouldn't hold up the release. You could even consider changing what you have here to be based upon a timer, and you'll be covered regardless.

@csviri csviri requested review from metacosm, shawkins and xstefank March 9, 2026 12:58
@csviri csviri force-pushed the obsolete-resource-cleanup branch from c074fcd to fb12deb Compare March 9, 2026 15:20
csviri added 12 commits March 9, 2026 17:47
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri force-pushed the obsolete-resource-cleanup branch from fb12deb to 5421153 Compare March 9, 2026 16:47
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
csviri added 3 commits March 10, 2026 09:07
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri changed the title [WIP] Obsolete resource handling for read-cache-after-write Obsolete resource handling for read-cache-after-write Mar 10, 2026
csviri added 3 commits March 10, 2026 09:41
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@shawkins
Collaborator

Not sure if "obsolete" is a good word to use here (I mean across the code), but "stale" did not sound strong enough. If you have a better idea for naming, please don't hold it back @shawkins :)

Mentioning consistency or obsolete seems to be good. You could refer to the pruning as dealing with "ghost" entries.

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri marked this pull request as ready for review March 10, 2026 13:07
Copilot AI review requested due to automatic review settings March 10, 2026 13:07
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2026
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri
Collaborator Author

csviri commented Mar 10, 2026

@shawkins @xstefank @metacosm please take a look and see whether this makes sense to you this way. It should be fairly polished now.

@csviri
Collaborator Author

csviri commented Mar 10, 2026

Yes, maybe "ghost resource" is a better name.

Contributor

Copilot AI left a comment

Pull request overview

This PR enhances the framework’s read-cache-after-write behavior by introducing periodic cleanup of “obsolete” entries in the temporary resource cache, and wiring the cleanup interval into informer configuration.

Changes:

  • Add scheduled obsolete-entry detection/removal to TemporaryResourceCache, driven by informer lastSyncResourceVersion().
  • Extend informer configuration/annotation to support an obsoleteResourceCacheCheckInterval with a default.
  • Update tests and sample logging configs; remove an unused sample dependency.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| operator-framework-core/src/main/java/.../TemporaryResourceCache.java | Adds scheduled obsolete-resource checking and uses informer sync RV for consistency decisions. |
| operator-framework-core/src/main/java/.../ManagedInformerEventSource.java | Constructs the updated temporary cache and uses sync RV in get(...) logic. |
| operator-framework-core/src/main/java/.../InformerManager.java | Exposes lastSyncResourceVersion(namespace) to support new cache logic. |
| operator-framework-core/src/main/java/.../ExecutorServiceManager.java | Adds a scheduled executor accessor/initialization for periodic tasks. |
| operator-framework-core/src/main/java/.../InformerConfiguration.java | Adds config field + defaults for obsolete check interval. |
| operator-framework-core/src/main/java/.../InformerEventSourceConfiguration.java | Moves comparable-RV configuration into InformerConfiguration and adds new interval builder method. |
| operator-framework-core/src/main/java/.../Informer.java | Adds annotation attribute for obsolete check interval. |
| operator-framework-core/src/test/java/.../TemporaryPrimaryResourceCacheTest.java | Adds tests for obsolete-resource removal behavior. |
| operator-framework-core/src/test/java/.../InformerEventSourceTest.java | Updates mocks to include new configuration. |
| operator-framework-core/src/test/java/.../ControllerEventSourceTest.java | Updates config setup for new interval plumbing. |
| operator-framework-core/src/main/java/.../InformerWrapper.java | Adds getter exposing the underlying informer (used for sync RV retrieval). |
| sample-operators/.../log4j2.xml | Adjusts logging levels/layouts in samples/tests. |
| sample-operators/leader-election/pom.xml | Removes org.takes:takes dependency. |
| test.sh | Adds a helper script to build/load a Docker image into kind. |

Comment on lines +237 to +251
var iterator = cache.entrySet().iterator();
while (iterator.hasNext()) {
  var e = iterator.next();
  if (ReconcilerUtilsInternal.compareResourceVersions(
              e.getValue().getMetadata().getResourceVersion(),
              getLatestResourceVersion(e.getValue().getMetadata().getNamespace()))
          < 0
      // making sure we have the situation where resource is missing from the cache
      && managedInformerEventSource
          .manager()
          .get(ResourceID.fromResource(e.getValue()))
          .isEmpty()) {
    iterator.remove();
    managedInformerEventSource.handleEvent(ResourceAction.DELETED, e.getValue(), null, true);
    log.debug("Removing obsolete resource with ID: {}", e.getKey());

Copilot AI Mar 10, 2026

checkObsoleteResources() iterates over a ConcurrentHashMap and calls iterator.remove(), but ConcurrentHashMap iterators don’t support element removal and will throw UnsupportedOperationException at runtime. Remove entries via cache.remove(key) / cache.remove(key, value) (or collect keys to remove) instead of using iterator.remove().

Collaborator Author

This is not true.
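
For context on this reply: the iterators of `ConcurrentHashMap`'s collection views are weakly consistent and do support `remove()`. A small standalone sketch (class and method names are made up for illustration, unrelated to the framework's cache types) that prunes entries through the entry-set iterator:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrentMapIteratorRemoval {
  // Removes every entry whose value is below the threshold, using the
  // entry-set iterator, and reports how many entries were removed.
  static int prune(Map<String, Integer> map, int threshold) {
    int removed = 0;
    var iterator = map.entrySet().iterator();
    while (iterator.hasNext()) {
      var entry = iterator.next();
      if (entry.getValue() < threshold) {
        iterator.remove(); // supported by ConcurrentHashMap's view iterators
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    Map<String, Integer> cache = new ConcurrentHashMap<>();
    cache.put("a", 1);
    cache.put("b", 5);
    cache.put("c", 2);
    int removed = prune(cache, 3);
    System.out.println(removed + " removed, remaining keys: " + cache.keySet());
  }
}
```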

Comment on lines 135 to 141
public void start(ConfigurationService configurationService) {
  if (!started) {
    this.configurationService = configurationService; // used to lazy init workflow executor
    this.cachingExecutorService = Executors.newCachedThreadPool();
    this.scheduledExecutorService = Executors.newScheduledThreadPool(0);
    this.executor = new InstrumentedExecutorService(configurationService.getExecutorService());
    started = true;

Copilot AI Mar 10, 2026

scheduledExecutorService is initialized with Executors.newScheduledThreadPool(0), which creates a scheduler with 0 core threads; scheduled tasks won’t run. This also isn’t included in stop() shutdown, so it would leak threads once fixed. Use a scheduler with at least 1 thread (or a shared scheduler) and ensure it’s shut down in stop() alongside the other executors.

Collaborator Author

This is not true.
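
The reply presumably targets the claim that scheduled tasks would not run. As a quick way to check that claim in isolation, here is a minimal sketch (names are hypothetical) that schedules one task on `Executors.newScheduledThreadPool(0)` and waits for it; a scheduler with zero core threads is expected to start a worker on demand when a task is submitted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ZeroCoreSchedulerSketch {
  // Returns true if a task scheduled on a zero-core-thread scheduler
  // runs within the timeout.
  static boolean runsScheduledTask() {
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(0);
    try {
      CountDownLatch ran = new CountDownLatch(1);
      scheduler.schedule(ran::countDown, 10, TimeUnit.MILLISECONDS);
      return ran.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("task ran: " + runsScheduledTask());
  }
}
```

This says nothing about the second half of the review comment (shutting the scheduler down in `stop()`), which is a separate question.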

Comment on lines +146 to +149
* For read-cache-after-write consistency there are some corner cases where we need to check the
* caches see {@link TemporaryResourceCache#checkObsoleteResources()} periodically. This is the
* period in milliseconds. Applicable only if {@link #comparableResourceVersions()}} is true.
*

Copilot AI Mar 10, 2026

Javadoc has a mismatched brace in the link ({@link #comparableResourceVersions()}}) which will render incorrectly and can produce Javadoc warnings. Also, linking to TemporaryResourceCache#checkObsoleteResources() references a private method, which may not be linkable in generated docs; consider linking to the class or describing the behavior without referencing the private method.

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.


Comment on lines +233 to +242
private void checkGhostResources() {
  log.debug("Checking for ghost resources.");
  var iterator = cache.entrySet().iterator();
  while (iterator.hasNext()) {
    var e = iterator.next();
    if (ReconcilerUtilsInternal.compareResourceVersions(
                e.getValue().getMetadata().getResourceVersion(),
                getLatestResourceVersion(e.getValue().getMetadata().getNamespace()))
            < 0
        // making sure we have the situation where resource is missing from the cache

Copilot AI Mar 10, 2026

checkGhostResources() calls compareResourceVersions(..., getLatestResourceVersion(...)) without handling the case where lastSyncResourceVersion() is null (e.g., informer not synced yet) or throws. This can cause a NullPointerException/NoSuchElementException and will cancel the periodic task (ScheduledExecutorService suppresses subsequent executions after an exception). Guard against null/absent latest RV and catch/log exceptions so the periodic cleanup keeps running.
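
The scheduling behavior referred to here is documented for `ScheduledExecutorService`: if an execution of a periodic task throws, subsequent executions are suppressed. A minimal sketch of the guard being suggested, using a hypothetical `guarded` wrapper rather than anything from the PR:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class GuardedPeriodicTask {
  // Wraps a task so that a runtime exception is reported instead of reaching
  // the scheduler, which would otherwise stop rescheduling the task.
  static Runnable guarded(Runnable task) {
    return () -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        System.err.println("Periodic check failed, will retry on next run: " + e);
      }
    };
  }

  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    AtomicInteger runs = new AtomicInteger();
    // The first run fails (simulating an informer that has not synced yet);
    // later runs still happen because the exception never reaches the scheduler.
    Runnable check =
        () -> {
          if (runs.incrementAndGet() == 1) {
            throw new IllegalStateException("last sync resource version not available yet");
          }
        };
    scheduler.scheduleWithFixedDelay(guarded(check), 0, 10, TimeUnit.MILLISECONDS);
    Thread.sleep(200);
    scheduler.shutdownNow();
    System.out.println("runs: " + runs.get());
  }
}
```

Whether to skip the check explicitly when no synced version is available, or to rely only on the catch block, is a design choice; the wrapper only guarantees that the periodic cleanup keeps being scheduled.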

csviri added 2 commits March 10, 2026 14:40
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.


csviri added 3 commits March 10, 2026 15:14
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Development

Successfully merging this pull request may close these issues.

Edge case for instant delete of a new resource when informer down

3 participants