feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154

Open
csviri wants to merge 67 commits intooperator-framework:nextfrom
csviri:otel-metrics-grafana

Conversation

@csviri
Collaborator

@csviri csviri commented Feb 4, 2026

The goal of this PR is to provide an OTel + Prometheus + Grafana setup, so that we:

  1. verify the integration with OTel
  2. provide a default Grafana dashboard for metrics
  3. add a new sample operator and an E2E test that verify the OTel + Prometheus + Grafana handling. Users should be able to run this easily, so they can check and validate the dashboard.
  4. provide new metrics implementations

Notes on the new metrics implementation:

  • only static gauges (they live as long as the operator does), which makes them much easier to manage
  • no manual counter removal (this should not be done)
  • names and labels sanitized according to best practices
  • added a metrics counter for last-attempt retries

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2026
@csviri csviri changed the title OTel+Prometheus sample and Graphana dashboard [WIP] OTel+Prometheus sample and Graphana dashboard Feb 4, 2026
@csviri csviri changed the title [WIP] OTel+Prometheus sample and Graphana dashboard [WIP] OTel+Prometheus sample and Grfana dashboard Feb 8, 2026
@csviri csviri changed the title [WIP] OTel+Prometheus sample and Grfana dashboard [WIP] OTel+Prometheus sample and Grafana dashboard Feb 8, 2026
@csviri csviri force-pushed the otel-metrics-grafana branch from c7e6ca2 to ece63e8 Compare February 8, 2026 15:25
@csviri
Collaborator Author

csviri commented Feb 9, 2026

JVM metrics:

image image

@csviri
Collaborator Author

csviri commented Feb 9, 2026

JOSDK metrics:

image

Added TODOs to improve those; for example, the controller name should be a tag rather than a suffix of the metric name:

operator_sdk_reconciliations_executions_webpagestandalonedependentsreconciler
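
A minimal plain-Java sketch of the two naming schemes, to make the difference concrete. `MeterKey`, `suffixStyle` and `tagStyle` are hypothetical names used only for illustration, not JOSDK or Micrometer API; the `controller.name` tag key is the one used in the metrics table of this PR.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a meter identity: a name plus key/value tags.
record MeterKey(String name, Map<String, String> tags) {}

class MeterNamingSketch {

  // Suffix style: every controller ends up with its own metric name.
  static MeterKey suffixStyle(String base, String controllerName) {
    return new MeterKey(base + "." + controllerName.toLowerCase(Locale.ROOT), Map.of());
  }

  // Tag style: one metric name, the controller becomes a dimension.
  static MeterKey tagStyle(String base, String controllerName) {
    return new MeterKey(
        base, Map.of("controller.name", controllerName.toLowerCase(Locale.ROOT)));
  }
}
```

With the tag style, a dashboard can aggregate or filter across controllers using a single metric name, instead of matching one name per controller.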

@csviri csviri linked an issue Feb 10, 2026 that may be closed by this pull request
@csviri csviri requested review from metacosm and xstefank February 10, 2026 11:50
@csviri csviri changed the title [WIP] OTel+Prometheus sample and Grafana dashboard [WIP] MetricsV2 + OTel+Prometheus sample and Grafana dashboard Feb 11, 2026
@csviri csviri force-pushed the otel-metrics-grafana branch 2 times, most recently from 88c118c to cf9eb57 Compare February 21, 2026 12:06
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2026
@csviri csviri force-pushed the otel-metrics-grafana branch from 24f992f to 0438611 Compare February 27, 2026 09:52
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2026
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2026
@csviri csviri force-pushed the otel-metrics-grafana branch from 9014b2b to fe4ab6a Compare March 1, 2026 09:51
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2026
@csviri csviri changed the title [WIP] MetricsV2 + OTel+Prometheus sample and Grafana dashboard MetricsV2 + OTel+Prometheus sample and Grafana dashboard Mar 1, 2026
@csviri csviri self-assigned this Mar 1, 2026
@csviri csviri marked this pull request as ready for review March 1, 2026 10:45
Copilot AI review requested due to automatic review settings March 1, 2026 10:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 1, 2026
csviri and others added 13 commits March 3, 2026 08:15
…torsdk/operator/sample/metrics/MetricsHandlingSampleOperator.java
…tor/api/monitoring/Metrics.java
…operatorsdk/operator/sample/deployment.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri force-pushed the otel-metrics-grafana branch from d60193a to cf114e3 Compare March 3, 2026 07:15
| Meter name (Micrometer) | Type | Tags | Description |
|------------------------------------------|---------|-----------------------------------|----------------------------------------------------------------------|
| `reconciliations.executions` | gauge | `controller.name` | Number of reconciler executions currently in progress |
| `reconciliations.active` | gauge | `controller.name` | Number of resources currently queued for reconciliation |
Collaborator

is "active" really a good name for this?

Collaborator Author

see below.

private static final String RECONCILIATIONS_STARTED = RECONCILIATIONS + "started" + TOTAL_SUFFIX;

private static final String RECONCILIATIONS_EXECUTIONS_GAUGE = RECONCILIATIONS + "executions";
private static final String RECONCILIATIONS_QUEUE_SIZE_GAUGE = RECONCILIATIONS + "active";
Collaborator

queue would be a better name

Collaborator Author

The thing is that queue is a bit misleading, because this is actually the queue + ongoing reconciliations; that is the reason I added active. (ExecutorService has a queue.) I can expand the docs on this.

Collaborator Author

@csviri csviri Mar 3, 2026

Actually, I made an adjustment, so we now have active and queue: queue is the number of reconciliations that have been submitted to the executor service but have not started yet, and active is the number of threads actually running reconciliation logic. This seems more intuitive and useful.
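
A minimal sketch of these semantics in plain Java, assuming two counters maintained around the submitted task. `ReconciliationGaugesSketch` and its members are hypothetical and not the code in this PR; it accepts any `Executor`, so an `ExecutorService` can be passed as well.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: tracks "queue" and "active" as two separate values.
class ReconciliationGaugesSketch {
  final AtomicInteger queued = new AtomicInteger(); // submitted, not yet started
  final AtomicInteger active = new AtomicInteger(); // currently running on a thread

  void submit(Executor executor, Runnable reconciliation) {
    queued.incrementAndGet();
    try {
      executor.execute(() -> {
        // The task leaves the queue the moment a thread picks it up.
        queued.decrementAndGet();
        active.incrementAndGet();
        try {
          reconciliation.run();
        } finally {
          active.decrementAndGet();
        }
      });
    } catch (RejectedExecutionException e) {
      // The task was never accepted, so it must not be counted as queued.
      queued.decrementAndGet();
      throw e;
    }
  }
}
```

A gauge would then simply read `queued.get()` and `active.get()`. Between the two counter updates a reader may briefly see the task in neither value, which should be acceptable for monitoring purposes.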

csviri and others added 7 commits March 3, 2026 15:28
- deleted redundant delete events
- adjusted metrics: now we have queue and active reconciliations (with the intuitive semantics)

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>

## Metrics interface changes
Collaborator

This would technically be an API break and would require a new major version.

Collaborator Author

Strictly speaking yes, but we do make such minor API changes sometimes (see the migration document), as other frameworks sometimes do. I think it is basically the better tradeoff: we don't really want to increase the major version that often, and on the other hand we have quite a number of APIs, which are sometimes better evolved this way, IMO.

I also tried to make this backwards compatible, and we still could. But in the end it looked like that would be more confusing than just having a table that makes it easy to migrate from the current implementation, if that makes sense.


# Check if helm is installed, download locally if not
echo -e "\n${YELLOW}Checking helm installation...${NC}"
if ! command -v helm &> /dev/null; then
Collaborator

Installing helm should be made optional. People might not want to have helm automatically installed or want to install it themselves some other way.

Collaborator Author

@csviri csviri Mar 4, 2026

What this does: it checks whether the helm command works (i.e. helm is already installed); if not, it downloads helm as a binary, but only locally into the project directory (so it does not install it), and uses that binary directly. So this is not very invasive in the end, IMO; this is how it is usually done in such scripts.


\* `namespace` tag is only included when `withNamespaceAsTag()` is enabled.

The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
Collaborator

The SLO acronym should be expanded on first use.

Collaborator Author

I consider SLO a well-known term, but actually we can just remove it from here. Done.

if (resourceEvent.getAction() == ResourceAction.ADDED) {
gauges.get(numberOfResourcesRefName(getControllerName(metadata))).incrementAndGet();
}
var namespace = resourceEvent.getRelatedCustomResourceID().getNamespace().orElse(null);
Collaborator

Shouldn't that be in sync with what's done for MDC, i.e. use a default value instead of null for clustered resources?

Collaborator Author

This way we end up not adding the namespace tag at all (if it is null), which is in line with the guidelines.
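
A minimal sketch of that behavior, assuming the namespace arrives as an `Optional<String>` as in the diff above. `SketchTag` is a hypothetical stand-in for a metrics tag (not Micrometer's `Tag`), and `tagsFor` is an illustrative helper, not a method from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a metrics tag.
record SketchTag(String key, String value) {}

class NamespaceTagSketch {

  static List<SketchTag> tagsFor(String controllerName, Optional<String> namespace) {
    List<SketchTag> tags = new ArrayList<>(2);
    tags.add(new SketchTag("controller.name", controllerName));
    // Cluster-scoped resources have no namespace: skip the tag
    // instead of emitting a placeholder value.
    namespace.ifPresent(ns -> tags.add(new SketchTag("namespace", ns)));
    return tags;
  }
}
```

For a cluster-scoped resource this yields only the `controller.name` tag, so no artificial namespace value shows up in queries.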

Collaborator

Then I'd argue that MDC should follow the same guidelines, otherwise it's just confusing to have telemetry report one thing while MDC does something else, especially if you're trying to correlate logs and telemetry.

Collaborator Author

I don't see why we should unify those completely unrelated systems.


private void incrementCounter(
String counterName, String namespace, Map<String, Object> metadata, Tag... additionalTags) {
final var tags = new ArrayList<Tag>(2 + additionalTags.length);
Collaborator

This shouldn't be a List but a Set.

Collaborator Author

That is really an implementation detail IMO

Collaborator Author

Hmm, I was thinking about this, and actually it should not be a Set: what we really want is just a list of Tags in memory that we later iterate through. A Set might create secondary data structures, like indexes for fast access, which we don't want in this case. So using a list is the better fit, both for memory and as a mental model, IMO :)

Collaborator

We don't want duplicated tags, nor do we care about their order; that's what sets are for, and that matches what people would expect. As a user, I don't want duplicated tags to show up, which is a possibility with the current implementation.

Collaborator Author

If you take a look at the Micrometer API:

    public Counter counter(String name, Iterable<Tag> tags) {
        return this.counter(name, Tags.of(tags));
    }

    public Counter counter(String name, String... tags) {
        return this.counter(name, Tags.of(tags));
    }

    public Counter counter(String name, Tags tags) {
        return this.counter(new Meter.Id(name, tags, (String) null, (String) null, Type.COUNTER));
    }

It does not work with Sets either.

Sets usually maintain an additional index in memory, and we don't want that memory overhead. Also, filtering out a duplicated additional tag should not be our responsibility; that is a usage error.

Collaborator

I'm fine with using Iterable. The semantics should not be a list, though, because this implies order, which does not exist.

Collaborator Author

They also have an API that takes an array of tags, which implies order... I don't think that is a bad approach. The other way to look at this is that we just want to have a list of elements in memory (not a linked list, not a set, since those have memory overhead).

Is there order? yes
Do we care about the order? no

Collaborator Author

But I will take a look at how it would look with Iterable.

Collaborator Author

Hmm, I checked: those are two-line methods in internal APIs, and there is no memory overhead. This is completely fine this way.
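
The duplicate-tag concern can also be handled independently of the collection type that is passed around. Below is a minimal plain-Java sketch, assuming tags are modeled as key/value `Map.Entry` pairs (a stand-in, not Micrometer's `Tag`) and that the last value given for a key should win; `TagDedupSketch` and `dedupeByKey` are hypothetical names. Micrometer's `Tags` Javadoc describes that type as sorted and deduplicated by tag key; if that holds for the version in use, a step like this would be redundant for tags that pass through `Tags.of(...)`, which is worth verifying.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TagDedupSketch {

  // Collapses tags that share a key; the value seen last replaces earlier ones.
  static List<Map.Entry<String, String>> dedupeByKey(List<Map.Entry<String, String>> tags) {
    Map<String, String> byKey = new LinkedHashMap<>();
    for (Map.Entry<String, String> tag : tags) {
      byKey.put(tag.getKey(), tag.getValue());
    }
    // Copy into immutable entries so the result is detached from the temporary map.
    List<Map.Entry<String, String>> result = new ArrayList<>(byKey.size());
    for (Map.Entry<String, String> e : byKey.entrySet()) {
      result.add(Map.entry(e.getKey(), e.getValue()));
    }
    return result;
  }
}
```

The temporary map exists only for the duration of the call, so the caller can keep working with a plain list.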


int retryNumber = retryInfo.map(RetryInfo::getAttemptCount).orElse(0);
if (retryNumber > 0) {
incrementCounter(RECONCILIATIONS_RETRIES_NUMBER, namespace, metadata);
Collaborator

How does the retry number get propagated? Seems like the counter that is incremented is completely decorrelated from the actual retry number?

Collaborator Author

The idea is that we measure the number of retries, so we increase this counter whenever the current reconciliation is a retry. What is interesting in the metrics in the end is the rate of retries.
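
A minimal sketch of that counting rule in plain Java, assuming the attempt count is absent or 0 for a first attempt and greater than 0 for a retry, as in the diff above. `RetryCounterSketch` and `onReconciliationStarted` are hypothetical names, and `LongAdder` stands in for the real metrics counter.

```java
import java.util.Optional;
import java.util.concurrent.atomic.LongAdder;

class RetryCounterSketch {
  final LongAdder retries = new LongAdder();

  // attemptCount: empty or 0 for a first attempt, greater than 0 for a retry.
  void onReconciliationStarted(Optional<Integer> attemptCount) {
    if (attemptCount.orElse(0) > 0) {
      // One increment per retried execution; the attempt number itself is not recorded.
      retries.increment();
    }
  }
}
```

Because the counter only ever grows, the monitoring backend can derive a retry rate from it over any time window.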

metacosm and others added 3 commits March 4, 2026 11:40
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri requested a review from metacosm March 4, 2026 20:25
*/
public static String getDefaultPluralFor(String kind) {
// todo: replace by Fabric8 version when available, see
// replace by Fabric8 version when available, see
Collaborator

Why was the todo removed? This still needs to be done.

Collaborator Author

Ahh, it should not have been removed in this PR; I will put it back.

I wanted to remove the TODOs for which we already have issues, so that when I go through the TODOs I see just the relevant ones. In general it is good practice to remove all TODOs from the code before merging and to create issues instead, since they pollute the code.

Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
@csviri csviri requested a review from metacosm March 5, 2026 10:28

5 participants