feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154
feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154csviri wants to merge 67 commits intooperator-framework:nextfrom
Conversation
c7e6ca2 to
ece63e8
Compare
88c118c to
cf9eb57
Compare
24f992f to
0438611
Compare
9014b2b to
fe4ab6a
Compare
…torsdk/operator/sample/metrics/MetricsHandlingSampleOperator.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…tor/api/monitoring/Metrics.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…operatorsdk/operator/sample/deployment.yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
d60193a to
cf114e3
Compare
| | Meter name (Micrometer) | Type | Tags | Description | | ||
| |------------------------------------------|---------|-----------------------------------|----------------------------------------------------------------------| | ||
| | `reconciliations.executions` | gauge | `controller.name` | Number of reconciler executions currently in progress | | ||
| | `reconciliations.active` | gauge | `controller.name` | Number of resources currently queued for reconciliation | |
There was a problem hiding this comment.
is "active" really a good name for this?
| private static final String RECONCILIATIONS_STARTED = RECONCILIATIONS + "started" + TOTAL_SUFFIX; | ||
|
|
||
| private static final String RECONCILIATIONS_EXECUTIONS_GAUGE = RECONCILIATIONS + "executions"; | ||
| private static final String RECONCILIATIONS_QUEUE_SIZE_GAUGE = RECONCILIATIONS + "active"; |
There was a problem hiding this comment.
queue would be better name
There was a problem hiding this comment.
The thing is that queue is a bit misleading, because this is actually the queue + ongoing reconiliations, that is the reason I added active. (ExecutorService has a queue). Can expand the docs on this
There was a problem hiding this comment.
Actually made an adjustment so we have now active and queue where queue is that are submitted to executor service but not started yet, the active is the number of actually running threads with reconiliation logic. This seems to be more intuitive and useful.
- deleted redundant delete events - adjusted metrics now we have queue and active reconciliations (with the intuitive seantics) Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
| ``` No newline at end of file | ||
| ``` | ||
|
|
||
| ## Metrics interface changes |
There was a problem hiding this comment.
This would technically be an API break and would require a new major version.
There was a problem hiding this comment.
Strinctly sepaking yes, but such minor API changes we do some times, see the migration document. As other frameworks sometimes. It is basically I think a better choice in terms of tradeoff, since because we don't really want to increase the major verion that often and we on the other hand we have quite an amount of APIs, that sometimes better to evolve this way IMO.
I also was trying to do backwards compatible, we still could. But at the end it looked like that it would be more confusing, that just having a table to be able to easily migrate from current impl. If that makes sense.
|
|
||
| # Check if helm is installed, download locally if not | ||
| echo -e "\n${YELLOW}Checking helm installation...${NC}" | ||
| if ! command -v helm &> /dev/null; then |
There was a problem hiding this comment.
Installing helm should be made optional. People might not want to have helm automatically installed or want to install it themselves some other way.
There was a problem hiding this comment.
What this does, it checks if helm command works (so already installed), if not, downloads it as a binary, but just locally in the project dir (so does not install it), and uses that binary directly. So this is not very invasive at the end IMO, usually it is done this way in such scripts.
...ort/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java
Outdated
Show resolved
Hide resolved
|
|
||
| \* `namespace` tag is only included when `withNamespaceAsTag()` is enabled. | ||
|
|
||
| The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure |
There was a problem hiding this comment.
SLO acronym should be used expanded first before.
There was a problem hiding this comment.
SLO I consider a well known term, but actually we can just remote that from here. done.
| if (resourceEvent.getAction() == ResourceAction.ADDED) { | ||
| gauges.get(numberOfResourcesRefName(getControllerName(metadata))).incrementAndGet(); | ||
| } | ||
| var namespace = resourceEvent.getRelatedCustomResourceID().getNamespace().orElse(null); |
There was a problem hiding this comment.
Shouldn't that be in sync with what's done for MDC, i.e. use a default value instead of null for clustered resources?
There was a problem hiding this comment.
This way at the end we don't add the namespace tag (if null), that is according the guidelines.
There was a problem hiding this comment.
Then I'd argue that MDC should follow the same guidelines, otherwise it's just confusing to have telemetry report one thing while MDC does something else, especially if you're trying to correlate logs and telemetry.
There was a problem hiding this comment.
I don't see why should we unify those completely unrelated systems.
|
|
||
| private void incrementCounter( | ||
| String counterName, String namespace, Map<String, Object> metadata, Tag... additionalTags) { | ||
| final var tags = new ArrayList<Tag>(2 + additionalTags.length); |
There was a problem hiding this comment.
This shouldn't be a List but a Set.
There was a problem hiding this comment.
That is really an implementation detail IMO
There was a problem hiding this comment.
Hmm, was thinking about this, actually, should not be a Set: basically, what we really want is just a list of Tags in memory, that we later iterate through. While Set might create some secondary data structures like indexes for fast access, what we don't want in this case. So using list is a better both memory and mental model IMO :)
There was a problem hiding this comment.
We don't want duplicated tags, nor do we care about their order, that's what set are for and that matches what people would expect. As a user, I don't want duplicated tags to show up, which is a possibility with the current implementation.
There was a problem hiding this comment.
If you take the look on Micrometer API:
public Counter counter(String name, Iterable<Tag> tags) {
return this.counter(name, Tags.of(tags));
}
public Counter counter(String name, String... tags) {
return this.counter(name, Tags.of(tags));
}
public Counter counter(String name, Tags tags) {
return this.counter(new Meter.Id(name, tags, (String)null, (String)null, Type.COUNTER));
}Does not work with Sets either.
Set's usually maintains an addiitonal index in memory, we don't want to have that memory overhead. While adding an additional tag should not be our responsibility filter out, that is an usage error.
There was a problem hiding this comment.
I'm fine with using Iterable. The semantics should not be a list, though, because this implies order, which does not exist.
There was a problem hiding this comment.
They have also an array of tags api, that implies order... I don't think that is a bad approach; the other way to look at this, is that we just want to have a list of elemnts in memory (not a linked list, not a set, since those have memory overhead).
Is there order? yes
Do we care about the order? no
There was a problem hiding this comment.
But will take a look how it would look like with iterable.
There was a problem hiding this comment.
Hmm, checking, but those are 2 liner methods, internal APIs, there is no memory overhead. This is completely fine this way.
...ort/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java
Outdated
Show resolved
Hide resolved
|
|
||
| int retryNumber = retryInfo.map(RetryInfo::getAttemptCount).orElse(0); | ||
| if (retryNumber > 0) { | ||
| incrementCounter(RECONCILIATIONS_RETRIES_NUMBER, namespace, metadata); |
There was a problem hiding this comment.
How does the retry number get propagated? Seems like the counter that is incremented is completely decorrelated from the actual retry number?
There was a problem hiding this comment.
The idea is that we measure the number of retries, so we increase this number if the actual reconiliation is a retry. What is interesting at the end in the metrics is the rate of the retries.
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
| */ | ||
| public static String getDefaultPluralFor(String kind) { | ||
| // todo: replace by Fabric8 version when available, see | ||
| // replace by Fabric8 version when available, see |
There was a problem hiding this comment.
Why was the todo removed? This still needs to be done.
There was a problem hiding this comment.
Ahh should not be removed in this PR, will put it back.
Wanted to remove these TODOs for which we already have issues, so that when go through TODOs I see just relevant, in general it is a good practice to remove all the TODO's from the code before we merge and rather create issue, since it pollutes the code.



Goal of this PR is to provide a OTel + Prometheus + Grafana setup. So we:
Notes on new metrics implementation: