metrics,grafana: add ticdc_arch to build info #4266

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:master from wlwilliamx:metrics/add-ticdc-arch-build-info on Jun 8, 2026
Conversation

@wlwilliamx

@wlwilliamx wlwilliamx commented Feb 25, 2026

Collaborator

What problem does this PR solve?

Issue Number: close #4265

What is changed and how it works?

  • Add ticdc_arch label to ticdc_server_build_info.
  • Set ticdc_arch="newarch" when TiCDC runs in new architecture mode.
  • Update the Grafana "Build Info" panel to show ticdc_arch. If an instance does not expose ticdc_server_build_info (old arch), the panel falls back to ticdc_server_etcd_health_check_duration_count and shows ticdc_arch="oldarch".
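
The fallback described in the last bullet can be sketched in plain Go as a set operation: rows come from the build info metric first, and an instance that only exposes the health check counter is added as `oldarch`. This is a minimal model of the panel's behaviour under stated assumptions, not the dashboard query itself; the `Row` type, the `buildInfoRows` helper, and the instance names and version string are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is one line of a hypothetical "Build Info" table.
type Row struct {
	Instance       string
	Arch           string
	ReleaseVersion string
}

// buildInfoRows models "build info, or else the health check fallback".
// buildInfo maps instance -> release version for instances that expose
// ticdc_server_build_info; healthCheck lists instances that expose
// ticdc_server_etcd_health_check_duration_count (possibly more than once).
func buildInfoRows(buildInfo map[string]string, healthCheck []string) []Row {
	rows := make([]Row, 0, len(buildInfo)+len(healthCheck))
	seen := make(map[string]bool, len(buildInfo))
	for inst, ver := range buildInfo {
		// In the real metric the value is read from the ticdc_arch label;
		// it is hard-coded here to keep the sketch short.
		rows = append(rows, Row{Instance: inst, Arch: "newarch", ReleaseVersion: ver})
		seen[inst] = true
	}
	for _, inst := range healthCheck {
		if seen[inst] {
			continue // already covered by build info or an earlier series
		}
		seen[inst] = true
		rows = append(rows, Row{Instance: inst, Arch: "oldarch", ReleaseVersion: "unknown"})
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i].Instance < rows[j].Instance })
	return rows
}

func main() {
	rows := buildInfoRows(
		map[string]string{"cdc-0:8300": "v0.0.0-example"}, // placeholder version
		[]string{"cdc-0:8300", "cdc-1:8300"},
	)
	for _, r := range rows {
		fmt.Printf("%s %s %s\n", r.Instance, r.Arch, r.ReleaseVersion)
	}
}
```

Here `cdc-0:8300` keeps its real build info, while `cdc-1:8300` appears only through the fallback with `unknown` placeholders.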

Check List

Tests

  • Manual test
[Screenshot: CleanShot 2026-02-24 at 19 23 10@2x]

Questions

Will it cause performance regression or break compatibility?

No. The change adds one constant label to an existing build info gauge metric and only affects dashboard queries.

Do you need to update user documentation, design documentation or monitoring documentation?

Monitoring dashboard is updated in this PR.

Release note

Expose TiCDC architecture mode (newarch/oldarch) in build info metric and dashboard.

Summary by CodeRabbit

  • New Features

    • Added an architecture dimension to build info metrics so builds can be identified and monitored by architecture.
  • Monitoring

    • Grafana dashboards updated to surface the new architecture field in Build Info panels, adjust field ordering, and include fallback aggregation to ensure architecture visibility across instances.

@ti-chi-bot ti-chi-bot Bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 25, 2026
@coderabbitai

coderabbitai Bot commented Feb 25, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60734374-293d-4ea0-b6af-a028f41d6a71

📥 Commits

Reviewing files that changed from the base of the PR and between 509a59c and a6f8051.

📒 Files selected for processing (3)
  • cmd/cdc/server/server.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
💤 Files with no reviewable changes (2)
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/cdc/server/server.go

Walkthrough

Adds a new ticdc_arch label to ticdc_server_build_info, sets it to "newarch" at server startup, and updates Grafana dashboard queries and organize transformations to include and surface the ticdc_arch field.

Changes

BuildInfo + Dashboards

| Layer / File(s) | Summary |
| --- | --- |
| Server instrumentation: cmd/cdc/server/server.go | Passes "newarch" as the new ticdc_arch label value when setting metrics.BuildInfo on startup. |
| Grafana (classic) query & transform: metrics/grafana/ticdc_new_arch.json | PromQL max by aggregation now includes ticdc_arch; added label_replace fallback branch to populate ticdc_arch; updated organize.options.indexByName to add ticdc_arch and shift existing indices. |
| Grafana (next-gen) query & transform: metrics/nextgengrafana/ticdc_new_arch_next_gen.json | Same PromQL and organize.indexByName changes as the classic dashboard: include ticdc_arch in max-by and add the new field index, shifting other field positions. |
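
To make the "shift existing indices" step concrete, here is a small sketch of inserting one field into an index-by-name mapping. It assumes the mapping is a plain field-to-column-position map; the `insertField` helper, the field layout, and the position chosen for ticdc_arch are hypothetical and may not match the real dashboard JSON.

```go
package main

import "fmt"

// insertField returns a copy of indexByName with name placed at position pos.
// Every field previously at pos or later moves one column to the right.
func insertField(indexByName map[string]int, name string, pos int) map[string]int {
	out := make(map[string]int, len(indexByName)+1)
	for field, idx := range indexByName {
		if idx >= pos {
			idx++
		}
		out[field] = idx
	}
	out[name] = pos
	return out
}

func main() {
	// Hypothetical layout; the real panel's fields and order may differ.
	before := map[string]int{"instance": 0, "kernel_type": 1, "git_hash": 2}
	after := insertField(before, "ticdc_arch", 1)
	fmt.Println(after["instance"], after["ticdc_arch"], after["kernel_type"], after["git_hash"])
}
```

Forgetting the shift would leave two fields claiming the same column index, which is why the dashboards renumber the existing entries alongside adding the new one.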

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through code at break of dawn,
A tiny label now is born.
Prometheus scrapes, Grafana shows,
Newarch or old—now everyone knows!

sequenceDiagram
  participant TiCDC_Server as TiCDC Server
  participant Prometheus as Prometheus
  participant Grafana as Grafana
  TiCDC_Server->>Prometheus: expose metric `ticdc_server_build_info{..., ticdc_arch="newarch"}`
  Prometheus->>Grafana: scrape & store metrics
  Grafana->>Prometheus: query PromQL including `ticdc_arch` (max by instance,ticdc_arch,...)
  Grafana->>Grafana: organize transformation maps `ticdc_arch` into table columns
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'metrics,grafana: add ticdc_arch to build info' directly and concisely describes the main change: adding a ticdc_arch label to build info metrics and updating Grafana dashboards accordingly. |
| Description check | ✅ Passed | The PR description follows the template with all required sections: Issue Number linking to #4265, detailed explanation of changes, test evidence included, compatibility assessment, and release note provided. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #4265 are met: ticdc_arch label added to ticdc_server_build_info with newarch/oldarch values, Grafana dashboards updated with fallback for old-arch instances, and backward compatibility maintained. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the stated objectives: metric label additions, server code changes, and Grafana dashboard updates. No unrelated modifications detected in the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 25, 2026
@gemini-code-assist


Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly improves the observability of TiCDC instances by introducing a new metric label, ticdc_arch, which identifies the architecture mode (new or old) of a running TiCDC server. This information is then integrated into the Grafana monitoring dashboards, providing users with clearer insights into the operational state and architecture of their TiCDC deployments. The change ensures that even older instances are represented, enhancing overall monitoring consistency.

Highlights

  • Metric Enhancement: The ticdc_server_build_info metric now includes a ticdc_arch label to indicate whether TiCDC is running in 'newarch' or 'oldarch' mode.
  • Grafana Dashboard Update: The 'Build Info' panel in Grafana dashboards has been updated to display the ticdc_arch label. It also includes fallback logic to infer 'oldarch' for instances that do not expose the new ticdc_server_build_info metric.
  • Code Instrumentation: The TiCDC server initialization now explicitly sets the ticdc_arch label to 'newarch' when reporting build information.

Changelog
  • cmd/cdc/server/server.go
    • Added the ticdc_arch label with the value "newarch" to the metrics.BuildInfo metric when reporting build information.
  • metrics/grafana/ticdc_new_arch.json
    • Modified the Prometheus query for the "Build Info" panel to include ticdc_arch in the max by clause.
    • Added an or clause to the query to provide fallback build info for instances that do not expose ticdc_server_build_info, labeling them as ticdc_arch="oldarch".
    • Adjusted the indexByName mapping to account for the new ticdc_arch label.
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
    • Modified the Prometheus query for the "Build Info" panel to include ticdc_arch in the max by clause.
    • Added an or clause to the query to provide fallback build info for instances that do not expose ticdc_server_build_info, labeling them as ticdc_arch="oldarch".
    • Adjusted the indexByName mapping to account for the new ticdc_arch label.
  • pkg/metrics/server.go
    • Updated the BuildInfo Prometheus GaugeVec definition to include "ticdc_arch" as a new label.
    • Updated the comment for BuildInfo to reflect the inclusion of architecture mode.
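
Because label values are passed positionally, adding a label to the gauge definition means every call site must supply one more value in the matching position. The sketch below illustrates that pairing with the standard library only. The label order in `buildInfoLabels` is an assumption inferred from the call shown in this PR, `formatBuildInfo` is a hypothetical helper, and `%q` only approximates Prometheus label escaping for simple values.

```go
package main

import (
	"fmt"
	"strings"
)

// buildInfoLabels is the assumed label order for the build info gauge.
var buildInfoLabels = []string{
	"release_version", "git_hash", "utc_build_time", "kernel_type", "ticdc_arch",
}

// formatBuildInfo pairs positional values with label names and renders a
// text-exposition-style sample line. It rejects a wrong number of values,
// which is the mistake a newly added label makes easy at old call sites.
func formatBuildInfo(values ...string) (string, error) {
	if len(values) != len(buildInfoLabels) {
		return "", fmt.Errorf("expected %d label values, got %d", len(buildInfoLabels), len(values))
	}
	pairs := make([]string, len(values))
	for i, v := range values {
		pairs[i] = fmt.Sprintf("%s=%q", buildInfoLabels[i], v)
	}
	return "ticdc_server_build_info{" + strings.Join(pairs, ",") + "} 1", nil
}

func main() {
	// All values below are placeholders, not real build data.
	line, err := formatBuildInfo("v0.0.0-example", "abc123", "2026-01-01 00:00:00", "example-kernel", "newarch")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```

In the Prometheus Go client, `WithLabelValues` panics when the number of values does not match the label names, so a call site that was not updated fails at startup instead of reporting wrong data.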

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a ticdc_arch label to the ticdc_server_build_info metric to distinguish between new and old architecture TiCDC instances. The changes correctly update the metric definition and set the label value. The Grafana dashboards are also updated to display this new information, with a fallback mechanism for older instances that don't report this metric. While the implementation is functionally correct, the PromQL queries in the Grafana dashboards have become quite complex and hard to maintain. I've added specific comments with suggestions to improve the maintainability of these queries.

"targets": [
{
-"expr": "max by (instance, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",
+"expr": "max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}) or (max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (label_replace(label_replace(label_replace(label_replace(label_replace(ticdc_server_etcd_health_check_duration_count{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}, \"ticdc_arch\", \"oldarch\", \"instance\", \".*\"), \"kernel_type\", \"unknown\", \"instance\", \".*\"), \"git_hash\", \"unknown\", \"instance\", \".*\"), \"release_version\", \"unknown\", \"instance\", \".*\"), \"utc_build_time\", \"unknown\", \"instance\", \".*\"))) unless on (instance) ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",

medium

The new PromQL query is functionally correct for providing a fallback for older TiCDC versions. However, its length and the deeply nested label_replace calls make it difficult to read and maintain.

For improved readability, you could consider using a recording rule in Prometheus to pre-calculate the fallback metric with the necessary labels. This would simplify the dashboard query significantly.

If a recording rule is not feasible, adding comments within the Grafana panel's description explaining the query logic would be helpful for future maintenance.
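
One way to keep the nested calls manageable, short of a recording rule, might be to generate the expression from a list of constant labels. The following is only a sketch of that idea and not something this PR does; the `label` type and the `withConstLabels` helper are hypothetical, and the selector in `main` is shortened for illustration.

```go
package main

import "fmt"

// label is a constant label to attach to every series of an expression.
type label struct{ name, value string }

// withConstLabels wraps expr in one label_replace call per label, using the
// same "match any instance" pattern as the dashboard query.
func withConstLabels(expr string, labels []label) string {
	for _, l := range labels {
		expr = fmt.Sprintf(`label_replace(%s, "%s", "%s", "instance", ".*")`, expr, l.name, l.value)
	}
	return expr
}

func main() {
	fallback := withConstLabels(
		`ticdc_server_etcd_health_check_duration_count{job=~".*ticdc.*"}`, // shortened selector
		[]label{
			{"ticdc_arch", "oldarch"},
			{"kernel_type", "unknown"},
			{"git_hash", "unknown"},
			{"release_version", "unknown"},
			{"utc_build_time", "unknown"},
		},
	)
	fmt.Println(fallback)
}
```

The first label in the list ends up innermost, which matches the order of the nested calls in the query above.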

"targets": [
{
-"expr": "max by (instance, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",
+"expr": "max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}) or (max by (instance, ticdc_arch, kernel_type, git_hash, release_version, utc_build_time) (label_replace(label_replace(label_replace(label_replace(label_replace(ticdc_server_etcd_health_check_duration_count{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}, \"ticdc_arch\", \"oldarch\", \"instance\", \".*\"), \"kernel_type\", \"unknown\", \"instance\", \".*\"), \"git_hash\", \"unknown\", \"instance\", \".*\"), \"release_version\", \"unknown\", \"instance\", \".*\"), \"utc_build_time\", \"unknown\", \"instance\", \".*\"))) unless on (instance) ticdc_server_build_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"})",

medium

Similar to the other dashboard file, this PromQL query is very long and complex due to the nested label_replace functions for fallback support. This impacts readability and maintainability.

Consider using a Prometheus recording rule to abstract away this complexity. A recording rule could generate a clean metric for old-architecture nodes, which would make this dashboard query much simpler.

If that's not an option, at least a comment in the panel description explaining the query would be beneficial.

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
cmd/cdc/server/server.go (1)

Line 126: Consider extracting "newarch" to a named constant.

The raw string literal "newarch" is embedded directly in the WithLabelValues call. If this value is ever referenced or compared elsewhere (e.g., in tests, dashboard query validation, or a future "oldarch" metric path), having it as an unshared literal creates a maintenance risk.

♻️ Proposed refactor

Define a constant — for example in pkg/metrics or a shared location — and reference it here:

+// In pkg/metrics/server.go (or a suitable shared constants file):
+const (
+    TiCDCArchNewArch = "newarch"
+    TiCDCArchOldArch = "oldarch"
+)
-metrics.BuildInfo.WithLabelValues(version.ReleaseVersion, version.GitHash, version.BuildTS, kerneltype.Name(), "newarch").Set(1)
+metrics.BuildInfo.WithLabelValues(version.ReleaseVersion, version.GitHash, version.BuildTS, kerneltype.Name(), metrics.TiCDCArchNewArch).Set(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cdc/server/server.go` at line 126, Extract the literal "newarch" into a
well-named constant (e.g., const ArchNew = "newarch") and use that constant in
the metrics.BuildInfo.WithLabelValues call to avoid magic strings; define the
constant in a shared place such as the pkg/metrics package (or another common
package used by cmd/cdc/server), then replace the direct literal in server.go
(the metrics.BuildInfo.WithLabelValues(...) invocation) with the constant
reference.
ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75e291b and 509a59c.

📒 Files selected for processing (4)
  • cmd/cdc/server/server.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • pkg/metrics/server.go

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 8, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 8, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 8, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 8, 2026


[LGTM Timeline notifier]

Timeline:

  • 2026-06-08 02:44:12.502716658 +0000 UTC m=+755153.573034058: ☑️ agreed by wk989898.
  • 2026-06-08 02:45:47.721391054 +0000 UTC m=+755248.791708464: ☑️ agreed by lidezhu.

@wlwilliamx

Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot merged commit 8dbf481 into pingcap:master Jun 8, 2026
26 checks passed
@wlwilliamx wlwilliamx added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jun 11, 2026
@ti-chi-bot

Member

In response to a cherrypick label: new pull request created to branch release-8.5: #5329.

Labels

approved lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

metrics: show TiCDC architecture mode in build info

4 participants