Commit 72dfcdc

prometheus monitoring
- prometheus metric - prometheus configuration file - prometheus metric retrieval - prometheus exporter - prometheus push gateway - prometheus alert manager - prometheus data storage - prometheus promQL fdfd
1 parent 96fdf1f commit 72dfcdc

2 files changed

+108
-0
lines changed

content/Tools/Prometheus.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
---
Author:
  - Xinyang YU
Author Profile:
  - https://linkedin.com/in/xinyang-yu
tags:
  - devops
Creation Date: 2024-11-06, 14:39
Last Date: 2024-11-07T16:51:48+08:00
References:
draft:
description:
---
## Abstract
---
- The **de facto standard monitoring tool** for **container and microservices infrastructure**

>[!question] Why do we need Prometheus?
> **Hundreds of interconnected processes** serve users in these environments. If one process fails, it can trigger a **cascade of failures in other processes**, making it **difficult to pinpoint the root cause** from the end user's perspective. Prometheus provides visibility into the applications and infrastructure behind these hundreds of processes, **enabling proactive identification of issues instead of reactive debugging**. With Prometheus, we can continuously monitor processes, receive alerts about crashes, and even configure alerts for predefined thresholds.

### Prometheus Metric
- **HELP attribute:** Provides a clear description of the metric's purpose
- **TYPE attribute:** Defines the metric's type. Prometheus supports four core metric types:
    - **Counter:** A cumulative metric that represents a single monotonically increasing counter. Its value can only increase or be reset to zero on restart. Examples: the number of requests served, tasks completed, or errors encountered
    - **Gauge:** A metric that represents a single numerical value that can arbitrarily go up and down. Examples: current memory usage, temperature, or the number of active processes
    - **Histogram:** Samples observations (like request durations or response sizes) and counts them in configurable buckets. It also provides the sum and the count of all observed values
    - **Summary:** Similar to a histogram, but it calculates quantiles on the client side and exposes them directly. Useful for tracking percentiles of observations like request latencies

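The HELP and TYPE attributes appear as comment lines in the plain-text format that a target exposes. The snippet below is a minimal sketch of that layout; `render_metric` and the two metric names are hypothetical and exist only for illustration.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric in the Prometheus text exposition format.

    `samples` maps a label string (for example 'method="GET"', or '' for
    no labels) to the current value of that series.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        label_part = f"{{{labels}}}" if labels else ""
        lines.append(f"{name}{label_part} {value}")
    return "\n".join(lines)


# Hypothetical metrics: a counter only ever grows, a gauge can go up and down.
counter_text = render_metric(
    "http_requests_total",
    "Total HTTP requests served.",
    "counter",
    {'method="GET"': 1027, 'method="POST"': 3},
)
gauge_text = render_metric(
    "memory_usage_bytes", "Current memory usage in bytes.", "gauge", {"": 52428800}
)

print(counter_text)
print(gauge_text)
```

In practice a client library produces this text for you; the sketch only shows what ends up on the wire.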
### Prometheus Configuration File
```yaml
# Defaults that apply to every scrape job unless the job overrides them
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Files containing the rules to evaluate
rule_files:
  - "first.rules"
  - "second.rules"

scrape_configs:
  # Prometheus scraping its own metrics
  - job_name: "prometheus"
    static_configs:
      - targets: ['localhost:9090']
  # This job overrides the global scrape interval
  - job_name: node_exporter
    scrape_interval: 1m
    scrape_timeout: 1m
    static_configs:
      - targets: ['localhost:9100']
```

- Under `global`, `scrape_interval: 15s` instructs the Prometheus server to **scrape target endpoints** every **15 seconds**. `evaluation_interval: 15s` evaluates the rules defined in `rule_files` every **15 seconds**; alerting rules whose conditions are met produce alerts that are sent to the [[#Prometheus Alert Manager|Prometheus Alert Manager]]
- Scrape jobs are defined under `scrape_configs`. Each job can **override global settings** such as `scrape_interval`

>[!important] Default setting for each scrape job
> Unless a job overrides them, each job uses default settings such as `metrics_path: "/metrics"` and `scheme: "http"`.

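The override behaviour described above can be sketched in a few lines of Python. This is an illustration, not how Prometheus itself is implemented: the configuration is mirrored as a plain dict, and `effective_settings` and `scrape_urls` are hypothetical helpers.

```python
# Per-job defaults named in the note.
JOB_DEFAULTS = {"metrics_path": "/metrics", "scheme": "http"}

# The configuration file, mirrored as a plain dict for illustration.
CONFIG = {
    "global": {"scrape_interval": "15s", "evaluation_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "prometheus",
            "static_configs": [{"targets": ["localhost:9090"]}],
        },
        {
            "job_name": "node_exporter",
            "scrape_interval": "1m",
            "scrape_timeout": "1m",
            "static_configs": [{"targets": ["localhost:9100"]}],
        },
    ],
}


def find_job(config, job_name):
    """Return the scrape job with the given name."""
    return next(j for j in config["scrape_configs"] if j["job_name"] == job_name)


def effective_settings(config, job_name):
    """Merge defaults, the global scrape interval, and job-level overrides."""
    job = find_job(config, job_name)
    settings = dict(JOB_DEFAULTS)
    settings["scrape_interval"] = config["global"]["scrape_interval"]
    # Job-level keys win over the defaults and the global values.
    overrides = {
        k: v for k, v in job.items() if k not in ("job_name", "static_configs")
    }
    settings.update(overrides)
    return settings


def scrape_urls(config, job_name):
    """Build the URL that would be scraped for every static target of a job."""
    job = find_job(config, job_name)
    settings = effective_settings(config, job_name)
    return [
        f"{settings['scheme']}://{target}{settings['metrics_path']}"
        for group in job["static_configs"]
        for target in group["targets"]
    ]


print(effective_settings(CONFIG, "prometheus"))
print(scrape_urls(CONFIG, "node_exporter"))
```

The `prometheus` job inherits the 15-second global interval, while `node_exporter` keeps its own one-minute interval; both fall back to `http` and `/metrics`.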
## Prometheus Architecture
---
![[prometheus_architecture.gif]]

### Prometheus Metric Retrieval
- The Prometheus server has a **metrics retrieval** component that **pulls data** from different **targets**, which are essentially various processes. This data consists of **units** such as CPU status, exception count, and request count. A unit monitored by Prometheus is called a metric (see [[#Prometheus Metric]]). The metrics are then stored in the [[#Prometheus Data Storage|time-series database (TSDB)]]
- Prometheus also includes an HTTP server component that accepts PromQL queries, enabling integration with visualisation tools like Grafana

>[!important] A pulling system
> Common monitoring services like AWS CloudWatch and [[Datadog]] use a **pushing** approach, where every process pushes its metrics to a centralized collection platform. This can create a **high load of network traffic**, and the monitoring platform can become a bottleneck. Additionally, a daemon has to be installed alongside every process to push metrics, whereas Prometheus only needs a **scraping endpoint**. The pulling approach is also a convenient way to check whether a process is up.

>[!question] How are Prometheus metrics pulled?
> Prometheus metrics are pulled by accessing an **HTTP endpoint on the target**. This endpoint, typically `/metrics`, must expose data in a **format that Prometheus understands**. If the target doesn't have a compatible endpoint, a [[#Prometheus Exporter]] can be used to bridge the gap.

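A single pull can be sketched as an HTTP GET followed by parsing the returned text. The code below is a simplified illustration using only the Python standard library: `parse_samples` and `scrape` are hypothetical names, and the parser assumes plain `name value` lines without the optional trailing timestamp.

```python
from urllib.request import urlopen


def parse_samples(text):
    """Parse 'name value' lines, skipping blanks and '#' comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP / TYPE lines carry metadata, not samples
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape(url, timeout=10.0):
    """Pull one target: GET its metrics endpoint and parse the response."""
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))


# Hypothetical response body from a target's /metrics endpoint.
example = """\
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET"} 1027
memory_usage_bytes 52428800
"""
samples = parse_samples(example)
print(samples)
```

Calling `scrape("http://localhost:9100/metrics")` would perform the same parsing against a live target, assuming one is listening on that address.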
>[!question] How does the Prometheus server find the targets?
> Prometheus uses **service discovery** to find targets dynamically. Targets can also be listed statically under `static_configs`, as in the configuration file above.

### Prometheus Exporter
- An **exporter** is a piece of software that fetches metrics from a target application and converts them into a format [[Prometheus]] understands. It then exposes these metrics at the `/metrics` endpoint, where Prometheus can scrape them. A list of exporters for different systems is available [here](https://prometheus.io/docs/instrumenting/exporters/)

>[!question] What if my application is running inside a container?
> You can deploy the exporter as a **sidecar container** alongside your application container. You can also use [Prometheus client libraries](https://prometheus.io/docs/instrumenting/clientlibs/) to instrument your application code and directly expose metrics about requests, exceptions, and other custom metrics.

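The exporter idea (fetch native stats, convert them, expose them at `/metrics`) can be sketched with the Python standard library alone. Everything named here is an assumption made for illustration: `read_app_stats` stands in for whatever the real application reports, and the `myapp_` prefix and port 8000 are arbitrary. Real exporters normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_app_stats():
    """Stand-in for fetching native stats from the target application."""
    return {
        "queue_depth": ("gauge", 7),
        "jobs_processed_total": ("counter", 1200),
    }


def to_exposition(stats):
    """Convert the stats into the text format Prometheus scrapes."""
    lines = []
    for name, (metric_type, value) in sorted(stats.items()):
        lines.append(f"# TYPE myapp_{name} {metric_type}")
        lines.append(f"myapp_{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the converted metrics at /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = to_exposition(read_app_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_exporter(port=8000):
    """Build the HTTP server; call .serve_forever() on the result to run it."""
    return HTTPServer(("localhost", port), MetricsHandler)


print(to_exposition(read_app_stats()))
```

Running `make_exporter().serve_forever()` would then give Prometheus an endpoint to scrape at `http://localhost:8000/metrics`.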
### Prometheus Push Gateway
- Pushgateway is a service that allows you to monitor **short-lived jobs**, like batch jobs or [[Cron Job|cron jobs]], with Prometheus

> [!question] Why is it necessary?
> **Short-lived jobs and scraping:** Prometheus typically works by "scraping" metrics from targets at regular intervals. If a job finishes before the scrape happens, Prometheus might miss its metrics entirely.
>
> **Pushgateway as a buffer:** The Pushgateway acts as an intermediary. Short-lived jobs *push* their metrics to the Pushgateway. Prometheus then scrapes the Pushgateway, ensuring those metrics are collected even if the job has already finished.

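From the job's side, a push is an HTTP request carrying metrics in the text format to the Pushgateway's `/metrics/job/<job_name>` path. The sketch below only builds such a request with the standard library; the gateway address (the Pushgateway's default port 9091 on localhost), the job name, and the metric are assumptions for illustration.

```python
from urllib.parse import quote
from urllib.request import Request


def build_push_request(gateway_url, job, metrics_text):
    """Build (but do not send) a request that pushes metrics for one job."""
    base = gateway_url.rstrip("/")
    url = f"{base}/metrics/job/{quote(job, safe='')}"
    return Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical metric recorded by a batch job just before it exits.
body = "backup_last_success_timestamp_seconds 1730962308\n"
request = build_push_request("http://localhost:9091", "nightly_backup", body)

print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it, assuming a Pushgateway is reachable at that address. Prometheus then scrapes the Pushgateway like any other target.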
### Prometheus Alert Manager
- [[Prometheus]] evaluates the rules and pushes the resulting alerts to the Prometheus Alertmanager, which sends them to notification channels such as email

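How a rule evaluation turns into an alert can be illustrated with a toy model. This is a deliberate simplification and not Prometheus's implementation: real alerting rules are PromQL expressions with a `for` duration, whereas the hypothetical `alert_state` below counts consecutive evaluations over a list of values. The state names (inactive, pending, firing) match the ones Prometheus uses.

```python
def alert_state(values, threshold, for_evaluations):
    """Classify an alert after a sequence of rule evaluations.

    The condition `value > threshold` must hold for `for_evaluations`
    consecutive evaluations, ending with the latest one, before the
    alert moves from 'pending' to 'firing'.
    """
    streak = 0
    for value in values:
        # Any evaluation below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
    if streak == 0:
        return "inactive"
    return "firing" if streak >= for_evaluations else "pending"


# Hypothetical CPU utilisation readings, one per evaluation interval.
print(alert_state([0.20, 0.90, 0.95], threshold=0.8, for_evaluations=3))
print(alert_state([0.90, 0.95, 0.97], threshold=0.8, for_evaluations=3))
print(alert_state([0.90, 0.90, 0.10], threshold=0.8, for_evaluations=3))
```

Only alerts that reach the firing state are handed to the Alertmanager for notification.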
### Prometheus Data Storage
- [[Prometheus]] stores **time series data** in a local, on-disk time series database, and it can optionally integrate with remote storage systems. The data is stored in a **custom time series format**, so it can't be written directly into a [[Database Paradigms#Relational|relational database]]

### Prometheus PromQL
- This is the **query language** used to retrieve data from [[#Prometheus Data Storage|Prometheus's time series database]]. Visualisation tools like **Grafana use PromQL** to create informative dashboards

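Tools such as Grafana send PromQL to the Prometheus HTTP server, whose instant-query endpoint is `/api/v1/query` with the expression in the `query` parameter. The sketch below only builds such a URL; the server address and the example expression are assumptions, and `instant_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant PromQL query against a Prometheus server."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


# Per-second request rate over the last five minutes (hypothetical metric name).
url = instant_query_url("http://localhost:9090", "rate(http_requests_total[5m])")
print(url)
```

Fetching that URL from a running server returns a JSON document containing the matching series.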
%% Prometheus is the **de facto standard** for monitoring Kubernetes clusters. It easily integrates with Kubernetes by discovering services dynamically through Kubernetes' API, which enables it to monitor containerized applications without needing manual configuration of targets. %%


## References
---
- [Learn Prometheus Architecture: A Complete Guide](https://devopscube.com/prometheus-architecture/)
- [How Prometheus Monitoring works | Prometheus Architecture explained - YouTube](https://youtu.be/h4Sl21AKiDg?si=VlmLtxKRhGfGCrxZ)