|
| 1 | +--- |
| 2 | +Author: |
| 3 | + - Xinyang YU |
| 4 | +Author Profile: |
| 5 | + - https://linkedin.com/in/xinyang-yu |
| 6 | +tags: |
| 7 | + - devops |
| 8 | +Creation Date: 2024-11-06, 14:39 |
| 9 | +Last Date: 2024-11-07T16:51:48+08:00 |
| 10 | +References: |
| 11 | +draft: |
| 12 | +description: |
| 13 | +--- |
| 14 | +## Abstract |
| 15 | +--- |
| 16 | +- Prometheus is the **de facto standard monitoring tool** for **container and microservices infrastructure** |
| 17 | + |
| 18 | +>[!question] Why do we need Prometheus? |
| 19 | +> **Hundreds of interconnected processes** serve users in these environments. If one process fails, it can trigger a **cascade of failures in other processes**, making it **difficult to pinpoint the root cause** from the end user's perspective. Prometheus provides visibility into the application and infrastructure of these hundreds of processes, **enabling proactive identification of issues instead of reactive debugging**. With Prometheus, we can continuously monitor processes, receive alerts about crashes, and even configure alerts for predefined thresholds. |
| 20 | + |
| 21 | +### Prometheus Metric |
| 22 | +- **HELP attribute:** Provides a clear description of the metric's purpose |
| 23 | +- **TYPE attribute:** Defines the metric's type. Prometheus supports four core metric types: |
| 24 | + - **Counter:** A cumulative metric that represents a single monotonically increasing counter. Its value can only increase or be reset to zero on restart. Examples: the number of requests served, tasks completed, or errors encountered |
| 25 | + - **Gauge:** A metric that represents a single numerical value that can arbitrarily go up and down. Examples: current memory usage, temperature, or the number of active processes |
| 26 | + - **Histogram:** Samples observations (like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values and a count of observations |
| 27 | + - **Summary:** Similar to a histogram, but it calculates quantiles on the client side and exposes them directly. Useful for tracking percentiles of observations such as request latencies |
| 28 | + |
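To make the HELP and TYPE attributes concrete, here is a minimal sketch of how a metric could be rendered in the text format that Prometheus scrapes. The function `render_metric` and the metric names `http_requests_total` and `process_memory_bytes` are hypothetical and used only for illustration. Real applications would normally rely on a Prometheus client library instead of hand-written formatting.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: this sketch does not escape special characters in label values.
    """
    lines = [
        f"# HELP {name} {help_text}",    # HELP attribute: human-readable description
        f"# TYPE {name} {metric_type}",  # TYPE attribute: counter, gauge, histogram, summary
    ]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(name + "{" + pairs + "} " + str(value))
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# A counter only ever goes up (hypothetical metric name).
requests_text = render_metric(
    "http_requests_total", "counter", "Total HTTP requests served.",
    [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "500"}, 3)],
)

# A gauge can go up and down (hypothetical metric name).
memory_text = render_metric(
    "process_memory_bytes", "gauge", "Current memory usage in bytes.",
    [({}, 52428800)],
)

print(requests_text + memory_text)
```

Each sample line is the metric name, optional labels in braces, and the current value.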
| 29 | +### Prometheus Configuration File |
| 30 | +```yaml |
| 31 | +global: |
| 32 | + scrape_interval: 15s |
| 33 | + evaluation_interval: 15s |
| 34 | + |
| 35 | +rule_files: |
| 36 | + - "first.rules" |
| 37 | + - "second.rules" |
| 38 | + |
| 39 | +scrape_configs: |
| 40 | + - job_name: "prometheus" |
| 41 | + static_configs: |
| 42 | + - targets: ['localhost:9090'] |
| 43 | + - job_name: node_exporter |
| 44 | + scrape_interval: 1m |
| 45 | + scrape_timeout: 1m |
| 46 | + static_configs: |
| 47 | + - targets: ['localhost:9100'] |
| 48 | +``` |
| 49 | + |
| 50 | + |
| 51 | +- Under `global`, `scrape_interval: 15s` instructs the Prometheus server to **scrape target endpoints** every **15 seconds**. `evaluation_interval: 15s` evaluates the rules listed under `rule_files` every **15 seconds**, which may trigger alerts that are sent to the [[#Prometheus Alert Manager|Prometheus Alert Manager]] |
| 52 | +- Scrape job details are specified under `scrape_configs`, allowing you to **override global settings** like `scrape_interval` |
| 53 | + |
| 54 | +>[!important] Default setting for each scrape job |
| 55 | +> Unless overridden, each job uses default settings such as `metrics_path: "/metrics"` and `scheme: "http"`. |
| 56 | + |
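The interaction between global settings, per-job overrides, and per-job defaults can be sketched as a simple dictionary merge. This is a simplified model for illustration, not how Prometheus is implemented internally. The helper `scrape_plan` is hypothetical, and the job dictionaries mirror the two jobs in the configuration file above.

```python
# Settings taken from the `global` block of the configuration above.
GLOBAL = {"scrape_interval": "15s"}

# Per-job defaults mentioned in the callout above.
JOB_DEFAULTS = {"metrics_path": "/metrics", "scheme": "http"}


def scrape_plan(job):
    """Return (url, interval) pairs for every static target of one scrape job."""
    # Later dictionaries win, so job-level keys override global and default values.
    settings = {**GLOBAL, **JOB_DEFAULTS, **job}
    plan = []
    for group in job.get("static_configs", []):
        for target in group["targets"]:
            url = f'{settings["scheme"]}://{target}{settings["metrics_path"]}'
            plan.append((url, settings["scrape_interval"]))
    return plan


jobs = [
    {"job_name": "prometheus",
     "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "node_exporter", "scrape_interval": "1m",
     "static_configs": [{"targets": ["localhost:9100"]}]},
]

plans = {job["job_name"]: scrape_plan(job) for job in jobs}
for name, plan in plans.items():
    print(name, plan)
```

The `prometheus` job inherits the global 15-second interval, while `node_exporter` overrides it with one minute. Both fall back to `http` and `/metrics`.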
| 57 | +## Prometheus Architecture |
| 58 | +--- |
| 59 | +![[prometheus_architecture.gif]] |
| 60 | + |
| 61 | +### Prometheus Metric Retrieval |
| 62 | +- The Prometheus server has a **metrics retrieval** component that **pulls data** from different **targets**, which are essentially various processes. This data consists of **units** such as CPU status, exception count, and request count. These units are considered [[#Prometheus Metric]] when monitored by Prometheus. The metrics are then stored in the [[#Prometheus Data Storage|time-series database (TSDB)]] |
| 63 | +- Prometheus also includes an HTTP server component that accepts PromQL queries, enabling integration with visualisation tools like Grafana |
| 64 | + |
| 65 | +>[!important] A pulling system |
| 66 | +> Common monitoring services like AWS CloudWatch and [[Datadog]] use a **pushing** approach, where all processes push metrics to a centralized collection platform. This can create a **high load of network traffic**, and monitoring can become a bottleneck. Additionally, we need to install a daemon on all processes to push metrics, while Prometheus only needs a **scraping endpoint**. The pulling approach is also a convenient way to check if a process is up. |
| 67 | + |
| 68 | +>[!question] How is Prometheus metric pulled? |
| 69 | +> Prometheus metrics are pulled by accessing an **HTTP endpoint on the target**. This endpoint, typically `/metrics`, must expose data in a **format that Prometheus understands**. If the target doesn't have a compatible endpoint, a [[#Prometheus Exporter]] can be used to bridge the gap. |
| 70 | + |
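The pull model can be illustrated with a small, self-contained sketch: a target exposes `/metrics` over HTTP, and a scraper fetches it. It uses only the Python standard library. The metric name `demo_scrapes_total`, the handler class, and the `scrape` helper are hypothetical. A real target would typically use a Prometheus client library, and the real scraper is the Prometheus server itself.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPES_SERVED = 0  # hypothetical counter kept in process memory


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global SCRAPES_SERVED
        if self.path != "/metrics":
            self.send_error(404)
            return
        SCRAPES_SERVED += 1
        body = (
            "# HELP demo_scrapes_total Number of times /metrics was served.\n"
            "# TYPE demo_scrapes_total counter\n"
            f"demo_scrapes_total {SCRAPES_SERVED}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def scrape(url):
    """Fetch a metrics endpoint once, as a scraper would on every interval."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # ignore proxies
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

first = scrape(f"http://127.0.0.1:{port}/metrics")
second = scrape(f"http://127.0.0.1:{port}/metrics")
print(second)

server.shutdown()
server.server_close()
```

Because the target only answers requests, a failed scrape also tells the scraper that the target is down, which is the availability check mentioned above.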
| 71 | +>[!question] How does Prometheus server find the targets? |
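| 72 | +> Prometheus finds targets either from a static list (`static_configs` in the configuration file) or dynamically through **service discovery**, for example by querying the Kubernetes API for running services. |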
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | +### Prometheus Exporter |
| 77 | +- An **exporter** is a piece of software that fetches metrics from a target application and converts them into a format [[Prometheus]] understands. It then exposes these metrics at the `/metrics` endpoint, where Prometheus can scrape them. You can find a list of exporters for different processes [here](https://prometheus.io/docs/instrumenting/exporters/) |
| 78 | + |
| 79 | +>[!question] What if my application is running inside a container? |
| 80 | +> You can deploy the exporter as a **sidecar container** alongside your application container. You can also use [Prometheus client libraries](https://prometheus.io/docs/instrumenting/clientlibs/) to instrument your application code and directly expose metrics about requests, exceptions, and other custom metrics. |
| 81 | + |
| 82 | + |
| 83 | +### Prometheus Push Gateway |
| 84 | +- Pushgateway is a service that allows you to monitor **short-lived jobs**, like batch jobs or [[Cron Job|cron jobs]], with Prometheus |
| 85 | + |
| 86 | +> [!question] Why is it necessary? |
| 87 | +> **Short-lived jobs and scraping:** Prometheus typically works by "scraping" metrics from targets at regular intervals. If a job finishes before the scrape happens, Prometheus might miss its metrics entirely. |
| 88 | +> |
| 89 | +> **Pushgateway as a buffer:** The Pushgateway acts as an intermediary. Short-lived jobs *push* their metrics to the Pushgateway. Prometheus then scrapes the Pushgateway, ensuring those metrics are collected even if the job has already finished. |
| 90 | + |
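As a rough sketch of the push step, a short-lived job can send its metrics to the Pushgateway with a plain HTTP request to `/metrics/job/<job_name>`. The gateway address `http://localhost:9091`, the job name `nightly_backup`, and the metric `backup_last_success_timestamp_seconds` are assumptions for illustration. The request is only built here, not sent, so the example runs without a Pushgateway.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metrics_text):
    """Build (but do not send) a request that pushes metrics for one job."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",  # PUT replaces all metrics previously pushed for this job
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical metric recorded by a batch job just before it exits.
payload = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1730900000\n"
)

request = build_push_request("http://localhost:9091", "nightly_backup", payload)
print(request.get_method(), request.full_url)

# To actually push, a Pushgateway must be running at the assumed address:
# urllib.request.urlopen(request, timeout=5)
```

Prometheus then scrapes the Pushgateway like any other target, so the pushed values stay available after the job has exited.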
| 91 | + |
| 92 | +### Prometheus Alert Manager |
| 93 | +- [[Prometheus]] evaluates the alerting rules and pushes the resulting alerts to the Prometheus Alert Manager, which sends them to notification channels such as email |
| 94 | + |
| 95 | +### Prometheus Data Storage |
| 96 | +- [[Prometheus]] stores **time series data** on a local disk-based time series database, but it can also optionally integrate with remote storage systems. The data is stored in a **custom time series format**, so we can't write it directly into a [[Database Paradigms#Relational|relational database]] |
| 97 | + |
| 98 | +### Prometheus PromQL |
| 99 | +- PromQL is the **query language** used to retrieve data from [[#Prometheus Data Storage|Prometheus's time series database]]. Visualisation tools like **Grafana use PromQL** to create informative dashboards |
| 101 | + |
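A tool such as Grafana sends PromQL to the Prometheus HTTP server. A minimal sketch of building such a request against the instant-query endpoint `/api/v1/query` is shown below. The server address, the metric `http_requests_total`, and the `job="api"` label are assumptions for illustration. The URL is only constructed, not requested.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant PromQL query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})  # percent-encodes braces, quotes, brackets
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Per-second request rate over the last 5 minutes (hypothetical metric and label).
promql = 'rate(http_requests_total{job="api"}[5m])'
url = instant_query_url("http://localhost:9090", promql)
print(url)
```

The server replies with JSON containing the matching time series, which the dashboard then plots.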
| 102 | +%% Prometheus is the **de facto standard** for monitoring Kubernetes clusters. It easily integrates with Kubernetes by discovering services dynamically through Kubernetes' API, which enables it to monitor containerized applications without needing manual configuration of targets. %% |
| 103 | + |
| 104 | + |
| 105 | +## References |
| 106 | +--- |
| 107 | +- [Learn Prometheus Architecture: A Complete Guide](https://devopscube.com/prometheus-architecture/) |
| 108 | +- [How Prometheus Monitoring works | Prometheus Architecture explained - YouTube](https://youtu.be/h4Sl21AKiDg?si=VlmLtxKRhGfGCrxZ) |