processor_tda: Add documentation for processor_tda #2277
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,288 @@ | ||
| # TDA (Topological Data Analysis) | ||
|
|
||
|
|
||
| <img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=ee1ad690-a3e9-434f-9635-3e53c670e96c" /> | ||
|
||
|
|
||
| The `tda` processor applies **Topological Data Analysis (TDA)**, specifically **persistent homology**, to a Fluent Bit metrics stream and exports **Betti numbers** that summarize the shape of recent behavior in metric space. | ||
|
|
||
| This processor is intended for detecting **phase transitions**, **regime changes**, and **intermittent instabilities** that are difficult to detect from individual counters, gauges, or standard statistical aggregates. | ||
| It can, for example, differentiate between a single, one-off failure and an extended period of intermittent failures where the system never settles into a stable regime. | ||
| Currently, `tda` works only in the **metrics pipeline** (`processors.metrics`). | ||
|
|
||
| --- | ||
|
|
||
| ## Configuration parameters | ||
|
|
||
| The `tda` processor supports the following configuration parameters: | ||
|
|
||
| | Key | Description | Default | | ||
| | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | ||
| | `window_size` | Number of samples to keep in the TDA sliding window. This controls how far back in time the topology is estimated. | `60` | | ||
| | `min_points` | Minimum number of samples required in the window before running TDA. Until this limit is reached, no Betti metrics are emitted. | `10` | | ||
| | `embed_dim` | Delay embedding dimension `m`. `m = 1` disables embedding (original behavior). For example, `m = 3` reconstructs state vectors `(x_t, x_{t-τ}, x_{t-2τ})` as suggested by Takens' theorem. | `3` | | ||
|
|
||
|
Contributor
Author
Takens is the name of a researcher, so the term should be valid, but Vale flags it. |
||
| | `embed_delay` | Delay `τ` in samples between successive lags used in delay embedding. | `1` | | ||
| | `threshold` | Distance scale selector. `0` enables an automatic **multi-quantile scan** across several candidate thresholds; a value in `(0, 1)` is interpreted as a single quantile used to pick the Rips radius. | `0` | | ||
|
|
||
| All parameters are optional; defaults are suitable as a starting point for many workloads. | ||
|
|
||
| --- | ||
|
|
||
| ## How it works | ||
|
|
||
| ### 1. Metric aggregation and normalization | ||
|
|
||
| On each metrics flush, `tda`: | ||
|
|
||
| 1. **Groups metrics by `(namespace, subsystem)`** | ||
| All counters, gauges, and untyped metrics are traversed. For each `cmt_map`, the pair `(ns, subsystem)` is hashed and assigned a **feature index**. This produces a fixed-dimensional feature vector of length `feature_dim` (number of `(ns, subsystem)` groups). | ||
|
|
||
|
Contributor
Author
Untyped is one of the metric types in cmetrics, the Prometheus-like library used by Fluent Bit. It should be valid, but Vale flags it. |
||
|
|
||
| 2. **Aggregates values per group** | ||
| For each group, all static and labeled metrics are summed into the corresponding feature dimension. | ||
|
|
||
| 3. **Converts counters to approximate rates** | ||
| The processor keeps the previous raw snapshot `last_vec` and timestamp `last_ts`. For each dimension: | ||
|
|
||
| * `diff = now_raw - prev_raw` | ||
| * `dt_sec = (ts_now - ts_prev) / 1e9` | ||
| * `rate = diff / dt_sec` | ||
| A safeguard ensures `dt_sec > 0`. | ||
|
|
||
| 4. **Applies signed `log1p` normalization** | ||
| To stabilize very different magnitudes and bursty traffic, each rate is mapped to | ||
|
|
||
| `norm = log1p(|rate|)`, and the sign of `rate` is reattached. This yields a vector that is roughly scale-invariant but still sensitive to relative changes in rates across groups. | ||
|
|
||
|
|
||
| The resulting normalized vector is written into a **ring buffer window** (`tda_window`), implemented through a lightweight circular buffer (`lwrb`) that stores timestamped samples. | ||
| The window maintains at most `window_size` samples; older samples are dropped when the buffer is full. | ||
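The rate conversion and signed `log1p` steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the plugin's C code: the function names `to_rates` and `signed_log1p` are hypothetical, the fallback used when `dt_sec <= 0` is an assumption (the text only says a safeguard exists), and `collections.deque` stands in for the `lwrb` ring buffer.

```python
import math
from collections import deque


def to_rates(now_raw, prev_raw, ts_now_ns, ts_prev_ns):
    """Convert two raw snapshots into per-second rates (timestamps in nanoseconds)."""
    dt_sec = (ts_now_ns - ts_prev_ns) / 1e9
    if dt_sec <= 0:
        # Assumption: fall back to one second; the real safeguard may differ.
        dt_sec = 1.0
    return [(now - prev) / dt_sec for now, prev in zip(now_raw, prev_raw)]


def signed_log1p(rate):
    """Compress the magnitude with log1p and reattach the sign of the rate."""
    return math.copysign(math.log1p(abs(rate)), rate)


# Two feature groups observed one second apart.
rates = to_rates([150.0, 40.0], [100.0, 50.0], 2_000_000_000, 1_000_000_000)
sample = [signed_log1p(r) for r in rates]

# A bounded deque drops the oldest sample once window_size is reached.
window = deque(maxlen=60)
window.append(sample)
```

A negative rate (for example, a gauge that decreased) keeps its sign after normalization, so increases and decreases remain distinguishable in the window.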
|
|
||
| ### 2. Sliding window and delay embedding | ||
|
|
||
| Let the ring buffer contain `n_raw` samples and the feature dimension be `D = feature_dim`. To capture temporal structure, `tda` supports an optional **delay embedding**: | ||
|
|
||
| * Embedding dimension: `m = embed_dim` (forced to `1` if `embed_dim <= 0`) | ||
| * Lag (in integer samples): `τ = embed_delay` (ignored when `m = 1`) | ||
|
|
||
| For each valid time index `t`, a reconstructed state vector is built as: | ||
|
|
||
| $$ | ||
| x_t \;\to\; (x_t,\; x_{t-\tau},\; \dots,\; x_{t-(m-1)\tau}) | ||
|
|
||
|
Contributor
Author
This is KaTeX-style notation, but Vale flags it. |
||
| $$ | ||
|
|
||
| where each `x_·` is the **D-dimensional normalized metrics vector** at that time. This yields embedded points in $\mathbb{R}^{mD}$. | ||
|
|
||
| Because all lags must be inside the window, the number of embedded points is: | ||
|
|
||
| $$ | ||
| n_{\text{embed}} = n_{\text{raw}} - (m - 1)\tau | ||
| $$ | ||
|
|
||
| If `n_raw < (m − 1)τ + 1`, TDA is skipped until enough data has accumulated. | ||
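As a minimal sketch of the embedding described above, assuming the window is a Python list of equal-length samples ordered oldest first and that `tau >= 1` (the helper name `delay_embed` is hypothetical, not a function from the plugin):

```python
def delay_embed(window, m, tau):
    """Build delay-embedded points from D-dimensional samples (oldest first)."""
    if m <= 0:
        m = 1  # mirrors the documented fallback for embed_dim <= 0
    span = (m - 1) * tau  # how far back the oldest lag reaches
    if len(window) - span <= 0:
        return []  # not enough samples yet, so TDA would be skipped
    points = []
    for t in range(span, len(window)):
        point = []
        for k in range(m):
            # Concatenate x_t, x_{t-tau}, ..., x_{t-(m-1)tau} into one vector.
            point.extend(window[t - k * tau])
        points.append(point)
    return points


window = [[float(i)] for i in range(5)]  # five 1-dimensional samples
points = delay_embed(window, m=3, tau=1)
```

With five samples, `m = 3`, and `τ = 1`, the sketch produces `5 - (3 - 1) * 1 = 3` embedded points, each of length `m * D = 3`.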
|
|
||
| This embedding follows the idea of **Takens' theorem**, which states that, under mild conditions, the dynamics of a system can be reconstructed from delay-embedded observations of a single time series or a low-dimensional observable [2]. In this plugin, the observable is the multi-dimensional vector of aggregated metrics. | ||
|
|
||
|
Contributor
Author
Takens is the name of a person, a researcher in mathematics. |
||
|
|
||
| Intuitively: | ||
|
|
||
| * `embed_dim = 1`: only the current "snapshot" geometry is visible. | ||
| * `embed_dim > 1`: **loops and recurrent trajectories** in the joint evolution of metrics become visible, which later show up as **H₁ (Betti₁) features**. | ||
|
|
||
| ### 3. Distance matrix construction | ||
|
|
||
| For the embedded points $x_i \in \mathbb{R}^{mD}$ (`i = 0..n_embed-1`), `tda` builds a **dense Euclidean distance matrix**: | ||
|
|
||
| $$ | ||
| d(i, j) = \left\| x_i - x_j \right\|_2 | ||
| $$ | ||
|
|
||
| The implementation iterates over all pairs `(i, j)` with `i > j`, accumulates squared differences across both feature dimensions and lags, and then takes the square root; the matrix is stored symmetrically with zeros on the diagonal. | ||
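The pairwise loop described above might look like the following Python sketch. It assumes the embedded points are plain lists of floats; the name `distance_matrix` is hypothetical.

```python
import math


def distance_matrix(points):
    """Dense symmetric Euclidean distance matrix with zeros on the diagonal."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # only pairs with i > j are computed
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            dist[i][j] = dist[j][i] = math.sqrt(sq)  # mirror into the upper triangle
    return dist


dist = distance_matrix([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
```

Because each embedded point already concatenates all lags, a single sum over its coordinates covers both the feature dimensions and the lags.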
|
|
||
| ### 4. Threshold selection (Rips scale) | ||
|
|
||
| Persistent homology requires a **scale parameter** (Rips radius / distance threshold). The plugin supports two modes: | ||
|
|
||
| 1. **Automatic multi-quantile scan** (`threshold = 0`, default) | ||
|
|
||
| * The off-diagonal distances are collected, sorted, and several quantiles are evaluated, for example `q ∈ {0.10, 0.20, …, 0.90}`. | ||
| * For each candidate quantile `q`, a threshold `r_q` is chosen and Betti numbers are computed using Ripser. | ||
| * The plugin prefers the scale where **Betti₁** (loops) is maximized; if all Betti₁ are zero, it falls back to Betti₀ as a secondary indicator. | ||
|
|
||
| 2. **Fixed quantile mode** (`0 < threshold < 1`) | ||
|
|
||
| * `threshold` is interpreted as a single quantile `q`. The Rips radius is set at this quantile of all pairwise distances. | ||
| * The multi-quantile scan still runs internally for robustness, but reported diagnostics (for example, debug logs) will reflect the user-selected quantile. | ||
|
|
||
| Internally, quantile selection is handled by `tda_choose_threshold_from_dist`, which gathers all `i > j` entries of the distance matrix, sorts them, and picks the specified quantile index. | ||
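The quantile lookup can be sketched as follows. This is not the body of `tda_choose_threshold_from_dist`; the helper name `choose_threshold` is hypothetical, and the index rule (floor of `q * count`, clamped to the last element) is an assumption, since the text does not specify how the quantile index is rounded.

```python
def choose_threshold(dist, q):
    """Pick the q-quantile of the off-diagonal (i > j) distances."""
    values = sorted(dist[i][j] for i in range(len(dist)) for j in range(i))
    if not values:
        return 0.0  # fewer than two points: no pairwise distances
    # Assumption: floor index, clamped to the last element.
    idx = min(int(q * len(values)), len(values) - 1)
    return values[idx]


dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
radius = choose_threshold(dist, 0.5)
```

A low quantile yields a small radius that connects only the closest points, while a high quantile connects most of the cloud.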
|
|
||
| ### 5. Persistent Homology through Ripser | ||
|
|
||
| Once the dense distance matrix is built, it is passed to a thin wrapper around **Ripser**, a well-known implementation of Vietoris-Rips persistent homology: | ||
|
|
||
| 1. **Compression and C API** | ||
|
|
||
| * The dense `n_embed × n_embed` matrix is converted into Ripser's `compressed_lower_distance_matrix`. | ||
| * The wrapper function `flb_ripser_compute_betti_from_dense_distance` runs Ripser up to `max_dim = 2` (H₀, H₁, H₂), using coefficients in $\mathbb{Z}/2\mathbb{Z}$, and accumulates persistence intervals into Betti numbers with a small persistence cutoff to ignore very short-lived noise features. | ||
|
|
||
| 2. **Interval aggregation** | ||
|
|
||
| * A callback (`interval_recorder`) receives all persistence intervals $(\text{birth}, \text{death})$ from Ripser. | ||
| * Intervals with very small persistence are filtered out, and the remaining ones are counted per homology dimension to form Betti numbers. | ||
|
|
||
| 3. **Multi-scale selection** | ||
|
|
||
| * For each candidate threshold, Betti numbers are computed. | ||
| * The "best" scale is chosen as the one with the largest Betti₁ (loops); if Betti₁ is zero across scales, the plugin picks the scale where Betti₀ is largest. | ||
| * The corresponding Betti₀, Betti₁, and Betti₂ values are then exported as Fluent Bit gauges. | ||
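The selection rule in the multi-scale step can be expressed compactly. The sketch below assumes each candidate has already been reduced to a `(threshold, betti0, betti1, betti2)` tuple; the name `select_best_scale` is hypothetical, and the tie-breaking (first candidate wins) is an assumption that the text does not specify.

```python
def select_best_scale(results):
    """results: list of (threshold, betti0, betti1, betti2), one per candidate scale."""
    if not results:
        return None
    if any(r[2] > 0 for r in results):
        # Prefer the scale with the most loops (largest Betti1).
        return max(results, key=lambda r: r[2])
    # No loops at any scale: fall back to the largest Betti0.
    return max(results, key=lambda r: r[1])


best = select_best_scale([(0.1, 5, 0, 0), (0.2, 3, 2, 0), (0.3, 1, 2, 1)])
```

Here `max` returns the first candidate that attains the maximum, so the scale at `0.2` is chosen over the tied scale at `0.3`.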
|
|
||
| ### 6. Exported metrics | ||
|
|
||
| `tda` creates (lazily) three gauge metrics in the `fluentbit_tda_*` namespace: | ||
|
|
||
| | Metric name | Type | Description | | ||
| | ---------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ||
| | `fluentbit_tda_betti0` | gauge | Approximate Betti₀. The number of connected components (clusters) in the embedded point cloud at the selected scale. Large values indicate fragmentation into many "micro-regimes". | | ||
| | `fluentbit_tda_betti1` | gauge | Approximate Betti₁. The number of 1-dimensional loops / cycles in the Rips complex. Non-zero values often signal **recurrent, quasi-periodic, or cycling behavior**, typical of intermittent failure / recovery patterns and other regime switches. | | ||
| | `fluentbit_tda_betti2` | gauge | Approximate Betti₂. The number of 2-dimensional voids (higher-order structures). These can appear when the system explores different "surfaces" in state space, for example, transitioning between distinct operating modes. | | ||
|
|
||
| Each metric is timestamped with the current time at the moment of TDA computation and is exported through the same metrics context it received, so downstream metric outputs can scrape or forward them like any other Fluent Bit metric. | ||
|
|
||
| --- | ||
|
|
||
| ## Interpreting Betti numbers | ||
|
|
||
| Topologically, Betti numbers count the number of "holes" of each dimension in a space: | ||
|
|
||
| * **Betti₀**: connected components (0-dimensional clusters). | ||
| * **Betti₁**: 1-dimensional holes (loops / cycles). | ||
| * **Betti₂**: 2-dimensional voids, and so on. | ||
|
|
||
| In our context: | ||
|
|
||
| * The sliding window of metrics is a **point cloud in phase space**. | ||
| * The Rips complex at a given scale connects points that are close in this space. | ||
| * Betti numbers summarize the topology of this complex. | ||
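To make the Betti₀ case concrete: at a fixed scale, Betti₀ of the Rips complex equals the number of connected components of the graph that links every pair of points within that distance. The sketch below counts those components directly with a union-find structure. It illustrates the definition only; the plugin derives its values from Ripser's persistence intervals, and the name `betti0_at_scale` is hypothetical.

```python
import math


def betti0_at_scale(points, radius):
    """Count connected components when points within `radius` are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})


cloud = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
components = betti0_at_scale(cloud, 1.5)
```

The two tight pairs in `cloud` form two components at radius `1.5`; a much smaller radius leaves every point isolated, and a much larger one merges everything into a single component.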
|
|
||
| Some practical patterns: | ||
|
|
||
| 1. **Stable regime** | ||
|
|
||
| * Metrics fluctuate near a single attractor. | ||
| * Betti₀ is small (often between 1 and a few, and it saturates over long runs); Betti₁ and Betti₂ are typically `0` or very small. | ||
|
|
||
| 2. **Single, one-off failure** | ||
|
|
||
| * A brief outage or spike happens once and resolves. | ||
| * The embedding sees a short excursion but no sustained cycling, so Betti₁ and Betti₂ often remain near `0`. | ||
| * In the provided HTTP example, a single failing chunk does not significantly raise Betti₁/₂. | ||
|
|
||
| 3. **Intermittent failure / unstable regime** | ||
|
|
||
| * The system repeatedly bounces between "healthy" and "unhealthy" states (for example, repeated `Connection refused` / `broken connection` errors interspersed with 200 responses). | ||
| * The trajectory in phase space forms **loops**: metrics move away from the healthy region and then return, many times. | ||
| * Betti₁ (and occasionally Betti₂) increases noticeably while this behavior persists, reflecting the emergence of non-trivial cycles in the metric dynamics. | ||
|
|
||
| In the sample output, the HTTP output oscillates between success and various "Connection refused" and "broken connection" errors. | ||
| As this occurs, `fluentbit_tda_betti1` and `fluentbit_tda_betti2` grow from small values to larger plateaus (for example, Betti₁ around 10-13, Betti₂ around 1-2) while Betti₀ also increases. | ||
| This is a direct signature of a **phase transition** from a stable regime to one with persistent, intermittent instability. | ||
|
|
||
| These interpretations are consistent with results from condensed matter physics and dynamical systems, where persistent homology has been used to detect phase transitions and changes in underlying order purely from data (References 1 and 2). | ||
|
|
||
| --- | ||
|
|
||
| ## Configuration examples | ||
|
|
||
| ### Basic setup with `fluentbit_metrics` | ||
|
|
||
| The following example computes TDA on Fluent Bit's own internal metrics, using `metrics_selector` to remove a few high-cardinality or uninteresting metrics before feeding them into `tda`: | ||
|
|
||
| ```yaml | ||
| service: | ||
|   http_server: On | ||
|   http_port: 2021 | ||
|
|
||
| pipeline: | ||
|   inputs: | ||
|     - name: dummy | ||
|       tag: log.raw | ||
|       samples: 10000 | ||
|
|
||
|     - name: fluentbit_metrics | ||
|       tag: metrics.raw | ||
|
|
||
|       processors: | ||
|         metrics: | ||
|           # Optionally exclude metrics you don't want in the TDA feature vector | ||
|           - name: metrics_selector | ||
|             metric_name: /process_start_time_seconds/ | ||
|             action: exclude | ||
|
|
||
|           - name: metrics_selector | ||
|             metric_name: /build_info/ | ||
|             action: exclude | ||
|
|
||
|           # Perform TDA on the remaining metrics | ||
|           - name: tda | ||
|             # window_size: 60 # optional tuning | ||
|             # min_points: 10 | ||
|             # embed_dim: 3 | ||
|             # embed_delay: 1 | ||
|             # threshold: 0 # auto multi-quantile scan | ||
|
|
||
|   outputs: | ||
|     - name: stdout | ||
|       match: '*' | ||
| ``` | ||
|
|
||
| With this configuration, you will see time series like: | ||
|
|
||
| ```text | ||
| fluentbit_tda_betti0 = 39 | ||
| fluentbit_tda_betti1 = 7 | ||
| fluentbit_tda_betti2 = 0 | ||
| ... | ||
| fluentbit_tda_betti0 = 56 | ||
| fluentbit_tda_betti1 = 13 | ||
| fluentbit_tda_betti2 = 2 | ||
| ``` | ||
|
|
||
| These Betti metrics can be scraped by Prometheus, forwarded to an observability backend, and used in alerts (for example, triggering on sudden increases in `fluentbit_tda_betti1` as a signal of emerging instability in the pipeline). | ||
|
|
||
| ### Emphasizing short-term cycles with delay embedding | ||
|
|
||
| To focus on shorter-term cyclic behavior—for example, oscillations in retry logic and error counters—you can lower `window_size` and adjust the embedding parameters: | ||
|
|
||
| ```yaml | ||
| processors: | ||
|   metrics: | ||
|     - name: tda | ||
|       window_size: 30 # shorter temporal horizon | ||
|       min_points: 15 # require at least half the window | ||
|       embed_dim: 4 # look at 4 successive states | ||
|       embed_delay: 1 # each lag = 1 metrics interval | ||
|       threshold: 0.2 # use 20th percentile of distances | ||
| ``` | ||
|
|
||
| This configuration reconstructs the system in an effective dimension of `4 × feature_dim` and tends to highlight tight loops that occur within roughly 4-10 sampling intervals. | ||
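As a worked check of the sizes involved, assume the window is already full (`n_raw = window_size = 30`) and apply the embedded-point formula from the delay-embedding section with `m = 4` and `τ = 1`:

$$
n_{\text{embed}} = 30 - (4 - 1)\cdot 1 = 27, \qquad \frac{n_{\text{embed}}\,(n_{\text{embed}} - 1)}{2} = \frac{27 \cdot 26}{2} = 351
$$

Under that assumption, each TDA run works on 27 points in $\mathbb{R}^{4D}$ and 351 pairwise distances, so `threshold: 0.2` selects a radius near the 70th smallest distance (the exact index depends on how the implementation rounds the quantile).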
|
|
||
| --- | ||
|
|
||
| ## When to use `tda` | ||
|
|
||
| `tda` is particularly useful when: | ||
|
|
||
| * You suspect **non-linear or multi-modal behavior** in your system (for example, on/off regimes, congestion collapse, periodic retries). | ||
| * Standard indicators (mean, percentiles, error rates) show "noise," but you want to know whether that noise hides **coherent structure**. | ||
| * You want to build alerts not simply on "levels" of metrics, but on **changes in the topology** of system behavior. For example: | ||
|
|
||
| * "Raise an alert if Betti₁ remains above 5 for more than 5 minutes." | ||
| * "Mark windows where Betti₂ becomes non-zero as potential phase transitions." | ||
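The first rule above can be prototyped offline before it is encoded in an alerting system. The sketch below is one possible reading of "remains above 5 for more than 5 minutes"; the function name `sustained_above` and the `(timestamp_sec, value)` sample format are hypothetical, and a real deployment would more likely express this in the alerting backend's own rule language.

```python
def sustained_above(samples, limit, min_duration_sec):
    """samples: (timestamp_sec, value) pairs, oldest first.

    True if the value has stayed above `limit` for more than
    `min_duration_sec`, measured up to the most recent sample.
    """
    start = None  # timestamp where the current run above the limit began
    for ts, value in samples:
        if value > limit:
            if start is None:
                start = ts
        else:
            start = None  # any dip resets the run
    return start is not None and samples[-1][0] - start > min_duration_sec


betti1 = [(0, 7), (60, 8), (120, 9), (180, 6), (240, 7), (300, 8), (360, 9)]
alert = sustained_above(betti1, limit=5, min_duration_sec=300)
```

Resetting on every dip means a single sample at or below the limit restarts the clock, which keeps brief excursions from triggering the alert.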
|
|
||
| Because the plugin operates on an arbitrary selection of metrics (chosen upstream through `metrics_selector` or by how you configure `fluentbit_metrics`), you can tailor the TDA to focus on: | ||
|
|
||
| * Network health (latency histograms, connection failures, TLS handshake errors), | ||
| * Resource saturation (CPU, memory, buffer usage), | ||
| * Pipeline-level signals (retries, DLQ usage, chunk failures), | ||
| * Or any other metric subset that meaningfully characterizes the state of your system. | ||
|
|
||
| --- | ||
|
|
||
| ## References | ||
|
|
||
| 1. I. Donato, M. Gori, A. Sarti, "Persistent homology analysis of phase transitions," _Physical Review E_, 93, 052138, 2016. | ||
| 2. F. Takens, "Detecting strange attractors in turbulence," in D. Rand and L.-S. Young (eds.), _Dynamical Systems and Turbulence_, Lecture Notes in Mathematics, vol. 898, Springer, 1981, pp. 366-381. | ||