Cherry-pick #44769: Allow disabling chart datasets: backend#44924
Conversation
**Related issue:** For #44077

# Details

This PR implements enforcement of the "disable dataset" feature.

When a dataset is disabled globally, we:

* Stop collecting all data for that dataset (the `Collect` method for that dataset is not called in the cron job)
* Remove all previously-collected data for the dataset via an asynchronous job

When a dataset is disabled for one or more fleets, we:

* Provide the list of disabled fleets as an argument to each dataset's `Collect` method. Each dataset is responsible for filtering out hosts in the most efficient way possible
* Scrub the data for the relevant datasets using a bitmask, so that all hosts from the disabled fleets are removed from the data. This is done via an asynchronous job

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [ ] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information. n/a, unreleased
- [X] Input data is properly validated, `SELECT *` is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.
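The per-fleet bitmask scrub described above can be sketched roughly as follows. This is a hypothetical illustration, not Fleet's actual implementation: `scrubHosts` and the plain byte-slice bitmap (bit *i* set means host ID *i* is present in the bucket) are assumptions; the real blob layout and scrub job may differ.

```go
package main

import "fmt"

// scrubHosts returns a copy of a per-host bitmap blob with the bits for the
// given host IDs cleared (Go's AND NOT operator, &^). Bit i set means host
// ID i contributed data to this bucket. Hypothetical sketch only.
func scrubHosts(blob []byte, hostIDs []uint) []byte {
	out := make([]byte, len(blob))
	copy(out, blob)
	for _, id := range hostIDs {
		byteIdx := id / 8
		if int(byteIdx) < len(out) {
			out[byteIdx] &^= 1 << (id % 8) // clear only this host's bit
		}
	}
	return out
}

func main() {
	blob := []byte{0b10101010} // hosts 1, 3, 5, 7 present
	scrubbed := scrubHosts(blob, []uint{3, 7})
	fmt.Printf("%08b\n", scrubbed[0]) // prints: 00100010 (hosts 1 and 5 remain)
}
```

The point of the AND NOT approach is that it is idempotent and only touches the disabled fleets' positions, so hosts on other teams keep their historical data untouched.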
- [X] Timeouts are implemented and retries are limited to avoid infinite loops

## Testing

- [X] Added/updated automated tests
- [X] Where appropriate, [automated tests simulate multiple hosts and test for host isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing) (updates to one host's records do not affect another)
- [ ] QA'd all new/changed functionality manually

### Prerequisites / Test Setup

- [ ] Fleet running with at least 3 teams (call them T1, T2, T3) and ≥3 hosts in each, plus ≥2 hosts with no team
- [ ] At least one host on each team has reported recent uptime (within the bucket window)
- [ ] At least one host in each team is affected by a tracked CVE (so `host_scd_data` for `dataset='cve'` will have non-empty bitmaps)
- [ ] AppConfig: both `features.historical_data.uptime` and `features.historical_data.vulnerabilities` start as `true`; same for every team
- [ ] Let the collection cron run at least one full tick to populate baseline rows in `host_scd_data` for both `uptime` and `cve`
- [ ] Note the current row count per dataset: `SELECT dataset, COUNT(*) FROM host_scd_data GROUP BY dataset;`

---

### 1. Cron Skips Globally-Disabled Datasets

#### 1.1 Global disable of `uptime`

- [x] Disable globally: `PATCH /api/v1/fleet/config` with `features.historical_data.uptime = false`
- [x] Verify activity feed shows `disabled_historical_dataset` for `uptime` (existing behavior)
- [x] Wait for next collection tick (or trigger it via fleetctl debug if available)
- [x] Confirm **no new rows** appear for `dataset='uptime'`: `SELECT MAX(valid_from) FROM host_scd_data WHERE dataset='uptime';` should not advance after the disable
- [x] Confirm cron still writes `cve` rows on the same tick (per-dataset isolation)
- [x] Re-enable: PATCH `historical_data.uptime = true`
- [x] Verify next tick resumes writing `uptime` rows

#### 1.2 Global disable of `vulnerabilities`

- [x] Repeat 1.1 with `features.historical_data.vulnerabilities`
- [x] Confirm `cve` writes stop, `uptime` continues

#### 1.3 Both disabled globally

- [x] Disable both globally
- [x] Confirm cron tick produces zero new rows for either dataset
- [x] Confirm cron does not error or get stuck
- [x] Re-enable both

---

### 2. Per-Fleet Disable — Cron Filters at SQL

#### 2.1 Single team disabled for one dataset

- [x] Disable uptime for T1 only: PATCH team T1 with `features.historical_data.uptime = false`
- [x] Verify scoped `disabled_historical_dataset` activity emitted for T1
- [x] Wait for next cron tick / trigger cron
- [x] Pick a host known to be in T1 (call it `H_T1`); confirm its bit is NOT set in any `uptime` row written *after* the disable by filtering the chart to that host
- [x] Pick a host in T2 (`H_T2`); confirm its bit IS still set in the same rows (T2 is not disabled)
- [x] Pick a no-team host (`H_none`); confirm its bit IS still set (no-team hosts follow the global value)

#### 2.2 Same fleet, different dataset

- [x] With T1's uptime disabled, confirm T1's hosts ARE still written into `cve` rows on subsequent ticks (per-dataset isolation)

#### 2.3 All teams disabled, global on, no-team hosts

- [x] Disable uptime on every team (T1, T2, T3)
- [x] Confirm next tick still writes a row containing only no-team hosts' bits (global is on, no-team hosts always count)
- [x] Re-enable uptime on all teams

---

### 3. Global Scrub — DELETE

#### 3.1 Successful global scrub

- [x] Note baseline: `SELECT COUNT(*) FROM host_scd_data WHERE dataset='uptime';` (should be > 5000 to exercise the loop; if not, manually insert filler rows or run multiple cron ticks)
- [x] Disable uptime globally via the API
- [x] Wait for the worker to pick up the scrub / trigger the job
- [x] Confirm the count drops to 0: `SELECT COUNT(*) FROM host_scd_data WHERE dataset='uptime';`
- [x] Confirm rows for **other datasets** are untouched
- [ ] Test again but disable via GitOps

---

### 4. Per-Fleet Scrub — ANDNOT

#### 4.1 Single-fleet scrub clears bits

- [x] Identify hosts in T1 and record their IDs (call this set `S`)
- [x] Pre-disable, confirm at least one `host_scd_data` row for `dataset='uptime'` has bits set at positions in `S` by filtering the chart to those hosts
- [x] Disable uptime on T1 only, via the API
- [x] Wait for the scrub to run / trigger it
- [x] Confirm: every existing row for `dataset='uptime'` now has NO bits set at any position in `S`. Spot-check by filtering the chart to those hosts
- [x] Confirm rows for `dataset='cve'` (different dataset) are untouched
- [x] Confirm bits for hosts in T2/T3 (not disabled) are still set
- [x] Run test again but disable via GitOps

#### 4.2 Multi-fleet scrub via GitOps batch

- [x] Apply a GitOps spec that flips cve to false on T1 and T3 in a single apply
- [x] Wait for scrub(s) to complete
- [x] Confirm bits for the union of T1∪T3 hosts are cleared from every row of `dataset='cve'`
- [x] Confirm T2 hosts' bits remain set

---

### 5. Activity Feed Cross-Check

- [x] Each global flip emits exactly one `disabled_historical_dataset` activity (existing behavior, unchanged)
- [x] Each per-team flip emits one scoped activity with the team's ID and name
- [x] PATCH submitting unchanged values emits **no** activity and causes **no** scrub (no `host_scd_data` data change observed after the cron tick)
- [x] No new "scrub completed" or "scrub started" activity is emitted (out of scope for v1)
- [x] Re-enable flips emit `enabled_historical_dataset` activities and do NOT emit any scrub-related activity

---

### 6. Regression Spot Checks

- [x] With everything enabled (default), the chart UI renders the same data as before this change (no behavior change in the "all on" case)
- [x] AppConfig YAML round-trip (`fleetctl apply`) is benign: applying the unchanged config produces no scrub jobs and no activities
- [x] GitOps apply with `historical_data` omitted from team specs defaults to `true` (per the gitops-api change) and does not trigger spurious scrubs
- [x] After a full disable+scrub of cve, the `host_scd_data` table has no `dataset='cve'` rows; the chart UI for "vulnerable hosts over time" shows an empty/zero state without errors

---

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Chart collection now supports per-dataset scoping and honors team-level disables; new scrub jobs are registered and worker handlers added.
  * New dataset scrub operations: global and fleet-scoped scrubs; scrubs can be enqueued and are deduplicated to avoid duplicate pending jobs. Historical-data changes enqueue scrubs after save (errors logged, non-blocking).
* **Tests**
  * Added unit tests for scope resolution, scrub enqueue/dedup behavior, scrub workers, scrub application, and low-level blob scrub logic.
* **Documentation**
  * Added OpenSpec metadata for the chart scrub change.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
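The cron-side enforcement exercised in sections 1 and 2 of the plan above can be sketched as a per-dataset dispatch loop. All names here (`dataset`, `runTick`) are hypothetical; the sketch only illustrates the two rules from the PR description: globally disabled datasets are skipped outright, while fleet-level disables are handed to each dataset's collection step so it can exclude those hosts itself (e.g. in SQL).

```go
package main

import "fmt"

// dataset is a hypothetical stand-in for a chart dataset's cron state.
type dataset struct {
	name            string
	globallyEnabled bool
	disabledTeamIDs []uint
}

// runTick simulates one collection tick: skip globally disabled datasets
// entirely (no Collect call), and pass disabled team IDs to the rest.
func runTick(datasets []dataset) []string {
	var collected []string
	for _, d := range datasets {
		if !d.globallyEnabled {
			continue // section 1: no data collected at all
		}
		// Section 2: each dataset filters out hosts on disabled teams;
		// no-team hosts follow the global value, so they still count.
		collected = append(collected,
			fmt.Sprintf("%s (excluding %d teams)", d.name, len(d.disabledTeamIDs)))
	}
	return collected
}

func main() {
	out := runTick([]dataset{
		{name: "uptime", globallyEnabled: false},
		{name: "cve", globallyEnabled: true, disabledTeamIDs: []uint{1, 3}},
	})
	fmt.Println(out) // prints: [cve (excluding 2 teams)]
}
```

This matches what the test plan checks for: a disabled dataset produces no new rows on the tick, while the other dataset on the same tick keeps writing (per-dataset isolation).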
Claude Code Review
This repository is configured for manual code reviews. Comment `@claude review` to trigger a review and subscribe this PR to future pushes, or `@claude review once` for a one-time review.
@sgress454 Konstantin is out today and tomorrow, so I did a quick scan and re-approved this based on his PR approval.
Thanks! I also sprang for the GitHub Claude review AND a last-minute "max effort" local review on this one (well, the main one, not the cherry-pick).
## Codecov Report

❌ Patch coverage is

Additional details and impacted files:

    @@             Coverage Diff              @@
    ##    rc-minor-fleet-v4.85.0    #44924    +/- ##
    ==========================================================
    + Coverage     66.72%    66.79%    +0.07%
    ==========================================================
      Files          2626      2630        +4
      Lines        211198    211797      +599
      Branches       9428      9423        -5
    ==========================================================
    + Hits         140924    141477      +553
    + Misses        57485     57477        -8
    - Partials      12789     12843       +54

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Cherry-pick of #44769 into the RC branch.