Skip to content

Cherry-pick #44692: Optimize data collection: add index and batch deletes#44729

Merged
sgress454 merged 1 commit intorc-minor-fleet-v4.85.0from
sgress454/optimize-data-collection-cp
May 5, 2026
Merged

Cherry-pick #44692: Optimize data collection: add index and batch deletes#44729
sgress454 merged 1 commit intorc-minor-fleet-v4.85.0from
sgress454/optimize-data-collection-cp

Conversation

@sgress454
Copy link
Copy Markdown
Contributor

Cherry-pick of #44692 into the RC branch.

<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #44609

# Details

This PR optimizes the historical data collection system in two ways:

1. Adds an additional index on the `host_scd_data` table allowing more
efficient lookups of rows by their `valid_to`, to optimize both closing
out open rows and deleting old rows
2. Implements batching in the job that deletes old rows, so that it no
longer blocks writes if the collection job happens to happen at the same
time as the cleanup job

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [ ] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.
n/a, unreleased
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements), JS
inline code is prevented especially for url redirects, and untrusted
data interpolated into shell scripts/commands is validated against shell
metacharacters.
- [ ] Timeouts are implemented and retries are limited to avoid infinite
loops

## Testing

- [ ] Added/updated automated tests
- [X] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one hosts's records do not affect another)

- [X] QA'd all new/changed functionality manually

SQL explains -- before:

```
+----+-------------+---------------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table         | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+---------------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | DELETE      | host_scd_data | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 144320 |   100.00 | Using where |
+----+-------------+---------------+------------+------+---------------+------+---------+------+--------+----------+-------------+

+----+-------------+---------------+------------+-------+--------------------------------------+--------------------+---------+-------------+------+----------+-------------+
| id | select_type | table         | partitions | type  | possible_keys                        | key                | key_len | ref         | rows | filtered | Extra       |
+----+-------------+---------------+------------+-------+--------------------------------------+--------------------+---------+-------------+------+----------+-------------+
|  1 | UPDATE      | host_scd_data | NULL       | range | uniq_entity_bucket,idx_dataset_range | uniq_entity_bucket | 604     | const,const | 3030 |   100.00 | Using where |
+----+-------------+---------------+------------+-------+--------------------------------------+--------------------+---------+-------------+------+----------+-------------+
```

Using a test set of data (~144k "open" rows), UPDATES happened at 9 ops
per second.

after:

```
+----+-------------+---------------+------------+-------+----------------------+----------------------+---------+-------+-------+----------+-------------+
| id | select_type | table         | partitions | type  | possible_keys        | key                  | key_len | ref   | rows  | filtered | Extra       |
+----+-------------+---------------+------------+-------+----------------------+----------------------+---------+-------+-------+----------+-------------+
|  1 | DELETE      | host_scd_data | NULL       | range | idx_valid_to_dataset | idx_valid_to_dataset | 5       | const | 55749 |   100.00 | Using where |
+----+-------------+---------------+------------+-------+----------------------+----------------------+---------+-------+-------+----------+-------------+

+----+-------------+---------------+------------+-------+-----------------------------------------------------------+----------------------+---------+-------------------+------+----------+------------------------------+
| id | select_type | table         | partitions | type  | possible_keys                                             | key                  | key_len | ref               | rows | filtered | Extra                        |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------+----------------------+---------+-------------------+------+----------+------------------------------+
|  1 | UPDATE      | host_scd_data | NULL       | range | uniq_entity_bucket,idx_dataset_range,idx_valid_to_dataset | idx_valid_to_dataset | 609     | const,const,const |    4 |   100.00 | Using where; Using temporary |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------+----------------------+---------+-------------------+------+----------+------------------------------+
```

Using the same test set of data, UPDATES happened at 4,910 ops per
second.

For unreleased bug fixes in a release candidate, one of:

- [X] Confirmed that the fix is not expected to adversely impact load
test results
this should significantly improve results!
- [ ] Alerted the release DRI if additional load testing is needed

## Database migrations

- [X] Checked schema for all modified table for columns that will
auto-update timestamps during migration.
- [ ] Confirmed that updating the timestamps is acceptable, and will not
cause unwanted side effects.
- [ ] Ensured the correct collation is explicitly set for character
columns (`COLLATE utf8mb4_unicode_ci`).

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Cleanup now runs in controlled, ordered batches, removing only
closed/historical records while respecting cancellation; error reporting
for cleanup was strengthened.
* Added a new composite index on historical data to improve cleanup and
query performance.
* **Tests**
* Added tests and test helpers validating batched cleanup behavior,
preservation of open records, multi-batch operation, and cancellation
handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@sgress454 sgress454 requested a review from a team as a code owner May 5, 2026 13:33
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@sgress454 sgress454 merged commit ec2560b into rc-minor-fleet-v4.85.0 May 5, 2026
32 of 33 checks passed
@sgress454 sgress454 deleted the sgress454/optimize-data-collection-cp branch May 5, 2026 14:02
@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.79%. Comparing base (9de8af6) to head (eb7b98f).
⚠️ Report is 18 commits behind head on rc-minor-fleet-v4.85.0.

Files with missing lines Patch % Lines
server/chart/internal/mysql/data.go 76.47% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@                    Coverage Diff                     @@
##           rc-minor-fleet-v4.85.0   #44729      +/-   ##
==========================================================
+ Coverage                   66.72%   66.79%   +0.07%     
==========================================================
  Files                        2626     2628       +2     
  Lines                      211198   211402     +204     
  Branches                     9428     9425       -3     
==========================================================
+ Hits                       140924   141211     +287     
+ Misses                      57485    57380     -105     
- Partials                    12789    12811      +22     
Flag Coverage Δ
backend 68.58% <92.30%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants