fix(models/up): harden health check against abuse-inflated baselines #2392
markijbema merged 1 commit into main
Re-enable gemini-3.1-pro-preview monitoring and add protections:
- Add COUNT(DISTINCT kilo_user_id) to detect concentrated traffic
- Require MIN_UNIQUE_USERS_FOR_ALERT (20) distinct users in baseline before alerting on drops (prevents abuse actors from inflating baselines with many accounts)
- Add 10s statement timeout (fail-open) to prevent query from holding a connection on the largest table (~469M rows)
- Enrich response with uniqueUsersCurrent/uniqueUsersBaseline for investigator visibility
- Fail open on errors (timeout/DB) since those aren't evidence of model outages

Incident: 2026-04-13-gemini-3.1-pro-preview-monitor-abuse
});
return NextResponse.json({ healthy: false }, { status: 503 });
// Fail open: a query timeout or DB error is not evidence of a model being down.
The query timeout is currently 15 times the actual time the query takes to load.
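The fail-open behavior in the diff fragment above can be sketched as a pure mapping from a query outcome to a response. This is a minimal sketch, not the PR's code: `QueryOutcome`, `HealthResponse`, and `toHealthResponse` are hypothetical names, the 200 status on the fail-open path is an assumption, and the real route presumably returns a `NextResponse` rather than a plain object.

```typescript
// Hypothetical types for illustration; the actual route handler is not shown here.
type QueryOutcome =
  | { kind: "ok"; hasSignificantDrop: boolean }
  | { kind: "error"; message: string }; // e.g. a statement timeout or a connection failure

interface HealthResponse {
  status: number;
  body: { healthy: boolean; error?: string };
}

function toHealthResponse(outcome: QueryOutcome): HealthResponse {
  if (outcome.kind === "error") {
    // Fail open: a timeout or DB error is not evidence that a model is down,
    // so report healthy and keep the error message for investigators.
    // Assumption: the fail-open path answers with HTTP 200.
    return { status: 200, body: { healthy: true, error: outcome.message } };
  }
  if (outcome.hasSignificantDrop) {
    // Assumption: a detected drop is reported as unhealthy with a 503.
    return { status: 503, body: { healthy: false } };
  }
  return { status: 200, body: { healthy: true } };
}
```

Keeping the decision separate from the database call makes the fail-open rule easy to unit test without a live connection.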
hasSignificantDrop = true;
// - Only alert if the baseline had enough distinct users to represent organic traffic
if (uniqueUsersBaseline >= MIN_UNIQUE_USERS_FOR_ALERT) {
WARNING: Distinct-user threshold does not block the incident pattern
The incident in the PR description already came from a single actor spread across ~280 accounts. Because unique_users_baseline is counting distinct kilo_user_ids over the whole baseline window, that actor still clears the >= 20 gate and the same pause will continue to look like an organic drop. To protect against the described false positive, this check needs a signal the actor cannot inflate by creating accounts (for example IP / JA4 / user-agent concentration, or a per-identity cap).
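To make the concern concrete, here is a minimal sketch of the gate as described in the PR. Only `MIN_UNIQUE_USERS_FOR_ALERT` and its value of 20 come from the PR; `ModelWindowStats`, `shouldAlert`, the 0.9 drop threshold, and the request counts are illustrative assumptions.

```typescript
const MIN_UNIQUE_USERS_FOR_ALERT = 20; // value stated in the PR description
const DROP_THRESHOLD = 0.9; // assumption, based on the ">90% drop" mentioned for the incident

// Hypothetical shape of the per-model aggregates returned by the health query.
interface ModelWindowStats {
  requestsBaseline: number;
  requestsCurrent: number;
  uniqueUsersBaseline: number;
}

function shouldAlert(stats: ModelWindowStats): boolean {
  if (stats.requestsBaseline <= 0) return false; // no baseline, nothing to compare against
  const drop = 1 - stats.requestsCurrent / stats.requestsBaseline;
  const hasSignificantDrop = drop > DROP_THRESHOLD;
  // Only alert if the baseline had enough distinct users to look organic.
  return hasSignificantDrop && stats.uniqueUsersBaseline >= MIN_UNIQUE_USERS_FOR_ALERT;
}

// Illustrative numbers: an actor spread across hundreds of accounts clears the gate,
// while the same traffic from a handful of accounts is suppressed.
const manyAccounts: ModelWindowStats = { requestsBaseline: 2000000, requestsCurrent: 150000, uniqueUsersBaseline: 300 };
const fewAccounts: ModelWindowStats = { requestsBaseline: 2000000, requestsCurrent: 150000, uniqueUsersBaseline: 5 };

const manyAccountsAlert = shouldAlert(manyAccounts); // true: the gate does not block this pattern
const fewAccountsAlert = shouldAlert(fewAccounts); // false: suppressed by the distinct-user gate
```

Under these assumptions the distinct-user gate only filters actors that use fewer than 20 accounts, which is the gap the warning describes.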
I think we need to look at this specific example as well, but this is probably a good step for now
I think it is a good step. As a follow-up, it is worth exploring a concentration-based signal. As you mentioned, the IP join is expensive, but combining it with the user count would make the baseline less prone to these attacks, since a user-level check alone wouldn't have caught this one (280 accounts).
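One possible shape for such a concentration-based signal, as a rough sketch only: measure what share of baseline requests comes from the top few source IPs and require both enough users and low concentration. The names `topKShare` and `baselineLooksOrganic`, the choice of k = 4, and the 0.5 share limit are all assumptions, and the request split below is illustrative rather than taken from the incident data.

```typescript
// Share of all requests that comes from the k busiest keys (e.g. source IPs).
function topKShare(requestsByKey: Map<string, number>, k: number): number {
  const counts = Array.from(requestsByKey.values()).sort((a, b) => b - a);
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return 0;
  const top = counts.slice(0, k).reduce((sum, n) => sum + n, 0);
  return top / total;
}

// Hypothetical combined gate: enough distinct users AND traffic not dominated by a few IPs.
function baselineLooksOrganic(
  uniqueUsers: number,
  requestsByIp: Map<string, number>,
  minUsers: number = 20, // MIN_UNIQUE_USERS_FOR_ALERT from the PR
  k: number = 4, // assumption
  maxTopShare: number = 0.5, // assumption
): boolean {
  return uniqueUsers >= minUsers && topKShare(requestsByIp, k) <= maxTopShare;
}

// Illustrative baseline: 4 heavy IPs plus 386 light ones.
const concentrated = new Map<string, number>();
for (let i = 0; i < 4; i++) concentrated.set(`heavy-${i}`, 425000);
for (let i = 0; i < 386; i++) concentrated.set(`light-${i}`, 1000);

const concentratedShare = topKShare(concentrated, 4);
const concentratedIsOrganic = baselineLooksOrganic(300, concentrated); // false despite 300 users
```

Unlike the account count, the top-IP share does not improve for the actor when they create more accounts, although an actor with many IPs could still evade it.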
Code Review Summary
Status: 1 Issue Found | Recommendation: Address before merge
Files Reviewed: 1
Reviewed by gpt-5.4-20260305 · 224,898 tokens
Summary
Hardens the `/api/models/up` health check against false positives caused by traffic inflating baselines. Re-enables `google/gemini-3.1-pro-preview` monitoring (removed in #2388) with proper protections.

Root cause (2026-04-13 incident): A single actor sent 1.7M requests (81.5% of model traffic) from 4 IPs across ~280 accounts. When they paused, the artificially high baseline made organic traffic look like a >90% drop, triggering a false alert.
Changes:
- Add `COUNT(DISTINCT kilo_user_id)` to the existing query (same `GROUP BY requested_model`, no new grouping dimension) to track user concentration
- Enrich response with `uniqueUsersCurrent`/`uniqueUsersBaseline` for investigator visibility
- Fail open on errors (`healthy: true`) since DB errors aren't evidence of model outages

What this doesn't do (intentionally):
- `GROUP BY` capping: too expensive on the largest table without a supporting index
- […] `microdollar_usage_metadata` + normalized tables
- […] `COUNT(DISTINCT)` causes regression

Verification
- `pnpm typecheck`: no new errors (pre-existing `linkify-it` error unrelated)
- […] `gemini-3.1-pro-preview` is back in the response at `/api/models/up` with `uniqueUsersCurrent`/`uniqueUsersBaseline` fields populated
- […] `queryExecutionTimeMs`: target <2s, budget <5s

Visual Changes
N/A
Reviewer Notes
- `COUNT(DISTINCT kilo_user_id)` adds per-group hash sets, but the number of groups is small (~10 monitored models) so this should be bounded
- Errors now fail open instead of returning the previous failure response (`healthy: false`, 503); the rationale is that a DB timeout should not page on-call for model health
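As an in-memory analogy for the first note, the sketch below counts distinct users per model with one set per group. It is not how the database necessarily executes `COUNT(DISTINCT ...)`; `UsageRow` and `countDistinctUsersByModel` are hypothetical names, with field names borrowed from the columns mentioned in the PR.

```typescript
// Hypothetical row shape using the column names mentioned in the PR.
interface UsageRow {
  requested_model: string;
  kilo_user_id: string;
}

function countDistinctUsersByModel(rows: UsageRow[]): Map<string, number> {
  // One set per group: the number of sets equals the number of distinct models.
  const usersByModel = new Map<string, Set<string>>();
  for (const row of rows) {
    let users = usersByModel.get(row.requested_model);
    if (users === undefined) {
      users = new Set<string>();
      usersByModel.set(row.requested_model, users);
    }
    users.add(row.kilo_user_id); // duplicates are ignored by the set
  }
  const counts = new Map<string, number>();
  usersByModel.forEach((users, model) => counts.set(model, users.size));
  return counts;
}
```

With roughly ten monitored models there are roughly ten sets, so the group count is bounded as the note says; the size of each set still grows with the number of distinct users in the window.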