fix(models/up): harden health check against abuse-inflated baselines #2392

Merged
markijbema merged 1 commit into main from mark/harden-models-up-health-check on Apr 14, 2026

Conversation

Contributor

@markijbema markijbema commented Apr 14, 2026

Summary

Hardens the /api/models/up health check against false positives caused by traffic inflating baselines. Re-enables google/gemini-3.1-pro-preview monitoring (removed in #2388) with proper protections.

Root cause (2026-04-13 incident): A single actor sent 1.7M requests (81.5% of model traffic) from 4 IPs across ~280 accounts. When they paused, the artificially high baseline made organic traffic look like a >90% drop, triggering a false alert.

Changes:

  • Adds COUNT(DISTINCT kilo_user_id) to the existing query (same GROUP BY requested_model, no new grouping dimension) to track user concentration
  • Requires ≥20 distinct users in the baseline window before alerting — prevents abuse actors from triggering drops
  • Adds 10s statement timeout with fail-open semantics (query timeout ≠ model outage) to protect against holding connections on the ~469M row table
  • Enriches response with uniqueUsersCurrent/uniqueUsersBaseline for investigator visibility
  • Changes error handler to fail open (return healthy: true) since DB errors aren't evidence of model outages
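
The distinct-user gate from the list above could look roughly like the following. This is a minimal sketch, not the code in route.ts: `MIN_UNIQUE_USERS_FOR_ALERT` (20) and the `uniqueUsersCurrent`/`uniqueUsersBaseline` names come from this PR, while `ModelWindowStats`, `requestsCurrent`, `requestsBaseline`, `DROP_THRESHOLD` and `isSignificantDrop` are hypothetical. The 0.9 threshold is assumed from the ">90% drop" in the incident description, and both request counts are assumed to be normalized to the same window length.

```typescript
// From the PR: minimum distinct users in the baseline window before alerting.
const MIN_UNIQUE_USERS_FOR_ALERT = 20;
// Assumption: alert when traffic falls by more than 90% against the baseline.
const DROP_THRESHOLD = 0.9;

// Hypothetical shape of one per-model row returned by the health query.
interface ModelWindowStats {
  requestsCurrent: number;
  requestsBaseline: number;
  uniqueUsersCurrent: number;
  uniqueUsersBaseline: number;
}

function isSignificantDrop(stats: ModelWindowStats): boolean {
  // No baseline traffic means there is nothing to compare against.
  if (stats.requestsBaseline <= 0) return false;
  // Too few distinct users: the baseline may not represent organic traffic.
  if (stats.uniqueUsersBaseline < MIN_UNIQUE_USERS_FOR_ALERT) return false;
  const dropRatio = 1 - stats.requestsCurrent / stats.requestsBaseline;
  return dropRatio > DROP_THRESHOLD;
}
```

Under these assumptions, a baseline of 1,000 requests from only 5 users never alerts, whatever the current request count is.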

What this doesn't do (intentionally):

  • No per-user GROUP BY capping — too expensive on the largest table without a supporting index
  • No IP/user-agent filtering — requires joining microdollar_usage_metadata + normalized tables
  • No new indexes — current ~670ms query time is acceptable; will revisit if COUNT(DISTINCT) causes regression

Verification

  • pnpm typecheck — no new errors (pre-existing linkify-it error unrelated)
  • After deploy: verify gemini-3.1-pro-preview is back in the response at /api/models/up with uniqueUsersCurrent/uniqueUsersBaseline fields populated
  • Monitor queryExecutionTimeMs — target <2s, budget <5s

Visual Changes

N/A

Reviewer Notes

  • The COUNT(DISTINCT kilo_user_id) adds per-group hash sets, but the number of groups is small (~10 monitored models) so this should be bounded
  • The fail-open error handler is a deliberate change from the current fail-closed (healthy: false, 503) — rationale is that a DB timeout should not page on-call for model health
  • Threshold of 20 unique users (not 5) because the incident actor had ~280 accounts — need a threshold that represents genuinely broad organic usage
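
The fail-open behaviour in the notes above can be sketched as a wrapper around the health query. It is an illustration under assumptions: the PR uses a database statement timeout, whereas this sketch swaps in a client-side timer so that it stays self-contained. `withTimeout`, `checkFailOpen`, `HealthResult` and the `reason` field are hypothetical names; only the 10s budget and the `healthy: true` result on error reflect the PR.

```typescript
// Hypothetical result shape; the real route returns a richer payload.
interface HealthResult {
  healthy: boolean;
  reason?: string;
}

// From the PR: 10s budget for the health query.
const QUERY_TIMEOUT_MS = 10_000;

// Rejects if `work` has not settled within `timeoutMs` (client-side timer).
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error("health query timed out")),
      timeoutMs,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Fail open: a timeout or DB error is not evidence of a model outage,
// so it is reported as healthy instead of paging on-call.
async function checkFailOpen(
  runHealthQuery: () => Promise<HealthResult>,
  timeoutMs: number = QUERY_TIMEOUT_MS,
): Promise<HealthResult> {
  try {
    return await withTimeout(runHealthQuery(), timeoutMs);
  } catch (err) {
    const reason = err instanceof Error ? err.message : "unknown error";
    return { healthy: true, reason };
  }
}
```

A genuine unhealthy result from the query still passes through unchanged; only failures of the check itself are mapped to healthy.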

Re-enable gemini-3.1-pro-preview monitoring and add protections:

- Add COUNT(DISTINCT kilo_user_id) to detect concentrated traffic
- Require MIN_UNIQUE_USERS_FOR_ALERT (20) distinct users in baseline
  before alerting on drops — prevents abuse actors from inflating
  baselines with many accounts
- Add 10s statement timeout (fail-open) to prevent query from holding
  a connection on the largest table (~469M rows)
- Enrich response with uniqueUsersCurrent/uniqueUsersBaseline for
  investigator visibility
- Fail open on errors (timeout/DB) since those aren't evidence of
  model outages

Incident: 2026-04-13-gemini-3.1-pro-preview-monitor-abuse
@markijbema markijbema requested a review from RSO April 14, 2026 08:07
@markijbema markijbema marked this pull request as ready for review April 14, 2026 08:07
});

return NextResponse.json({ healthy: false }, { status: 503 });
// Fail open: a query timeout or DB error is not evidence of a model being down.
Contributor Author

The query timeout is currently about 15 times the time the query actually takes to run.

hasSignificantDrop = true;
// - Only alert if the baseline had enough distinct users to represent organic traffic

if (uniqueUsersBaseline >= MIN_UNIQUE_USERS_FOR_ALERT) {
Contributor

WARNING: Distinct-user threshold does not block the incident pattern

The incident in the PR description already came from a single actor spread across ~280 accounts. Because unique_users_baseline is counting distinct kilo_user_ids over the whole baseline window, that actor still clears the >= 20 gate and the same pause will continue to look like an organic drop. To protect against the described false positive, this check needs a signal the actor cannot inflate by creating accounts (for example IP / JA4 / user-agent concentration, or a per-identity cap).

Contributor Author

I think we need to look at this specific example as well, but this is probably a good step for now

Contributor

I think it is a good step. As a follow-up, it is worth exploring a concentration-based signal. As you mentioned, the IP join is expensive, but combining it with the user count would make the baseline less prone to these attacks, since the user-level signal alone would not have caught this one (280 accounts).
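
One possible shape for such a concentration-based signal is the share of baseline requests that comes from the few largest sources. This is only a sketch of the idea discussed in this thread, not something the PR implements: `topSourceShare`, `isBaselineConcentrated`, `TOP_N_SOURCES` and `MAX_TOP_SOURCE_SHARE` are hypothetical, the 0.5 cutoff is an arbitrary placeholder, and the per-source counts (for example per IP) are assumed to be available already, which in practice requires the expensive join mentioned above.

```typescript
// Assumption: look at the 4 largest sources, matching the 4 IPs in the incident.
const TOP_N_SOURCES = 4;
// Assumption: placeholder cutoff; a real value would need tuning on production data.
const MAX_TOP_SOURCE_SHARE = 0.5;

// Fraction of all requests contributed by the `topN` largest sources.
function topSourceShare(requestCounts: readonly number[], topN: number): number {
  const sorted = [...requestCounts].sort((a, b) => b - a);
  const total = sorted.reduce((sum, count) => sum + count, 0);
  if (total === 0) return 0;
  const top = sorted.slice(0, topN).reduce((sum, count) => sum + count, 0);
  return top / total;
}

// A concentrated baseline is dominated by a handful of sources, so a drop
// measured against it should not be treated as an organic traffic drop.
function isBaselineConcentrated(
  requestCounts: readonly number[],
  topN: number = TOP_N_SOURCES,
  maxShare: number = MAX_TOP_SOURCE_SHARE,
): boolean {
  return topSourceShare(requestCounts, topN) > maxShare;
}
```

Unlike the distinct-user count, this share does not shrink when an actor spreads the same traffic over more accounts, provided the counts are keyed by something the actor cannot cheaply multiply.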

Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: 1 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| apps/web/src/app/api/models/up/route.ts | 133 | Distinct-user gating still allows a single actor controlling many accounts to inflate the baseline and trigger the same false alert. |

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |

Files Reviewed (1 files)
  • apps/web/src/app/api/models/up/route.ts - 1 issue

Reviewed by gpt-5.4-20260305 · 224,898 tokens

@markijbema markijbema merged commit d8017e6 into main Apr 14, 2026
15 checks passed
@markijbema markijbema deleted the mark/harden-models-up-health-check branch April 14, 2026 08:59
3 participants