fix(models/up): harden health check against abuse-inflated baselines by markijbema · Pull Request #2392 · Kilo-Org/cloud

markijbema · 2026-04-14T07:34:51Z

Summary

Hardens the /api/models/up health check against false positives caused by traffic inflating baselines. Re-enables google/gemini-3.1-pro-preview monitoring (removed in #2388) with proper protections.

Root cause (2026-04-13 incident): A single actor sent 1.7M requests (81.5% of model traffic) from 4 IPs across ~280 accounts. When they paused, the artificially high baseline made organic traffic look like a >90% drop, triggering a false alert.

Changes:

Adds COUNT(DISTINCT kilo_user_id) to the existing query (same GROUP BY requested_model, no new grouping dimension) to track user concentration
Requires ≥20 distinct users in the baseline window before alerting — prevents abuse actors from triggering drops
Adds 10s statement timeout with fail-open semantics (query timeout ≠ model outage) to protect against holding connections on the ~469M row table
Enriches response with uniqueUsersCurrent/uniqueUsersBaseline for investigator visibility
Changes error handler to fail open (return healthy: true) since DB errors aren't evidence of model outages
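
As a rough illustration of the distinct-user gate listed above, the check might look like the sketch below. Only MIN_UNIQUE_USERS_FOR_ALERT, uniqueUsersCurrent and uniqueUsersBaseline come from this PR; the ModelWindowStats shape, the request-count field names, the isSignificantDrop function and the 90% drop threshold are assumptions made for the example, not the actual route.ts code.

```typescript
// Value taken from the PR description.
const MIN_UNIQUE_USERS_FOR_ALERT = 20;
// Assumed threshold: alert when traffic falls by more than 90% versus baseline.
const DROP_THRESHOLD = 0.9;

// Hypothetical per-model row shape returned by the health query.
interface ModelWindowStats {
  requestsCurrent: number;
  requestsBaseline: number;
  uniqueUsersCurrent: number;
  uniqueUsersBaseline: number;
}

function isSignificantDrop(stats: ModelWindowStats): boolean {
  // No baseline traffic means there is nothing to compare against.
  if (stats.requestsBaseline <= 0) return false;
  // Too few distinct users in the baseline: treat it as non-organic and do not alert.
  if (stats.uniqueUsersBaseline < MIN_UNIQUE_USERS_FOR_ALERT) return false;
  const drop = 1 - stats.requestsCurrent / stats.requestsBaseline;
  return drop > DROP_THRESHOLD;
}
```

Note that the gate only counts accounts, so a baseline inflated by one actor using more than 20 accounts still passes it.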

What this doesn't do (intentionally):

No per-user GROUP BY capping — too expensive on the largest table without a supporting index
No IP/user-agent filtering — requires joining microdollar_usage_metadata + normalized tables
No new indexes — current ~670ms query time is acceptable; will revisit if COUNT(DISTINCT) causes regression

Verification

pnpm typecheck — no new errors (pre-existing linkify-it error unrelated)
After deploy: verify gemini-3.1-pro-preview is back in the response at /api/models/up with uniqueUsersCurrent/uniqueUsersBaseline fields populated
Monitor queryExecutionTimeMs — target <2s, budget <5s

Visual Changes

N/A

Reviewer Notes

The COUNT(DISTINCT kilo_user_id) adds per-group hash sets, but the number of groups is small (~10 monitored models) so this should be bounded
The fail-open error handler is a deliberate change from the current fail-closed (healthy: false, 503) — rationale is that a DB timeout should not page on-call for model health
Threshold of 20 unique users (not 5) because the incident actor had ~280 accounts — need a threshold that represents genuinely broad organic usage
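
To make the fail-open note above concrete, here is a minimal sketch of how a query outcome could be mapped to a response. The QueryOutcome and HealthResponse types, the toHealthResponse function, the reason field, and the status codes for the fail-open and drop cases are assumptions for illustration; the PR itself only states that the error path changes from `{ healthy: false }` with status 503 to `healthy: true`.

```typescript
// Hypothetical result of running the health query.
type QueryOutcome =
  | { kind: "ok"; hasSignificantDrop: boolean }
  | { kind: "error"; error: unknown };

// Hypothetical, framework-free stand-in for the HTTP response.
interface HealthResponse {
  status: number;
  body: { healthy: boolean; reason?: string };
}

function toHealthResponse(outcome: QueryOutcome): HealthResponse {
  if (outcome.kind === "error") {
    // Fail open: a query timeout or DB error is not evidence of a model being down.
    // Status 200 and the "reason" field are assumptions, not taken from the PR.
    return { status: 200, body: { healthy: true, reason: "query_failed" } };
  }
  // Assumed behaviour: only a measured drop marks the endpoint unhealthy.
  return outcome.hasSignificantDrop
    ? { status: 503, body: { healthy: false } }
    : { status: 200, body: { healthy: true } };
}
```

Keeping the mapping in a pure function like this would make the fail-open decision easy to unit test separately from the database call.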

Re-enable gemini-3.1-pro-preview monitoring and add protections:
- Add COUNT(DISTINCT kilo_user_id) to detect concentrated traffic
- Require MIN_UNIQUE_USERS_FOR_ALERT (20) distinct users in baseline before alerting on drops (prevents abuse actors from inflating baselines with many accounts)
- Add 10s statement timeout (fail-open) to prevent query from holding a connection on the largest table (~469M rows)
- Enrich response with uniqueUsersCurrent/uniqueUsersBaseline for investigator visibility
- Fail open on errors (timeout/DB) since those aren't evidence of model outages

Incident: 2026-04-13-gemini-3.1-pro-preview-monitor-abuse

markijbema · 2026-04-14T08:09:11Z

apps/web/src/app/api/models/up/route.ts

    });

-    return NextResponse.json({ healthy: false }, { status: 503 });
+    // Fail open: a query timeout or DB error is not evidence of a model being down.


The query timeout (10s) is currently about 15 times the actual query time (~670ms).

kilo-code-bot · 2026-04-14T08:10:09Z

apps/web/src/app/api/models/up/route.ts

-        hasSignificantDrop = true;
+      // - Only alert if the baseline had enough distinct users to represent organic traffic
+
+      if (uniqueUsersBaseline >= MIN_UNIQUE_USERS_FOR_ALERT) {


WARNING: Distinct-user threshold does not block the incident pattern

The incident in the PR description already came from a single actor spread across ~280 accounts. Because unique_users_baseline is counting distinct kilo_user_ids over the whole baseline window, that actor still clears the >= 20 gate and the same pause will continue to look like an organic drop. To protect against the described false positive, this check needs a signal the actor cannot inflate by creating accounts (for example IP / JA4 / user-agent concentration, or a per-identity cap).

I think we need to look at this specific example as well, but this is probably a good step for now.

I think it is a good step. As a follow-up, it is worth exploring a concentration-based signal. As you mentioned, the IP join is expensive, but combining IP with user counts would make the baseline less prone to these attacks, since a user-level signal alone would not have caught this one (280 accounts).
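
One possible shape for the concentration-based signal suggested above is the share of requests coming from the busiest few keys (for example IPs). The sketch below is hypothetical: topShare, its parameters and the sample counts are invented for illustration and are not part of this PR.

```typescript
// Fraction of all requests that come from the `topN` busiest keys (e.g. IPs).
// `requestCounts` holds one request count per key; the key itself is not needed.
function topShare(requestCounts: number[], topN: number): number {
  const total = requestCounts.reduce((sum, n) => sum + n, 0);
  if (total <= 0 || topN <= 0) return 0;
  const top = [...requestCounts]
    .sort((a, b) => b - a) // descending, on a copy so the input is untouched
    .slice(0, topN)
    .reduce((sum, n) => sum + n, 0);
  return top / total;
}

// Illustrative counts only: four heavy keys and a long tail of light ones.
const sampleCounts = [400, 200, 150, 65, 50, 45, 40, 30, 20];
const share = topShare(sampleCounts, 4); // 815 of 1000 requests
```

A baseline whose top-4 share is very high could then be excluded from drop alerts regardless of how many accounts it spans, at the cost of the more expensive join mentioned in the PR description.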

kilo-code-bot · 2026-04-14T08:10:40Z

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	1
SUGGESTION	0

Issue Details

WARNING

File	Line	Issue
`apps/web/src/app/api/models/up/route.ts`	133	Distinct-user gating still allows a single actor controlling many accounts to inflate the baseline and trigger the same false alert.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File	Line	Issue

Files Reviewed (1 file)

apps/web/src/app/api/models/up/route.ts - 1 issue

Reviewed by gpt-5.4-20260305 · 224,898 tokens

markijbema requested a review from RSO April 14, 2026 08:07

markijbema marked this pull request as ready for review April 14, 2026 08:07

markijbema commented Apr 14, 2026

kilo-code-bot bot reviewed Apr 14, 2026

chrarnoldus approved these changes Apr 14, 2026

markijbema merged commit d8017e6 into main Apr 14, 2026
15 checks passed

markijbema deleted the mark/harden-models-up-health-check branch April 14, 2026 08:59
