## Problem

The SQL content detector in `src/classify.ts` triggers false-positive T0 classifications on pure English prose that happens to contain words shared with SQL keywords. This causes messages that should be compressible (T3) to be preserved verbatim, resulting in 0% compression on prose-heavy conversations.
## Root causes

- `CASCADE` is a strong anchor — a single occurrence of the English word "cascade" (e.g. "could cascade latency across the entire checkout flow") is enough to trigger `sql_content`, since strong anchors bypass the 3+ keyword threshold.
- `DECLARE` is a weak anchor — combined with other common English words that match SQL keywords (`SCHEMA`, `INTO`, `FROM`), it easily reaches the 3+ distinct keyword threshold. "Consumers declare which versions they understand" is normal technical prose.
- `VALUES` and `OFFSET` are weak anchors that appear naturally in technical writing ("applying absolute values instead of deltas", "advances the consumer offset").
- `FROM` and `SCHEMA` are non-anchor keywords in `SQL_ALL_RE` that inflate the distinct keyword count, making it easier for a single weak anchor to trigger the rule.
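Based on the behavior described above, the triggering logic can be sketched roughly as follows (a hypothetical reconstruction — the actual constants and function names in `src/classify.ts` may differ):

```typescript
// Hypothetical reconstruction of the rule described above; the real
// keyword lists and thresholds in src/classify.ts may differ.
const STRONG_ANCHORS = ["CASCADE", "VARCHAR", "PRIMARY KEY"];
const WEAK_ANCHORS = ["DECLARE", "VALUES", "OFFSET"];
const OTHER_KEYWORDS = ["FROM", "SCHEMA", "INTO", "SELECT", "WHERE"];

function looksLikeSql(text: string): boolean {
  const upper = text.toUpperCase();
  const present = (words: string[]) =>
    words.filter((w) => new RegExp(`\\b${w}\\b`).test(upper));

  // A single strong anchor bypasses the keyword threshold entirely,
  // so one English "cascade" is enough to classify prose as SQL.
  if (present(STRONG_ANCHORS).length > 0) return true;

  // Otherwise: one weak anchor plus 3+ distinct keywords triggers.
  const weak = present(WEAK_ANCHORS).length;
  const distinct = weak + present(OTHER_KEYWORDS).length;
  return weak >= 1 && distinct >= 3;
}
```

Under this model, any prose containing the single word "cascade" short-circuits to a SQL classification, which is exactly the failure mode reported here.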
## Impact

The `technicalExplanation` benchmark scenario (pure architectural prose, zero code/SQL/structured data) gets classified as T0 across most messages, producing 1.00x compression when it should be one of the best candidates for deterministic summarization.
## Reproduction

Compress the following conversation with `{ recencyWindow: 4, preserve: ['system'] }`. Expected: ratio > 1.0. Actual: ratio = 1.00, 0 messages compressed.
system: You are a principal engineer explaining system design to a new team lead.
user: Can you walk me through how our event-driven architecture works and why we chose it over a traditional request-response model?
assistant: The event-driven architecture we adopted grew out of a series of scaling problems we encountered roughly two years ago when the monolith started buckling under peak traffic. The core idea is that services communicate through immutable events published to a central log rather than making synchronous calls to each other. When the order service completes a checkout, it publishes an OrderPlaced event. The inventory service, the notification service, and the analytics pipeline each consume that event independently and at their own pace. None of them need to know about the others, and none of them can block the order service from finishing its work.
The biggest advantage is temporal decoupling. Under the old model, a single slow downstream dependency could cascade latency across the entire checkout flow. A payment verification that normally took fifty milliseconds would occasionally spike to several seconds, and because every call was synchronous, the user sat waiting while six services completed in sequence. With events, the order service does its own work, publishes, and returns. Downstream processing happens asynchronously, so the user experience stays fast even when one consumer is struggling.
The trade-off is eventual consistency. In a synchronous model, you get immediate confirmation that every step completed. In an event-driven model, you have to design for the possibility that the inventory adjustment or the email notification has not happened yet when the user sees their confirmation page. We handle this through careful UX design and through idempotent consumers that can safely replay events if something goes wrong.
user: What about the event schema evolution problem? How do we handle changes to event formats without breaking all the consumers?
assistant: Schema evolution is one of the hardest problems in event-driven systems and we spent considerable time getting our approach right. We settled on a strategy that combines explicit versioning with backward-compatible evolution rules. Every event type carries a schema version number in its header. Consumers declare which versions they understand, and the platform routes accordingly.
For non-breaking changes like adding a new optional field, we increment the minor version. Consumers that do not understand the new field simply ignore it. This follows the robustness principle of being conservative in what you send and liberal in what you accept. The vast majority of our schema changes fall into this category, and they require zero coordination between teams.
For breaking changes like removing a field or changing a field type, we increment the major version and run both the old and new versions in parallel during a migration window. The producing service publishes to both versions, and consuming teams migrate on their own schedule within the agreed window. Once all consumers have migrated, we sunset the old version.
We also maintain a central schema registry that validates events at publish time. If a service tries to publish an event that does not conform to its declared schema, the publish fails fast with a clear error. This catches drift between the code and the schema before it reaches consumers.
user: How do we handle ordering guarantees? I noticed some of our consumers seem to process events out of order.
assistant: Ordering is subtle in distributed event systems and the right answer depends on what level of ordering your consumer actually needs. Our event platform guarantees ordering within a partition, but not across partitions. Events for the same aggregate, like all events for a specific order, go to the same partition based on the aggregate identifier, so they arrive in the order they were published. Events for different aggregates may arrive in any order because they live on different partitions processed by different consumer instances.
Most consumers only care about ordering within an aggregate and they get it for free from the partitioning scheme. The inventory service needs to see OrderPlaced before OrderCancelled for the same order, and it does because both events share the same order identifier as their partition key.
For the rare cases where a consumer needs cross-aggregate ordering, we use a sequence number stamped by the producer at publish time. The consumer tracks the last processed sequence number and detects gaps. When it encounters a gap, it buffers the event and waits briefly for the missing one to arrive. If the missing event does not appear within the timeout window, the consumer raises an alert and proceeds with compensating logic.
The consumers you noticed processing out of order are likely reading from a topic that mixes events from multiple aggregates. Check whether the partition key is set correctly on the producer side. A common mistake is using a random or missing partition key, which causes round-robin assignment and destroys any ordering guarantee.
user: What about failure handling? If a consumer crashes mid-processing, do we lose events?
assistant: No, the platform guarantees at-least-once delivery. When a consumer reads an event, it does not acknowledge the read until processing completes successfully. If the consumer crashes before acknowledging, the event remains available for redelivery. The next time the consumer starts up, or when another instance in the consumer group takes over the partition, it picks up from the last acknowledged position.
The flip side of at-least-once delivery is that consumers must be idempotent. They will occasionally see the same event twice, particularly after a crash or during a rebalance. We enforce idempotency through a combination of techniques. Some use a deduplication table that records processed event identifiers and skips any event it has seen before. Others design their state mutations to be naturally idempotent, for example using upserts instead of inserts or applying absolute values instead of deltas.
Dead letter queues catch events that fail processing repeatedly. After a configurable number of retries with exponential backoff, the platform moves the event to a dead letter topic and advances the consumer offset so it does not block subsequent events. The operations team monitors the dead letter queue and investigates failures.
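Independently of the library, the trigger words can be confirmed by scanning the offending sentences from the transcript above (keyword lists taken from the root-cause analysis, not from `src/classify.ts`):

```typescript
// Standalone check: which SQL keywords appear in the prose above.
const keywords = ["CASCADE", "DECLARE", "VALUES", "OFFSET", "FROM", "SCHEMA", "INTO"];

const sentences = [
  "a single slow downstream dependency could cascade latency across the entire checkout flow",
  "Consumers declare which versions they understand",
  "applying absolute values instead of deltas",
  "advances the consumer offset so it does not block subsequent events",
];

// For each sentence, collect the SQL keywords it accidentally contains.
const results = sentences.map((s) =>
  keywords.filter((k) => new RegExp(`\\b${k}\\b`, "i").test(s)),
);
results.forEach((hits, i) => console.log(hits.join(", "), "→", sentences[i]));
```

Each of these four sentences contributes at least one anchor or keyword hit, which is how pure prose accumulates enough matches to be tagged as SQL.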
## Suggested improvements

- Demote `CASCADE` from strong to weak anchor — it's a common English verb ("failures cascade", "cascade of errors"). Unlike `VARCHAR` or `PRIMARY KEY`, it appears regularly in non-SQL technical writing.
- Demote `DECLARE` from weak anchor to plain keyword, or remove it entirely — "declare" is standard English ("consumers declare their intent", "declare a variable"). Consider requiring it to appear near other strong SQL signals.
- Raise the weak-anchor threshold — currently 3+ distinct keywords with 1 weak anchor triggers detection. Consider requiring 4+ distinct keywords, or 2+ weak anchors, to reduce false positives on prose that naturally uses words like "from", "into", "schema".
- Add a prose-density counter-signal — if the text has high prose density (long sentences, low special-character ratio, no code patterns), discount the SQL score. The structural pattern detector already computes these metrics, but they don't feed back into the SQL decision.
- Consider a "SQL statement structure" heuristic — real SQL has characteristic structure (keywords at line starts, comma-separated column lists, semicolons). Requiring at least one line that looks like a SQL statement would dramatically reduce false positives on flowing prose.
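A minimal sketch of the last suggestion, assuming a simple line-shape regex (names and patterns are illustrative, not taken from `src/classify.ts`):

```typescript
// Sketch of the proposed statement-structure check: only treat text as
// SQL if at least one line actually looks like a SQL statement — a
// statement-starting keyword at the line start, followed by structural
// SQL (a FROM/INTO/TABLE clause or a terminating semicolon).
const STATEMENT_RE =
  /^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|ALTER|DROP|DECLARE)\b.*(;|\bFROM\b|\bINTO\b|\bTABLE\b)/im;

function hasSqlStatementShape(text: string): boolean {
  return STATEMENT_RE.test(text);
}
```

Gating `sql_content` on this check would let keyword hits like "cascade" or "declare" in mid-sentence prose score zero, while genuine statements such as `SELECT id, name FROM users;` still pass.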