Skip to content

feat: DuckDB-only paired-histogram backend (DRC-3390 Stage B)#1399

Open
danyelf wants to merge 8 commits into
mainfrom
feature/drc-3390-stage-b-backend-duckdb
Open

feat: DuckDB-only paired-histogram backend (DRC-3390 Stage B)#1399
danyelf wants to merge 8 commits into
mainfrom
feature/drc-3390-stage-b-backend-duckdb

Conversation

@danyelf
Copy link
Copy Markdown
Contributor

@danyelf danyelf commented May 26, 2026

Stage B — DuckDB-only paired-distribution backend + API contract

First runnable cut of the DRC-3390 paired-histograms work: the backend computes paired column distributions on DuckDB and returns a clean {status: "unsupported"} envelope for every other adapter. No UI yet — this PR is the backend plus the TypeScript API contract the rest of the stack builds on.

Stages: A = Storybook cells (#1398) · B = this PR · C = wire the UI to this contract · D = polyglot adapter coverage. Linear: DRC-3390.

What's in it

1. --inline-profile server flag

New flag (RECCE_INLINE_PROFILE), plumbed through RecceServerFlags.inline_profile. It gates the whole feature; off by default, so nothing changes for existing users.

2. Adapter capability layer + the profile task

Two new modules decide whether and how an adapter can compute approximate aggregates:

  • capabilities.py — a one-row table: duckdb → full tier, every other adapter → disabled. The strategy router short-circuits non-DuckDB to the unsupported envelope instead of emitting SQL the engine can't parse.
  • approx_aggregates.py — DuckDB renderers for approx_count_distinct / approx_quantile / approx_top_k; non-DuckDB raises UnsupportedAggregateError.

tasks/profile_distribution.py orchestrates the pipeline (HLL probe → percentile histograms → top-K), registered as RunType.PROFILE_DISTRIBUTION. Stage D adds the other adapters back to the table.

Per-env histogram edges (the key design call). Each env's histogram is built from its own quantiles — base_bin_edges + current_bin_edges, paired with base_density / current_density (density = Δfraction / span). The two edge arrays deliberately don't line up; Stage C overlays them on a shared axis. We only get approx percentiles back from the warehouse (not bin counts), so per-env edges are the only assumption-free rendering — and they're what lets base and current actually diverge. This snake_case wire shape maps to the camelCase baseBinEdges / currentBinEdges that #1398 consumes.

3. Run-type registry

profile_distribution is dispatchable but has no registry entry (no checklist title / icon / result view — its results feed schema-view cells). findByRunType now returns RegistryEntry | undefined, and the two dynamic callsites null-check. The registry stays the single source of truth for what's registered — no parallel run-type list.

Verifying it

Automated (runs in CI):

pytest tests/tasks/test_profile_distribution.py \
       tests/adapter/test_capabilities.py \
       tests/adapter/test_approx_aggregates.py
# 121 passing

Live — there's no UI yet, so drive the run API directly against a DuckDB project:

cd /path/to/duckdb-dbt-project
recce server      # API on :8000  (--inline-profile only gates the UI, not this path)

curl -s -X POST http://localhost:8000/api/runs \
  -H 'content-type: application/json' \
  -d '{"type":"profile_distribution","params":{"model":"orders"}}' | jq '.result'

Run against jaffle_shop_duckdb with a real base→current change (base = HEAD; current = a working-tree edit that drops returned orders and renames status), the per-env divergence comes through:

{
  "status": "ok", "strategy": "approx_all",
  "base_total": 999, "current_total": 903,          // returned orders dropped by current
  "columns": {
    "amount": {
      "kind": "histogram",
      "base_bin_edges":    [1, 7, 12.6, /**/ 97],
      "current_bin_edges": [100, 703, 1276, /**/ 9704],   // own quantiles per env — don't line up
      "base_density": [/* 11 */], "current_density": [/* 11, differs from base */]
    },
    "status": {
      "kind": "topk",                                // union of both envs' values
      "values": ["completed", "returned", /**/ "ORDER-completed", "ORDER-placed", /**/],
      "base_counts": null, "current_counts": null, "trimmed": false
    }
  }
}

Point the same project at a non-DuckDB target and the identical curl returns the envelope, not a crash:

{ "status": "unsupported", "reason": "Adapter 'snowflake' does not support APPROX_PERCENTILE.", "columns": {} }

Out of scope / follow-ups

  • UI / schema-view integration → Stage C
  • Polyglot SQL renderers + multi-row capability table → Stage D
  • Per-column totals on the topK payload (Stage D, not needed here) — the discrete cell's counts mode wants base_total / current_total as a column-wide denominator for proportions. DuckDB's approx_top_k returns no counts, so Stage B always renders in ranks mode, which doesn't use totals. The field only becomes necessary once count-returning adapters land in Stage D. (Raised in the feat: paired histogram cells in Storybook (DRC-3390 Stage A) #1398 review; recorded here so it isn't lost.)
  • Heavily-tied numeric columns render the tied point-mass as density 0 (no spike yet) — documented, follow-up later

…390 Stage B)

Stage B of the DRC-3390 paired-histograms restructure (see
``/Users/danyel/.claude-recce/plans/honestly-when-i-read-nested-candle.md``).
First end-to-end runnable cut: backend serves paired-histogram payloads on
DuckDB; everything else returns ``{status: "unsupported"}``. Stage D will
restore polyglot adapter coverage; Stage C wires the UI to the contract
this PR freezes.

Backend (Python)
- ``recce/adapter/capabilities.py`` — capability table reduced to one row
  (DuckDB = full tier). ``detect_capabilities()`` returns the disabled tier
  for every other adapter; the strategy router short-circuits with the
  unsupported envelope on those. Stage D restores the polyglot tiers.
- ``recce/adapter/approx_aggregates.py`` — ``render_approx_count_distinct``,
  ``render_approx_top_k``, ``render_approx_percentile`` keep only the
  DuckDB branch. Non-DuckDB adapter types raise
  ``UnsupportedAggregateError``. Deletes the Snowflake / BigQuery /
  Databricks / Spark / Athena / Trino / Presto / ClickHouse / Redshift
  renderer branches (~85 LOC) — Stage D restores them.
- ``recce/tasks/profile_distribution.py`` — full orchestration from PR 2,
  trimmed:
  * ``pick_strategy()`` returns ``approx_all`` for DuckDB, ``unsupported``
    otherwise. Deletes the ``percentile_only`` branch.
  * ``parse_topk_result()`` keeps only the DuckDB flat-list branch; raises
    ``ValueError`` for other adapter types. Deletes the Snowflake-pairs /
    BigQuery-struct / Databricks-struct / Trino-map branches (~70 LOC).
  * ``render_epoch_cast()`` keeps only DuckDB's ``epoch(col)``; raises for
    others. Deletes Snowflake / BigQuery / Databricks / Spark / Trino /
    Athena / Presto / ClickHouse / Redshift epoch conversions (~12 LOC).
  * ``_coerce_percentile_array()`` collapses to the flat-list path
    (BigQuery 101-element slice deleted).
  * DRC-3504 (timestamp epoch cast) and DRC-3507 (DuckDB rollback on
    per-column failure) fixes preserved verbatim.
  * In-memory memoization preserved verbatim.
  * Telemetry preserved; docstring corrected from "Emit a PostHog ..."
    to "Emit an Amplitude ..." (matches the ``log_performance`` →
    Amplitude HTTP route at recce/event/collector.py:19).
- ``--inline-profile`` CLI flag + ``RecceServerFlags.inline_profile``
  plumbing.
- ``RunType.PROFILE_DISTRIBUTION`` enum + ``dbt_supported_registry``
  registration so the task is wired through.

Tests (Python)
- ``tests/adapter/test_capabilities.py`` — DuckDB at full tier, everything
  else disabled. Drops the per-tier parametrization for Snowflake /
  BigQuery / etc. (returns in Stage D).
- ``tests/adapter/test_approx_aggregates.py`` — DuckDB-positive SQL
  snapshots + parametrized "every other adapter raises
  ``UnsupportedAggregateError``" coverage.
- ``tests/tasks/test_profile_distribution.py`` — DuckDB-targeted (continuous
  / low-card / high-card / UUID-cap-degenerate / timestamp DRC-3504
  regression / deliberately-bad-column DRC-3507 regression / memoization
  hit-miss / unsupported tier short-circuit / strategy router). Drops the
  multi-adapter strategy-router and per-adapter top-K shape tests (back
  in Stage D).

API contract (TypeScript)
- ``js/packages/ui/src/api/types/run.ts``:
  * Add ``"profile_distribution"`` to ``RunType``.
  * ``ProfileDistributionParams`` (``model`` + optional ``columns``).
  * ``ProfileDistributionResult`` discriminated union — all four variant
    cases expressed (``histogram`` / ``topk`` / ``null`` per-column failure
    / ``unsupported`` envelope) — Stage C narrows on ``kind`` and
    ``status``.
  * ``Run`` arm + ``isProfileDistributionRun`` guard.
  * Promote ``runTypeHasRef`` to a real type guard
    (``runType is RunTypeWithRef``).
  * Add ``RunTypeWithRef`` type.
- ``js/packages/ui/src/components/run/registry.ts``:
  * ``RegisteredRunType``, ``REGISTERED_RUN_TYPES``,
    ``isRegisteredRunType`` guard — backend-only run types like
    ``profile_distribution`` are deliberately out of the checklist
    registry.
  * ``findByRunType`` and ``createBoundFindByRunType`` narrow to
    ``RegisteredRunType``.
- ``js/packages/ui/src/components/run/types.ts``: ``RunTypeRegistry``
  keys on ``RegisteredRunType`` (was ``RunType``).
- ``js/packages/ui/src/api/flag.ts``: ``inline_profile`` boolean.
- Two callsites that pass arbitrary ``RunType`` get the
  ``isRegisteredRunType`` guard added: ``CheckDetailOss.tsx``,
  ``RecceActionAdapter.tsx``.

Cherry-picked test fixes (tied to RegisteredRunType, ride with this PR)
- ``js/src/components/run/__tests__/registry.test.ts`` — narrow
  ``RunType[]`` → ``RegisteredRunType[]``.
- ``js/src/lib/hooks/__tests__/RecceActionAdapter.test.tsx`` — stub
  ``isRegisteredRunType`` in the module mock so the adapter doesn't bail
  before calling submitRun.

Verification
- ``black --check ./recce ./tests`` + ``isort --check-only ./recce ./tests``
  clean.
- ``pytest tests/tasks/test_profile_distribution.py
  tests/adapter/test_capabilities.py tests/adapter/test_approx_aggregates.py``
  — 116 / 116 pass.
- Full ``pytest tests/`` (excluding 6 pre-existing SPA-route tests that
  require a built ``js/out/`` bundle not present in fresh worktrees) —
  1318 / 1318 pass.
- DuckDB smoke against ``main.orders`` in
  ``/Users/danyel/code/Recce/jaffle_shop_duckdb/jaffle_shop.duckdb``:
  continuous (amount) returns ``bin_edges[12]`` + ``base_density[11]`` +
  ``current_density[11]`` in 5.2ms; categorical (status) returns 5 values
  flat-list, counts=None; datetime (order_date) renders via
  ``epoch(col)`` cast.
- Snowflake smoke against ``snowflake_dev`` target (jaffle_shop_duckdb
  project): returns ``{status: "unsupported", reason: "Adapter 'snowflake'
  does not support APPROX_PERCENTILE.", columns: {}}`` end-to-end through
  ``ProfileDistributionTask.execute()``. This is the path the colleague
  will exercise to verify Stage B's non-DuckDB short-circuit.
- ``pnpm --filter @datarecce/ui type:check`` introduces zero new TS errors
  vs main (baseline CSS-modules / useTrackLineageRender errors preserved
  per recce-python-tooling memo).

Scope deletion summary
- ~85 LOC of non-DuckDB SQL renderers removed from approx_aggregates.py
- ~70 LOC of non-DuckDB top-K post-processor branches removed from
  profile_distribution.py
- ~12 LOC of non-DuckDB epoch casts removed from render_epoch_cast
- Capability table reduced from 14 rows to 1
- ~12 per-adapter renderer snapshot tests deleted
- ~5 multi-adapter strategy-router / top-K-shape tests deleted

Combined Stage B (1895 LOC across 6 files) vs PR 1 + PR 2 source
(2159 LOC across the same 6 files): 264 LOC trim, with most of the
remaining size in the unchanged orchestration / memoization / payload
builders (which Stage D will extend, not rewrite).

Closes #1389 and #1390 (in spirit — those PRs were closed as part of
the post-Ultraplan restructure; their branches stay as code-stash for
Stage D).

Refs DRC-3390.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 87.67471% with 97 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
recce/tasks/profile_distribution.py 78.47% 96 Missing ⚠️
recce/cli.py 66.66% 1 Missing ⚠️
Files with missing lines Coverage Δ
recce/adapter/approx_aggregates.py 100.00% <100.00%> (ø)
recce/adapter/capabilities.py 100.00% <100.00%> (ø)
recce/adapter/dbt_adapter/__init__.py 79.11% <ø> (+0.22%) ⬆️
recce/models/types.py 98.25% <100.00%> (+0.01%) ⬆️
recce/tasks/__init__.py 100.00% <100.00%> (ø)
tests/adapter/test_approx_aggregates.py 100.00% <100.00%> (ø)
tests/adapter/test_capabilities.py 100.00% <100.00%> (ø)
tests/tasks/test_profile_distribution.py 100.00% <100.00%> (ø)
recce/cli.py 68.88% <66.66%> (-0.01%) ⬇️
recce/tasks/profile_distribution.py 78.47% <78.47%> (ø)

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

danyelf and others added 4 commits May 28, 2026 14:23
…e-b-backend-duckdb

Signed-off-by: Danyel Fisher <danyel@gmail.com>

# Conflicts:
#	recce/cli.py
Shared bin edges collapsed base_density and current_density to identical
arrays — averaging the two envs' quantiles onto one grid threw away the
divergence the paired histogram exists to show, and broke the equal-area
property the renderer (PR #1398) depends on.

With only approx percentiles from the warehouse (no bin counts), the
honest rendering is each env on its OWN quantile edges:
density = (1/NUM_BINS)/span. Base and current edges deliberately do not
line up; Stage C overlays the two staircases on a shared value axis.

- _build_histogram_payload now emits base_bin_edges + current_bin_edges
  via new _env_edges_and_density helper (fraction-delta / width).
- API contract: ProfileDistributionHistogramPayload carries per-env edges.
- Drop now-unused _safe_min/_safe_max.
- Tests: assert base != current for divergent inputs, equal-area per bin,
  empty-without-bounds, and the parked point-mass (tie) behaviour.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
… miss

The earlier cut introduced RegisteredRunType + REGISTERED_RUN_TYPES + an
isRegisteredRunType guard solely because findByRunType was typed to always
return an entry — so a run type without a registry entry (the backend-only
profile_distribution) had to be ruled out before every call. That created a
second run-type list that could drift from the registry.

Collapse it: a registry lookup can simply miss. findByRunType is now an
overload — literal registered keys keep their precise entry type; a dynamic
RunType returns RegistryEntry | undefined. The two dynamic callsites
(CheckDetailOss, RecceActionAdapter) null-check the result instead of guarding.

Removes RegisteredRunType, REGISTERED_RUN_TYPES, isRegisteredRunType and their
exports. profile_distribution stays a wire-level RunType with no registry
entry. One run-type list again; the registry is the single source of truth for
what's registered.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
Added speculatively for Stage C but has zero consumers (every sibling guard
is used). Stage C can reintroduce it when it actually narrows on the type.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@danyelf danyelf marked this pull request as ready for review May 29, 2026 07:13
@even-wei
Copy link
Copy Markdown
Contributor

Code Review: PR #1399

SHA ec43d9e10 · Verdict NO-GO · Early review (Stages C & D to follow)

Blockers

None.

Issues

  1. recce/tasks/profile_distribution.py:579 — the empty-columns "ok" path omits base_total/current_total, which the frozen TS contract declares as required.
    Evidence: line 579 returns {"status": "ok", "strategy": strategy, "columns": {}}; the main path (lines 680-684) returns those two fields, and ProfileDistributionOkResult in js/packages/ui/src/api/types/run.ts:261-262 declares base_total: number; current_total: number (non-optional). Reachable via a columns=[…] filter that matches nothing, or a model whose columns are all skipped types. Stage C reading result.base_total gets undefined. No test covers this branch (tests/tasks/test_profile_distribution.py).
    Pass C / F (+ E).

Notes

  1. recce/tasks/profile_distribution.py:411 — datetime histogram edges are emitted as epoch seconds (the DRC-3504 epoch() cast), but ProfileDistributionHistogramPayload carries no marker separating numeric from datetime. Stage C must use out-of-band column type to format epoch values back to dates; worth recording in the frozen contract before C builds on it.
    Pass C.
  2. recce/tasks/profile_distribution.py:676 — cap-degenerate columns emit base_counts: [] / current_counts: [], while the normal DuckDB categorical path emits null (line 661 → parse_topk_result returns None). Same adapter, two "no counts" encodings; Stage C must treat [] and null identically.
    Pass F.
  3. recce/tasks/profile_distribution.py:551-561 — a model whose get_columns raises in both envs is swallowed to [] and returns status: "ok", columns: {} rather than surfacing an error. A nonexistent/broken model is indistinguishable from one with no profileable columns.
    Pass D.

What I verified as sound: the findByRunType overload + runTypeHasRef type-guard design (narrows run.type to RunTypeWithRef, a subset of registry keys, so RunResultPaneOss.tsx:147 / useRun.tsx:98 .RunResultView access stays non-undefined; the other dynamic callsites null-check). Histogram equal-area math, per-env edge divergence, tie→zero handling, DRC-3504 epoch cast, and the DRC-3507 rollback isolation are all covered by tests. The unsupported short-circuit, capability table, and SQL renderers (quantiles/k are constants, identifiers quoted via adapter.quote) carry no injection surface. All CI checks green on this SHA (Python 3.11–3.13, flake8, lint, type/build, CodeQL, codecov patch+project).

…unify no-counts encoding

Address PR #1399 review (even-wei):

- Issue 1: the empty-columns early return omitted base_total/current_total,
  which the frozen TS contract (ProfileDistributionOkResult) declares as
  required. Emit 0/0 (probe hasn't run) so Stage C never reads undefined.
  Add a regression test for the columns-filter-matches-nothing branch.
- Note 2: cap-degenerate columns now emit base_counts/current_counts as null
  (was []), matching the normal categorical no-counts encoding. One 'no
  counts' shape per adapter; empty values flags degeneracy.
- Note 1: document the datetime epoch-seconds edge encoding in the TS
  histogram payload contract so Stage C formats edges via out-of-band type.

Note 3 (get_columns error taxonomy) deferred to Stage D per review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@danyelf
Copy link
Copy Markdown
Contributor Author

danyelf commented May 29, 2026

Thanks for the early review @even-wei — addressed in 4009362e.

Stepping back, before I let the 🤖 say its bit -- looks like all the stuff was edge cases (columns with no items, empty columns). Indeed, I'm pretty sure that our main Histogram code crashes on datetime columns (or returns nothing); datetimes are annoying and need cleanup. Thanks for catching them!

--

Summary:

Issue 1 (:579, empty-columns "ok" path) — fixed. The early return now emits base_total: 0 / current_total: 0. The probe phase hasn't run at that point so the totals are genuinely unknown; 0 is the honest value and there are no columns to render against a denominator anyway. Added test_empty_columns_ok_payload_has_totals covering the columns=[…]-matches-nothing branch (asserts both totals present == 0).

Note 2 (:676, [] vs null counts) — fixed. Cap-degenerate columns now emit base_counts: null / current_counts: null, matching the normal categorical no-counts path. One "no counts" encoding per adapter; the empty values list is what flags degeneracy. Updated test_uuid_cap_degenerate_column accordingly.

Note 1 (:411, datetime epoch marker) — recorded in the contract. Added a doc block to ProfileDistributionHistogramPayload in run.ts spelling out that edges are always numeric but datetime columns carry epoch seconds (the DRC-3504 epoch() cast), with no in-payload marker — so Stage C must format edges using the column's out-of-band dbt type. No behavior change; this just freezes the assumption before C builds on it.

Note 3 (:551-561, get_columns raises → swallowed to []) — deferred to Stage D. Distinguishing a broken/nonexistent model from one with no profileable columns means deciding on an error envelope, which is Stage D's error-taxonomy scope. Left as-is intentionally; flagging here so it isn't lost.

All tests/tasks/test_profile_distribution.py pass (54); flake8 / black / isort / pnpm type:check clean.

@danyelf danyelf self-assigned this May 29, 2026
@danyelf danyelf requested a review from even-wei May 29, 2026 16:44
…DRC-3390 Stage B)

DuckDB's approx_top_k returns values with no counts and no per-env
membership, so the categorical paired-histogram cell (which needs real
counts OR real per-env ranks) had nothing to render. The cells were
designed for a ranks payload (base_ranks/current_ranks/k) that lived only
in the Storybook fixtures and never made it into the wire contract — the
A<->B drift surfaced at Stage C.

Promote the ranks variant into run.ts as the single source of truth:

- run.ts: add ProfileDistributionTopKRanksPayload (kind "topk", mode
  "ranks", base_ranks/current_ranks/k); counts payload gains an explicit
  mode "counts" discriminant and its counts arrays are no longer wholly
  nullable. Re-export the ProfileDistribution* types from the api barrel.
- profile_distribution.py: _build_topk_payload dispatches to a new
  _build_topk_ranks_payload whenever both envs lack counts (the DuckDB
  hot path); cap-degenerate columns emit the same ranks shape with empty
  arrays. Nothing emits wholly-null counts anymore.
- tests: assert the ranks shape on the low-cardinality + cap-degenerate
  paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
@even-wei
Copy link
Copy Markdown
Contributor

Code Review: PR #1399

SHA 6f6e48e5 · Verdict NO-GO

Verified locally: Python 122 passed; frontend pnpm type:check exit 0; full frontend suite 3850 passed / 5 skipped; flake8 + black --check clean. The registry overload refactor (findByRunTypeRegistryEntry | undefined, runTypeHasRef as a RunTypeWithRef type guard) is sound — every dynamic call site is covered by either the type-guard narrowing or a null-check, and the registry test correctly switched from iterating RUN_TYPES to Object.keys(defaultRunTypeConfig). SQL-injection surface is closed (params resolve against the manifest / used as filters; identifiers quoted via adapter.quote(); aliases sanitized; quantiles/k range-validated).

Blockers

None.

Issues

  1. recce/tasks/profile_distribution.py:448 — Base-first [:k] union truncation drops all current-only values when base already fills k slots, so high-cardinality categorical columns whose top-K diverges between base and current render current's new top values as entirely absent — defeating the base-vs-current comparison the cell exists to show.
    Evidence: union = list(seen.keys())[:k] with base inserted first (lines 441-448); each env's top-K is SQL-capped at k=12 (approx_top_k(col, 12)). If len(base_values) >= 12, union = all 12 base values and every current_ranks entry is None. This contradicts the docstring's "values present only in current's top-K appended on the right." No test catches it: every create_model(...) in the suite passes csv, csv (identical base==current), and _build_topk_ranks_payload is never unit-imported — the divergence path that the histogram builder got a dedicated regression test for (test_histogram_payload_base_and_current_diverge) has zero coverage for top-K.
    Pass A / E / F.

Notes

  1. js/packages/ui/src/api/types/run.ts:191 (+ recce/tasks/profile_distribution.py:408) — The "frozen" histogram contract docs state edges "each hold 12 edges" / "NUM_BINS + 1 values," but _env_edges_and_density emits min + present-quantiles + max after sort, so tied/degenerate columns yield fewer (the type is correctly number[], and len(density) == len(edges) - 1 does hold). A Stage C consumer that indexes a fixed edges[11] on the doc's promise would be wrong. Pass C/F.
  2. recce/tasks/profile_distribution.py:604-611execute() swallows all exceptions from both get_columns() calls into [], so a connection/permission/infra failure returns the same {status: "ok", columns: {}} as a model with no profilable columns. Off-by-default behind --inline-profile and feeds a non-critical schema cell, but it leans false-quiet for infra errors (get_columns itself raises a descriptive RecceException that is discarded here). Pass D.
  3. recce/tasks/profile_distribution.py:489-547 — Stage-D-only latent contract drift: the counts-mode _build_topk_payload can set base_counts/current_counts to a whole-side None, but the TS ProfileDistributionTopKPayload types them as (number | null)[] and its doc claims they are "never wholly null"; it also unions current-first vs the ranks path's base-first. Dead in Stage B (DuckDB always → ranks); flagging so it isn't lost when count-returning adapters land in Stage D. Pass C/F.

Copy link
Copy Markdown
Contributor

@even-wei even-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review: NO-GO. 1 issue (top-K base-vs-current divergence dropped on truncation; untested) + 3 notes. Full review in the PR comment above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants