diff --git a/.claude/skills/review-pr/SKILL.md b/.claude/skills/review-pr/SKILL.md
index 780b2b9d..9f0362a0 100644
--- a/.claude/skills/review-pr/SKILL.md
+++ b/.claude/skills/review-pr/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: review-pr
-description: Review a GitHub pull request against QuestDB coding standards
+description: Review a GitHub pull request against QuestDB coding standards. Performs an adversarial, blocking, mission-critical code review covering correctness, concurrency, performance, resource management, test coverage, test efficacy, test-code quality, and QuestDB conventions, then verifies every finding against source before reporting.
 argument-hint: [PR number or URL] [--level=0..3]
 allowed-tools: Bash(gh *), Read, Grep, Glob, Agent
 ---
@@ -36,10 +36,10 @@ The level controls how much of the review below actually runs. Lower levels keep
 
 | Level | What runs |
 |-------|-----------|
-| **0 (default)** | Steps 1, 2, 4. Skip Step 2.5. Skip Step 3 — no agent spawn; review the diff inline in the main loop, using Read/Grep on demand to resolve ambiguities. Skip Step 3b — verify each finding inline as you write it. Single-pass review covering correctness, NULL handling, tests, and QuestDB standards on the diff itself. |
-| **1** | Adds Step 2.5a (semantic delta only — skip 2.5b/2.5c/2.5d). In Step 3, launch only Agent 1 (correctness), Agent 5 (tests), and Agent 6 (code quality) in parallel. Skip all other agents. Skip Step 3b — verify findings inline as you draft the report. |
-| **2** | Full Step 2.5, but in 2.5b restrict the callsite inventory to `public`/`protected` symbols (skip package-private and `pub(crate)`). In Step 3, launch Agents 1-7, plus Agent 8 if `.rs` files are present. Skip Agent 9 (cross-context) and Agent 10 (adversarial fresh-context). Step 3b uses a single batched verification agent for all findings instead of one per finding. |
-| **3** | Every step below as written, all 10 agents, per-finding verification. The full mission-critical pass. |
+| **0 (default)** | Steps 1, 2, 4. Skip Step 2.5. Skip Step 3 — no agent spawn; review the diff inline in the main loop, using Read/Grep on demand to resolve ambiguities. Skip Step 3b — verify each finding inline as you write it. Single-pass review covering correctness, NULL handling, test coverage, and QuestDB standards on the diff itself. When the diff touches test code, also apply the test-efficacy and test-code-quality anti-pattern checks inline (vacuous assertions, reflection overuse, reinvented helpers, javadoc bloat). |
+| **1** | Adds Step 2.5a (semantic delta only — skip 2.5b/2.5c/2.5d) plus Step 2.5e when test code is present. In Step 3, launch Agent 1 (correctness), Agent 5 (test coverage), Agent 6 (code quality), and — when the diff touches test code — Agent 11 (test efficacy) and Agent 12 (test-code quality) in parallel. Skip all other agents. Skip Step 3b — verify findings inline as you draft the report. |
+| **2** | Full Step 2.5 (including 2.5e when test code is present), but in 2.5b restrict the callsite inventory to `public`/`protected` symbols (skip package-private and `pub(crate)`). In Step 3, launch Agents 1-7, plus Agent 8 if `.rs` files are present, plus Agents 11 and 12 when the diff touches test code. Skip Agent 9 (cross-context), Agent 10 (adversarial fresh-context), and Agent 13 (regression-test efficacy verification). Step 3b uses a single batched verification agent for all findings instead of one per finding. |
+| **3** | Every step below as written, all 13 agents, per-finding verification. The full mission-critical pass. |
 
 State the chosen level in one line at the start of the review so the user knows what they're getting (e.g., "Reviewing PR #1234 at level 2"). If the level was defaulted, mention that level 3 exists for full review.
 
@@ -119,6 +119,14 @@ End this step with an explicit list of "places this change is visible from but t
 
 The list groups the callsites from 2.5b by execution context: hot data paths, SQL compilation, async runtime, JNI boundary, replication, materialized views, parallel execution workers, etc. Every entry on this list must be reviewed in Step 3.
 
+### 2.5e Test surface & helper inventory
+
+Run this only when the PR adds or changes test code. It is the test-code counterpart to 2.5b and feeds Agents 11-13. Use real Grep/Glob searches — do not reason about helpers from memory.
+
+- **Existing-infrastructure inventory:** search the changed test files' package and module for base test classes, shared `@Before`/`@After`, helper methods, fixtures, and assertion utilities the new tests could reuse (Grep for `extends Abstract.*Test`, `class .*TestUtils`, `assertMemoryLeak`, `assertQuery`, `assertSql`, shared `protected` helpers in the base class). This list is the baseline Agent 12 uses to flag reinvented boilerplate — a "you stamped boilerplate instead of reusing helper X" finding requires X to appear in this inventory.
+- **Changed shared helpers as symbols:** if the PR changes a shared test base class, helper, or fixture, run the 2.5b callsite inventory for it too — a changed test base class can silently break every subclassing test.
+- **Exercised-symbol map:** for each new or changed test, list which production symbols from 2.5a it actually exercises, so Agents 11 and 13 can check efficacy and regression value.
+
 ## Step 3: Parallel review
 
 Every agent receives:
@@ -144,7 +152,7 @@ Launch the following agents in parallel.
 
 **Agent 4 — Resource management:** Leaks on all code paths (especially errors), try-with-resources, native memory, pool management. Walk every callsite from 2.5b that constructs, owns, or transfers ownership of changed types and verify cleanup on all paths.
 
-**Agent 5 — Test review & coverage:** Coverage gaps, error path tests, NULL tests, boundary conditions, regression tests, test quality, `assertMemoryLeak()` usage. Cross-reference 2.5d: every cross-context exposure should have a test that exercises the changed symbol from that context. Missing tests for cross-context callsites is a high-priority finding.
+**Agent 5 — Test coverage:** Coverage gaps, error path tests, NULL tests, boundary conditions, regression tests exist, `assertMemoryLeak()` usage. Cross-reference 2.5d: every cross-context exposure should have a test that exercises the changed symbol from that context. Missing tests for cross-context callsites is a high-priority finding. Test *efficacy* (whether those tests actually exercise the change and could fail) and test-*code* quality are handled by Agents 11-13 — here focus only on whether coverage exists for every new or changed path.
 
 **Agent 6 — Code quality & standards:** Code smell, member ordering, naming conventions, modern Java features, dead code, third-party dependencies.
 
@@ -181,6 +189,25 @@ The point of this agent is to surface bugs the structured agents cannot see beca
 
 Run this agent in parallel with agents 1-9. It is mandatory regardless of diff size.
 
+**Test-code agents (Agents 11-13) — run only when the diff adds or changes test code.** Launch them in the same parallel batch as agents 1-10. Each receives the diff, the change surface map, and the test surface inventory from 2.5e. They are the test-code counterparts to the production agents: Agent 11 mirrors Agent 1 (correctness), Agent 12 mirrors Agent 6 (code quality), and Agent 13 verifies regression-test efficacy. Tests are not second-class code — apply the same adversarial rigor here as to production.
+
+**Agent 11 — Test efficacy & correctness (adversarial):** Prove each test actually exercises the production change and could fail if that change regressed.
+- **Vacuous assertions:** flag every assertion that cannot fail — `assertTrue(true)`, `assertFalse(false)`, `assertEquals(x, x)`, asserting a literal against the same literal, asserting on a value the test itself just hard-coded, or a `@Test` body with no assertion and no `expected=`/`assertThrows`.
+- **Tests that don't reach the changed code:** the assertion passes whether or not the production change is present. Trace the data flow from the changed symbol to the assertion.
+- **Happy-path-only:** no assertion on the error/exception/NULL path the production change added.
+- **Concurrency-test correctness:** races in the test harness itself, missing latches/barriers, an `AssertionError` thrown on a spawned thread where it is swallowed instead of failing the test, `Thread.sleep`-based synchronization that is timing-dependent and flaky.
+- **Test setup/teardown resource handling:** native memory allocated in setup/`@Before` that leaks on a failing path, missing `assertMemoryLeak()` wrapping.
+- Each finding states the exact assertion and why it cannot fail or what it fails to cover.
+
+**Agent 12 — Test-code quality & maintainability:** Review the test as code.
+- **Reflection overuse:** flag `setAccessible(true)`, `getDeclaredField`/`getDeclaredMethod`, `Field.set`, `Class.forName`, and similar when a public API, an existing test helper, or a constructor reaches the same state. Reflection in tests is a last resort; if a neater non-reflective path exists, the reflection is a finding — name the alternative.
+- **No code reuse / boilerplate stamping:** before accepting repeated setup or assertion blocks, run Grep/Glob for existing helpers, base test classes, and fixtures (e.g., `extends Abstract.*Test`, `TestUtils`, `*TestUtils`, shared `assert*`, shared `@Before`) using the 2.5e inventory. If a helper already exists that the new test reimplements inline, flag it and name the helper. Duplicated blocks across new test methods that should be a single helper or a parameterized test are findings.
+- **Javadoc bloat:** flag multi-paragraph javadoc on `@Test` methods, javadoc that merely restates the test name, and stacked/duplicated javadoc ("javadoc piled on javadoc"). Test intent belongs in a precise test name plus, at most, a one-line comment.
+- **Residue and smells:** dead code, commented-out code, copy-paste leftovers (a `testFoo` that actually tests bar), `System.out.println` debugging, `@Ignore` without a referenced ticket, magic numbers >= 5 digits without `_` separators.
+- **Which standards apply:** zero-GC and `io.questdb.std`-over-`java.util` do NOT apply to test code — do not flag `java.util` collections or allocations in tests. Member ordering, `is/has` boolean naming, and SQL style DO apply.
+
+**Agent 13 — Regression-test efficacy verification:** For any PR that claims to fix a bug, verify the regression test would actually fail without the production change. Reason about reverting the production hunk and confirm the new or changed test's assertions would then fail. If the test still passes with the fix reverted, it is not a regression test — flag it. State, per test, which production line the test depends on and what its assertion would do if that line were reverted. Run only when the PR is a fix; skip for pure features or refactors.
+
 Combine all agent findings into a single deduplicated **draft** report. Do NOT present this draft to the user yet — it goes straight into verification.
 
 ## Step 3b: Verify every finding against source code
@@ -206,7 +233,9 @@ For each finding in the draft report:
    saving is negligible relative to the surrounding work. Exception: GC allocations on a hot path are always worth
    flagging, even a single one.
 9. **For cross-context findings (Agent 9)**: re-read the callsite in full, including its callers up two levels, and confirm the broken behavior is reachable from production code paths. Cross-context findings are high-value but also the easiest to overstate — verify carefully.
-10. **Classify each finding** as:
+10. **For test-efficacy findings (Agents 11, 13)**: re-read the cited assertion in full context and confirm it truly cannot fail — a "vacuous assertion" claim is a false positive if production code actually recomputes the asserted value. For "would pass without the fix" claims, trace what the assertion observes against the reverted production hunk before reporting.
+11. **For test-code-quality findings (Agent 12)**: confirm a flagged reflective access really has a non-reflective alternative (some QuestDB internals genuinely require reflection in tests) before reporting it. Confirm a "reinvented helper" finding by actually locating the helper with Grep and checking its signature fits the test's need.
+12. **Classify each finding** as:
     - **CONFIRMED in-diff** — the bug is real and inside the diff
     - **CONFIRMED at out-of-diff callsite** — the bug is in an unchanged file because the changed symbol is used there in a way that's now broken (cite the file and the contract from 2.5c that was violated)
     - **FALSE POSITIVE** — the code is actually correct (explain why)
@@ -287,6 +316,14 @@ Review the diff for:
 - **Regression tests:** If this PR fixes a bug, is there a test that reproduces the original bug and would fail without the fix?
 - Use Grep/Glob to find existing test files for the changed classes and verify they cover the new behavior.
 
+### Test code quality
+- **No vacuous assertions.** Every assertion must be able to fail. Flag `assertTrue(true)`, `assertFalse(false)`, `assertEquals(x, x)`, asserting a literal against the same literal, or a `@Test` body with no assertion and no `expected=`/`assertThrows`.
+- **Reflection is a last resort.** Flag `setAccessible(true)`, `getDeclaredField`/`getDeclaredMethod`, `Field.set`, `Class.forName` when a public API, existing helper, or constructor would reach the same state. Name the non-reflective path.
+- **Reuse before reinventing.** Search for existing helpers, base classes, and fixtures before accepting inline setup. Duplicated setup/assert blocks an existing helper or a parameterized test would cover are findings; name the helper.
+- **No javadoc bloat.** No multi-paragraph javadoc on `@Test` methods, no javadoc that restates the test name, no stacked/duplicated javadoc. Prefer a precise test name and at most a one-line comment.
+- **Test-appropriate standards.** zero-GC and `io.questdb.std`-over-`java.util` rules do NOT apply to tests — do not flag them there. Member ordering, `is/has` naming, and SQL style DO apply.
+- **No debugging residue.** No `System.out.println`, no commented-out code, no `@Ignore` without a referenced ticket.
+
 ### Unresolved TODOs and FIXMEs
 - Scan the diff for `TODO`, `FIXME`, `HACK`, `XXX`, and `WORKAROUND` comments. For each one found:
   - Is it a pre-existing comment that was just moved/reformatted, or newly introduced in this PR?
diff --git a/.pi/skills/review-pr/SKILL.md b/.pi/skills/review-pr/SKILL.md
index 312cd599..7a2767c6 100644
--- a/.pi/skills/review-pr/SKILL.md
+++ b/.pi/skills/review-pr/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: review-pr
-description: Review a GitHub pull request against QuestDB coding standards. Use when asked to review a PR (by number or URL), optionally with a depth level 0..3. Performs an adversarial, blocking, mission-critical code review covering correctness, concurrency, performance, resource management, tests, and QuestDB conventions, then verifies every finding against source before reporting.
+description: Review a GitHub pull request against QuestDB coding standards. Use when asked to review a PR (by number or URL), optionally with a depth level 0..3. Performs an adversarial, blocking, mission-critical code review covering correctness, concurrency, performance, resource management, test coverage, test efficacy, test-code quality, and QuestDB conventions, then verifies every finding against source before reporting.
 allowed-tools: bash read subagent
 ---
 
@@ -45,10 +45,10 @@ The level controls how much of the review below actually runs. Lower levels keep
 
 | Level | What runs |
 |-------|-----------|
-| **0 (default)** | Steps 1, 2, 4. Skip Step 2.5. Skip Step 3 — no subagent spawn; review the diff inline in the main loop, using `read`/`bash` searches on demand to resolve ambiguities. Skip Step 3b — verify each finding inline as you write it. Single-pass review covering correctness, NULL handling, tests, and QuestDB standards on the diff itself. |
-| **1** | Adds Step 2.5a (semantic delta only — skip 2.5b/2.5c/2.5d). In Step 3, launch only reviewer 1 (correctness), reviewer 5 (tests), and reviewer 6 (code quality) in parallel. Skip all other reviewers. Skip Step 3b — verify findings inline as you draft the report. |
-| **2** | Full Step 2.5, but in 2.5b restrict the callsite inventory to `public`/`protected` symbols (skip package-private and `pub(crate)`). In Step 3, launch reviewers 1-7, plus reviewer 8 if `.rs` files are present. Skip reviewer 9 (cross-context) and reviewer 10 (adversarial fresh-context). Step 3b uses a single batched verification reviewer for all findings instead of one per finding. |
-| **3** | Every step below as written, all 10 reviewers, per-finding verification. The full mission-critical pass. |
+| **0 (default)** | Steps 1, 2, 4. Skip Step 2.5. Skip Step 3 — no subagent spawn; review the diff inline in the main loop, using `read`/`bash` searches on demand to resolve ambiguities. Skip Step 3b — verify each finding inline as you write it. Single-pass review covering correctness, NULL handling, test coverage, and QuestDB standards on the diff itself. When the diff touches test code, also apply the test-efficacy and test-code-quality anti-pattern checks inline (vacuous assertions, reflection overuse, reinvented helpers, javadoc bloat). |
+| **1** | Adds Step 2.5a (semantic delta only — skip 2.5b/2.5c/2.5d) plus Step 2.5e when test code is present. In Step 3, launch reviewer 1 (correctness), reviewer 5 (test coverage), reviewer 6 (code quality), and — when the diff touches test code — reviewer 11 (test efficacy) and reviewer 12 (test-code quality) in parallel. Skip all other reviewers. Skip Step 3b — verify findings inline as you draft the report. |
+| **2** | Full Step 2.5 (including 2.5e when test code is present), but in 2.5b restrict the callsite inventory to `public`/`protected` symbols (skip package-private and `pub(crate)`). In Step 3, launch reviewers 1-7, plus reviewer 8 if `.rs` files are present, plus reviewers 11 and 12 when the diff touches test code. Skip reviewer 9 (cross-context), reviewer 10 (adversarial fresh-context), and reviewer 13 (regression-test efficacy verification). Step 3b uses a single batched verification reviewer for all findings instead of one per finding. |
+| **3** | Every step below as written, all 13 reviewers, per-finding verification. The full mission-critical pass. |
 
 State the chosen level in one line at the start of the review so the user knows what they're getting (e.g., "Reviewing PR #1234 at level 2"). If the level was defaulted, mention that level 3 exists for full review.
 
@@ -128,6 +128,14 @@ End this step with an explicit list of "places this change is visible from but t
 
 The list groups the callsites from 2.5b by execution context: hot data paths, SQL compilation, async runtime, JNI boundary, replication, materialized views, parallel execution workers, etc. Every entry on this list must be reviewed in Step 3.
 
+### 2.5e Test surface & helper inventory
+
+Run this only when the PR adds or changes test code. It is the test-code counterpart to 2.5b and feeds Reviewers 11-13. Use real `rg`/`find` searches via `bash` — do not reason about helpers from memory.
+
+- **Existing-infrastructure inventory:** search the changed test files' package and module for base test classes, shared `@Before`/`@After`, helper methods, fixtures, and assertion utilities the new tests could reuse (`rg` for `extends Abstract.*Test`, `class .*TestUtils`, `assertMemoryLeak`, `assertQuery`, `assertSql`, shared `protected` helpers in the base class). This list is the baseline Reviewer 12 uses to flag reinvented boilerplate — a "you stamped boilerplate instead of reusing helper X" finding requires X to appear in this inventory.
+- **Changed shared helpers as symbols:** if the PR changes a shared test base class, helper, or fixture, run the 2.5b callsite inventory for it too — a changed test base class can silently break every subclassing test.
+- **Exercised-symbol map:** for each new or changed test, list which production symbols from 2.5a it actually exercises, so Reviewers 11 and 13 can check efficacy and regression value.
+
 ## Step 3: Parallel review
 
 Launch the reviewers below with the `subagent` tool in `context: "fresh"` mode, in parallel (`subagent({ tasks: [...], context: "fresh", concurrency: N })`). Every reviewer task must include:
@@ -155,7 +163,7 @@ Launch the following reviewers in parallel.
 
 **Reviewer 4 — Resource management:** Leaks on all code paths (especially errors), try-with-resources, native memory, pool management. Walk every callsite from 2.5b that constructs, owns, or transfers ownership of changed types and verify cleanup on all paths.
 
-**Reviewer 5 — Test review & coverage:** Coverage gaps, error path tests, NULL tests, boundary conditions, regression tests, test quality, `assertMemoryLeak()` usage. Cross-reference 2.5d: every cross-context exposure should have a test that exercises the changed symbol from that context. Missing tests for cross-context callsites is a high-priority finding.
+**Reviewer 5 — Test coverage:** Coverage gaps, error path tests, NULL tests, boundary conditions, regression tests exist, `assertMemoryLeak()` usage. Cross-reference 2.5d: every cross-context exposure should have a test that exercises the changed symbol from that context. Missing tests for cross-context callsites is a high-priority finding. Test *efficacy* (whether those tests actually exercise the change and could fail) and test-*code* quality are handled by Reviewers 11-13 — here focus only on whether coverage exists for every new or changed path.
 
 **Reviewer 6 — Code quality & standards:** Code smell, member ordering, naming conventions, modern Java features, dead code, third-party dependencies.
 
@@ -192,6 +200,25 @@ The point of this reviewer is to surface bugs the structured reviewers cannot se
 
 Run this reviewer in parallel with reviewers 1-9. It is mandatory regardless of diff size.
 
+**Test-code reviewers (Reviewers 11-13) — run only when the diff adds or changes test code.** Launch them in the same parallel batch as reviewers 1-10. Each receives the diff, the change surface map, and the test surface inventory from 2.5e. They are the test-code counterparts to the production reviewers: Reviewer 11 mirrors Reviewer 1 (correctness), Reviewer 12 mirrors Reviewer 6 (code quality), and Reviewer 13 verifies regression-test efficacy. Tests are not second-class code — apply the same adversarial rigor here as to production.
+
+**Reviewer 11 — Test efficacy & correctness (adversarial):** Prove each test actually exercises the production change and could fail if that change regressed.
+- **Vacuous assertions:** flag every assertion that cannot fail — `assertTrue(true)`, `assertFalse(false)`, `assertEquals(x, x)`, asserting a literal against the same literal, asserting on a value the test itself just hard-coded, or a `@Test` body with no assertion and no `expected=`/`assertThrows`.
+- **Tests that don't reach the changed code:** the assertion passes whether or not the production change is present. Trace the data flow from the changed symbol to the assertion.
+- **Happy-path-only:** no assertion on the error/exception/NULL path the production change added.
+- **Concurrency-test correctness:** races in the test harness itself, missing latches/barriers, an `AssertionError` thrown on a spawned thread where it is swallowed instead of failing the test, `Thread.sleep`-based synchronization that is timing-dependent and flaky.
+- **Test setup/teardown resource handling:** native memory allocated in setup/`@Before` that leaks on a failing path, missing `assertMemoryLeak()` wrapping.
+- Each finding states the exact assertion and why it cannot fail or what it fails to cover.
+
+**Reviewer 12 — Test-code quality & maintainability:** Review the test as code.
+- **Reflection overuse:** flag `setAccessible(true)`, `getDeclaredField`/`getDeclaredMethod`, `Field.set`, `Class.forName`, and similar when a public API, an existing test helper, or a constructor reaches the same state. Reflection in tests is a last resort; if a neater non-reflective path exists, the reflection is a finding — name the alternative.
+- **No code reuse / boilerplate stamping:** before accepting repeated setup or assertion blocks, run `rg`/`find` for existing helpers, base test classes, and fixtures (e.g., `extends Abstract.*Test`, `TestUtils`, `*TestUtils`, shared `assert*`, shared `@Before`) using the 2.5e inventory. If a helper already exists that the new test reimplements inline, flag it and name the helper. Duplicated blocks across new test methods that should be a single helper or a parameterized test are findings.
+- **Javadoc bloat:** flag multi-paragraph javadoc on `@Test` methods, javadoc that merely restates the test name, and stacked/duplicated javadoc ("javadoc piled on javadoc"). Test intent belongs in a precise test name plus, at most, a one-line comment.
+- **Residue and smells:** dead code, commented-out code, copy-paste leftovers (a `testFoo` that actually tests bar), `System.out.println` debugging, `@Ignore` without a referenced ticket, magic numbers >= 5 digits without `_` separators.
+- **Which standards apply:** zero-GC and `io.questdb.std`-over-`java.util` do NOT apply to test code — do not flag `java.util` collections or allocations in tests. Member ordering, `is/has` boolean naming, and SQL style DO apply.
+
+**Reviewer 13 — Regression-test efficacy verification:** For any PR that claims to fix a bug, verify the regression test would actually fail without the production change. Reason about reverting the production hunk and confirm the new or changed test's assertions would then fail. If the test still passes with the fix reverted, it is not a regression test — flag it. State, per test, which production line the test depends on and what its assertion would do if that line were reverted. Run only when the PR is a fix; skip for pure features or refactors.
+
 Combine all reviewer findings into a single deduplicated **draft** report. Do NOT present this draft to the user yet — it goes straight into verification.
 
 ## Step 3b: Verify every finding against source code
@@ -217,7 +244,9 @@ For each finding in the draft report:
    saving is negligible relative to the surrounding work. Exception: GC allocations on a hot path are always worth
    flagging, even a single one.
 9. **For cross-context findings (Reviewer 9)**: re-read the callsite in full, including its callers up two levels, and confirm the broken behavior is reachable from production code paths. Cross-context findings are high-value but also the easiest to overstate — verify carefully.
-10. **Classify each finding** as:
+10. **For test-efficacy findings (Reviewers 11, 13)**: re-read the cited assertion in full context and confirm it truly cannot fail — a "vacuous assertion" claim is a false positive if production code actually recomputes the asserted value. For "would pass without the fix" claims, trace what the assertion observes against the reverted production hunk before reporting.
+11. **For test-code-quality findings (Reviewer 12)**: confirm a flagged reflective access really has a non-reflective alternative (some QuestDB internals genuinely require reflection in tests) before reporting it. Confirm a "reinvented helper" finding by actually locating the helper with `rg` and checking its signature fits the test's need.
+12. **Classify each finding** as:
     - **CONFIRMED in-diff** — the bug is real and inside the diff
     - **CONFIRMED at out-of-diff callsite** — the bug is in an unchanged file because the changed symbol is used there in a way that's now broken (cite the file and the contract from 2.5c that was violated)
     - **FALSE POSITIVE** — the code is actually correct (explain why)
@@ -298,6 +327,14 @@ Review the diff for:
 - **Regression tests:** If this PR fixes a bug, is there a test that reproduces the original bug and would fail without the fix?
 - Use `read`/`bash` (rg/find) to find existing test files for the changed classes and verify they cover the new behavior.
 
+### Test code quality
+- **No vacuous assertions.** Every assertion must be able to fail. Flag `assertTrue(true)`, `assertFalse(false)`, `assertEquals(x, x)`, asserting a literal against the same literal, or a `@Test` body with no assertion and no `expected=`/`assertThrows`.
+- **Reflection is a last resort.** Flag `setAccessible(true)`, `getDeclaredField`/`getDeclaredMethod`, `Field.set`, `Class.forName` when a public API, existing helper, or constructor would reach the same state. Name the non-reflective path.
+- **Reuse before reinventing.** Search for existing helpers, base classes, and fixtures before accepting inline setup. Duplicated setup/assert blocks an existing helper or a parameterized test would cover are findings; name the helper.
+- **No javadoc bloat.** No multi-paragraph javadoc on `@Test` methods, no javadoc that restates the test name, no stacked/duplicated javadoc. Prefer a precise test name and at most a one-line comment.
+- **Test-appropriate standards.** zero-GC and `io.questdb.std`-over-`java.util` rules do NOT apply to tests — do not flag them there. Member ordering, `is/has` naming, and SQL style DO apply.
+- **No debugging residue.** No `System.out.println`, no commented-out code, no `@Ignore` without a referenced ticket.
+
 ### Unresolved TODOs and FIXMEs
 - Scan the diff for `TODO`, `FIXME`, `HACK`, `XXX`, and `WORKAROUND` comments. For each one found:
   - Is it a pre-existing comment that was just moved/reformatted, or newly introduced in this PR?