
feat(bench): LLM benchmarks, regression baselines, and bundle size tracking #8

Open
SimplyLiz wants to merge 13 commits into develop from feature/llm-benchmark-integration

Conversation

@SimplyLiz (Owner)

Summary

  • LLM benchmark integration — opt-in --llm flag runs summarization benchmarks against OpenAI, Anthropic, and Ollama with auto-detection, persists results as non-deterministic reference data
  • Regression baselines — --save / --check / --tolerance flags with versioned JSON baselines; CI runs bench:check on every push/PR
  • Rich benchmark docs — auto-generated docs/benchmark-results.md with Mermaid charts, ASCII comparison bars, badges, provider summary tables, and collapsible detail sections
  • Bundle size tracking — measures each dist/*.js file raw + gzip, stores in baseline, fails CI on regression
  • Compression refactor — unified sync/async paths via generator pattern, reducing duplication in src/compress.ts

What's included (13 commits)

  • LLM benchmark scaffolding with provider auto-detection and .env loading
  • --save, --check, --tolerance baseline management
  • Benchmark handbook split (docs/benchmarks.md + docs/benchmark-results.md)
  • Badges, pie chart, Mermaid bar charts, ASCII horizontal bars for LLM vs deterministic
  • Provider summary table with fuzzy dedup delta
  • Bundle size per-file tracking with gzip badge
  • defaultTokenCounter rationale docs
  • Sync/async unification in compress pipeline

Test plan

  • npm run build compiles
  • npm test — 333 tests pass
  • npm run lint — clean
  • npm run bench:save — saves baseline with all metrics including bundle size
  • npm run bench -- --check — passes against saved baseline
  • docs/benchmark-results.md — Bundle Size section, gzip badge, all tables render
  • bench/baselines/current.json — bundleSize key present

- Add inline .env parser in bench/run.ts (no dependency, won't override existing vars)
- Probe localhost:11434/api/tags to auto-detect Ollama without env vars
- Add LLM result types and save/load in bench/baseline.ts
- Auto-save LLM results to bench/baselines/llm/<provider>-<model>.json
- Extend doc generator with LLM comparison tables when result files exist
- Add .env.example template with commented-out provider keys
- Update skip message to mention Ollama auto-detection
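The dependency-free .env parsing described above can be sketched as follows (the function names are illustrative, not the PR's actual API in bench/run.ts; the behaviors shown — quote stripping, `export ` prefix handling, never overriding existing variables — are the ones the commits describe):

```typescript
// Minimal .env parser sketch: skips comments, tolerates an `export ` prefix,
// and strips matching single or double quotes from values.
function parseEnv(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (!line || line.startsWith("#")) continue; // blanks and comments
    const stripped = line.startsWith("export ") ? line.slice(7) : line;
    const eq = stripped.indexOf("=");
    if (eq === -1) continue; // not a KEY=value line
    const key = stripped.slice(0, eq).trim();
    let value = stripped.slice(eq + 1).trim();
    if (
      value.length >= 2 &&
      (value[0] === '"' || value[0] === "'") &&
      value[value.length - 1] === value[0]
    ) {
      value = value.slice(1, -1); // strip matching quotes
    }
    vars[key] = value;
  }
  return vars;
}

// Apply parsed vars without clobbering anything already in the environment.
function applyEnv(vars: Record<string, string>): void {
  for (const [key, value] of Object.entries(vars)) {
    if (process.env[key] === undefined) process.env[key] = value;
  }
}
```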

… metrics

LLM benchmarks previously ran automatically when API keys were
detected, silently burning money on every `npm run bench`. Now
requires explicit `--llm` flag (`npm run bench:llm`).

Additions:
- Technical explanation scenario (pure prose, no code fences)
- vsDet expansion metric (LLM ratio / deterministic ratio)
- Token budget + LLM section (deterministic vs llm-escalate)
- bench:llm npm script

Fixes:
- .env parser: strip quotes, handle `export` prefix
- loadAllLlmResults: try/catch per file for malformed JSON
- Ollama: verify model availability via /api/tags response
- Anthropic: guard against empty content array
- LLM benchmark loop: per-scenario try/catch
- Doc generation: scenario count 7→8, add Technical explanation
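The Ollama auto-detection and model-availability check above can be sketched like this (`detectOllamaModel` is a hypothetical name, not the PR's actual code; `/api/tags` is Ollama's endpoint for listing locally pulled models, which respond with a `models[].name` array):

```typescript
// Probe a local Ollama server on the default port and check whether the
// requested model is actually pulled. Returns null when no server answers,
// so the benchmark runner can skip Ollama without any env vars set.
async function detectOllamaModel(model: string): Promise<boolean | null> {
  try {
    const res = await fetch("http://localhost:11434/api/tags", {
      signal: AbortSignal.timeout(1000), // fail fast if nothing is listening
    });
    if (!res.ok) return null;
    const body = (await res.json()) as { models?: Array<{ name: string }> };
    // Verify availability via the tag list instead of assuming the model exists.
    return (body.models ?? []).some(
      (m) => m.name === model || m.name.startsWith(`${model}:`),
    );
  } catch {
    return null; // connection refused or timeout — no Ollama here
  }
}
```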

…ture

- --save: writes current.json + history/v{version}.json, regenerates docs
- --check: compares against current.json, exits non-zero on regression
- --tolerance N: allows N% deviation (0% default, deterministic)
- Baselines reorganized: current.json at root, history/ for versioned
  snapshots, llm/ for non-deterministic reference data
- bench:llm added to package.json for explicit LLM benchmark runs
- Doc generation references correct baseline paths
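The --check comparison with a percentage tolerance can be sketched as follows (illustrative names, not the PR's bench/baseline.ts API; this assumes metrics where an increase is a regression, e.g. compressed size or bundle size):

```typescript
// Compare current metrics against a saved baseline, allowing N% deviation.
// The default tolerance is 0%, i.e. fully deterministic metrics must match.
type MetricMap = Record<string, number>;

function checkAgainstBaseline(
  baseline: MetricMap,
  current: MetricMap,
  tolerancePct = 0,
): string[] {
  const regressions: string[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const now = current[name];
    if (now === undefined) continue; // metric dropped; report separately if needed
    const deltaPct = base === 0 ? 0 : ((now - base) / base) * 100;
    if (deltaPct > tolerancePct) {
      regressions.push(`${name}: ${base} -> ${now} (+${deltaPct.toFixed(1)}%)`);
    }
  }
  return regressions; // the caller exits non-zero when this is non-empty
}
```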

Split docs/benchmarks.md into two files:
- docs/benchmarks.md: hand-written handbook (how to run, scenarios,
  interpreting results, regression testing)
- docs/benchmark-results.md: auto-generated by bench:save with Mermaid
  xychart-beta charts, summary table, and polished data presentation

Rewrite generateBenchmarkDocs() with compression ratio chart, dedup
impact chart, LLM comparison chart, key findings callout, and
conditional sections for LLM data and version history.

…pie chart

Add shields.io badges, unicode progress bars, reduction % and message
count columns to the compression table, a Mermaid pie chart for message
outcomes, and collapsible details sections for LLM provider tables.

Drop progress bar column from compression table — unicode blocks render
with variable width in GitHub's proportional-font tables. Switch LLM
comparison chart from double bar (stacked) to bar+line so both series
are visible side by side.

Interleave "Scenario (Det)" and "Scenario (LLM)" labels on the x-axis
so each scenario gets two side-by-side bars in a single series, avoiding
Mermaid's stacked-bar behavior.

Mermaid xychart can't do grouped bars — stacks or overlaps labels.
Replace with a clean comparison table showing Det vs Best LLM ratio,
delta percentage, and winner per scenario.

…arison

Render comparison as paired horizontal bars inside a fenced code block
(monospace), replacing the broken Mermaid chart. Each scenario shows
Det and LLM bars side by side with ratios and a star for LLM wins.
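Paired monospace bars of this kind can be sketched as below (bar width, the `#` glyph, and the star-on-lower-ratio rule are illustrative choices, not necessarily the PR's exact formatting):

```typescript
// Render one scenario as paired horizontal ASCII bars (Det vs LLM),
// intended for a fenced code block where monospace rendering is guaranteed.
function pairedBars(scenario: string, det: number, llm: number, width = 30): string {
  const max = Math.max(det, llm, 1);
  const bar = (v: number) => "#".repeat(Math.round((v / max) * width));
  const star = llm < det ? " *" : ""; // star marks an LLM win (lower ratio assumed better)
  return [
    scenario,
    `  Det ${bar(det).padEnd(width)} ${det.toFixed(2)}`,
    `  LLM ${bar(llm).padEnd(width)} ${llm.toFixed(2)}${star}`,
  ].join("\n");
}
```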

compressSync and compressAsync were identical (~180 lines each) except
for 2 summarize call sites. Replace both with a single compressGen
generator that yields summarize requests, driven by thin sync/async
runners. Removes 149 lines of duplication, no public API changes.

…CII charts

- Cross-provider summary table with avg ratio, vsDet, budget fits, time
- Fuzzy dedup table gains "vs Base" column highlighting improvements
- ASCII comparison charts now render for all providers, not just best

Measure each dist/*.js file and total after tsc build. Adds
BundleSizeResult type, comparison loop for --check regression
detection, doc section with table, and gzip badge.
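The per-file raw + gzip measurement can be sketched as follows (`measureBundle` and the `BundleEntry` shape are illustrative; the PR's actual type is called BundleSizeResult):

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { gzipSync } from "node:zlib";

type BundleEntry = { file: string; raw: number; gzip: number };

// Measure raw and gzipped byte size of every .js file in the build output.
// Results of this shape can be stored in the baseline and compared on --check.
function measureBundle(distDir: string): BundleEntry[] {
  return readdirSync(distDir)
    .filter((f) => f.endsWith(".js"))
    .map((file) => {
      const buf = readFileSync(join(distDir, file));
      return { file, raw: buf.length, gzip: gzipSync(buf).length };
    });
}
```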
