feat(bench): LLM benchmarks, regression baselines, and bundle size tracking #8
Open
Conversation
- Add inline .env parser in bench/run.ts (no dependency, won't override existing vars)
- Probe localhost:11434/api/tags to auto-detect Ollama without env vars
- Add LLM result types and save/load in bench/baseline.ts
- Auto-save LLM results to bench/baselines/llm/&lt;provider&gt;-&lt;model&gt;.json
- Extend doc generator with LLM comparison tables when result files exist
- Add .env.example template with commented-out provider keys
- Update skip message to mention Ollama auto-detection
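The no-dependency, no-override .env loader described above could look like this minimal sketch (the function names `parseDotEnv` and `loadDotEnv` are hypothetical; the real code lives in bench/run.ts):

```typescript
import { existsSync, readFileSync } from "node:fs";

// Parse KEY=VALUE lines, skipping blanks/comments, stripping an optional
// `export ` prefix and one pair of surrounding quotes.
export function parseDotEnv(src: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of src.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line || line.startsWith("#")) continue;
    const m = /^(?:export\s+)?([\w.]+)\s*=\s*(.*)$/.exec(line);
    if (!m) continue;
    let value = m[2].trim();
    if (value.length >= 2 && (value[0] === '"' || value[0] === "'") && value.endsWith(value[0])) {
      value = value.slice(1, -1); // strip matching quotes
    }
    out[m[1]] = value;
  }
  return out;
}

export function loadDotEnv(path = ".env"): void {
  if (!existsSync(path)) return;
  for (const [key, value] of Object.entries(parseDotEnv(readFileSync(path, "utf8")))) {
    // Never override variables already set in the environment.
    if (process.env[key] === undefined) process.env[key] = value;
  }
}
```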
… metrics

LLM benchmarks previously ran automatically when API keys were detected, silently burning money on every `npm run bench`. They now require an explicit `--llm` flag (`npm run bench:llm`).

Additions:
- Technical explanation scenario (pure prose, no code fences)
- vsDet expansion metric (LLM ratio / deterministic ratio)
- Token budget + LLM section (deterministic vs llm-escalate)
- bench:llm npm script

Fixes:
- .env parser: strip quotes, handle `export` prefix
- loadAllLlmResults: try/catch per file for malformed JSON
- Ollama: verify model availability via /api/tags response
- Anthropic: guard against empty content array
- LLM benchmark loop: per-scenario try/catch
- Doc generation: scenario count 7→8, add Technical explanation
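The Ollama model-availability fix could be implemented along these lines. This is a sketch: `hasOllamaModel` and `detectOllama` are hypothetical names, and the response shape (`{ models: [{ name }] }`) is what Ollama's `/api/tags` endpoint returns as documented:

```typescript
interface OllamaTags {
  models?: Array<{ name: string }>;
}

// Ollama tags models as "llama3:latest", so accept either an exact match
// or a match on the name before the tag separator.
export function hasOllamaModel(tags: OllamaTags, model: string): boolean {
  return (tags.models ?? []).some(
    (m) => m.name === model || m.name.split(":")[0] === model,
  );
}

// Probe with a short timeout so a missing daemon fails fast instead of hanging.
export async function detectOllama(model: string): Promise<boolean> {
  try {
    const res = await fetch("http://localhost:11434/api/tags", {
      signal: AbortSignal.timeout(1000),
    });
    if (!res.ok) return false;
    return hasOllamaModel((await res.json()) as OllamaTags, model);
  } catch {
    return false; // daemon not running or unreachable
  }
}
```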
…ture
- --save: writes current.json + history/v{version}.json, regenerates docs
- --check: compares against current.json, exits non-zero on regression
- --tolerance N: allows N% deviation (0% default, deterministic)
- Baselines reorganized: current.json at root, history/ for versioned
snapshots, llm/ for non-deterministic reference data
- bench:llm added to package.json for explicit LLM benchmark runs
- Doc generation references correct baseline paths
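The `--check`/`--tolerance` comparison above can be sketched as follows, assuming a flat metric-name → number baseline shape where an increase is a regression (`findRegressions` and the `Regression` shape are hypothetical):

```typescript
export interface Regression {
  metric: string;
  baseline: number;
  current: number;
  deltaPct: number;
}

// tolerancePct = 0 (the default) means any increase fails, which is
// appropriate for deterministic metrics.
export function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerancePct = 0,
): Regression[] {
  const out: Regression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const cur = current[metric];
    if (cur === undefined) continue; // removed metric: reported elsewhere
    const deltaPct =
      base === 0 ? (cur === 0 ? 0 : Infinity) : ((cur - base) / base) * 100;
    if (deltaPct > tolerancePct) {
      out.push({ metric, baseline: base, current: cur, deltaPct });
    }
  }
  return out;
}
```

A runner would exit non-zero when the returned array is non-empty.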
Split docs/benchmarks.md into two files:
- docs/benchmarks.md: hand-written handbook (how to run, scenarios, interpreting results, regression testing)
- docs/benchmark-results.md: auto-generated by bench:save with Mermaid xychart-beta charts, summary table, and polished data presentation

Rewrite generateBenchmarkDocs() with a compression ratio chart, dedup impact chart, LLM comparison chart, key findings callout, and conditional sections for LLM data and version history.
…pie chart

Add shields.io badges, unicode progress bars, reduction % and message count columns to the compression table, a Mermaid pie chart for message outcomes, and collapsible details sections for LLM provider tables.
Drop progress bar column from compression table — unicode blocks render with variable width in GitHub's proportional-font tables. Switch LLM comparison chart from double bar (stacked) to bar+line so both series are visible side by side.
Interleave "Scenario (Det)" and "Scenario (LLM)" labels on the x-axis so each scenario gets two side-by-side bars in a single series, avoiding Mermaid's stacked-bar behavior.
Mermaid xychart can't do grouped bars — stacks or overlaps labels. Replace with a clean comparison table showing Det vs Best LLM ratio, delta percentage, and winner per scenario.
…arison

Render the comparison as paired horizontal bars inside a fenced code block (monospace), replacing the broken Mermaid chart. Each scenario shows Det and LLM bars side by side with ratios and a star for LLM wins.
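The paired-bar rendering might look like this sketch (`renderPairedBars` is a hypothetical name; it assumes lower ratios are better, so the star marks LLM wins):

```typescript
// Render one Det bar and one LLM bar per scenario, scaled to a fixed width,
// for embedding in a fenced (monospace) code block.
export function renderPairedBars(
  rows: Array<{ scenario: string; det: number; llm: number }>,
  width = 30,
): string {
  const max = Math.max(...rows.flatMap((r) => [r.det, r.llm]), 1e-9);
  const bar = (v: number) => "#".repeat(Math.max(1, Math.round((v / max) * width)));
  const lines: string[] = [];
  for (const r of rows) {
    const star = r.llm < r.det ? " *" : ""; // star = LLM beat deterministic
    lines.push(r.scenario);
    lines.push(`  Det ${bar(r.det).padEnd(width)} ${r.det.toFixed(2)}`);
    lines.push(`  LLM ${bar(r.llm).padEnd(width)} ${r.llm.toFixed(2)}${star}`);
  }
  return lines.join("\n");
}
```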
compressSync and compressAsync were identical (~180 lines each) except for 2 summarize call sites. Replace both with a single compressGen generator that yields summarize requests, driven by thin sync/async runners. Removes 149 lines of duplication, no public API changes.
…CII charts

- Cross-provider summary table with avg ratio, vsDet, budget fits, time
- Fuzzy dedup table gains "vs Base" column highlighting improvements
- ASCII comparison charts now render for all providers, not just best
Measure each dist/*.js file and total after tsc build. Adds BundleSizeResult type, comparison loop for --check regression detection, doc section with table, and gzip badge.
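The measurement step could be sketched as follows (the `BundleSizeResult` type name comes from the commit; its fields and `measureBundle` are assumptions):

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { gzipSync } from "node:zlib";

export interface BundleSizeResult {
  files: Record<string, { raw: number; gzip: number }>;
  totalRaw: number;
  totalGzip: number;
}

// Measure raw and gzipped bytes for every .js file in the build output.
export function measureBundle(distDir = "dist"): BundleSizeResult {
  const result: BundleSizeResult = { files: {}, totalRaw: 0, totalGzip: 0 };
  for (const name of readdirSync(distDir)) {
    if (!name.endsWith(".js")) continue;
    const buf = readFileSync(join(distDir, name));
    const raw = buf.byteLength;
    const gzip = gzipSync(buf).byteLength;
    result.files[name] = { raw, gzip };
    result.totalRaw += raw;
    result.totalGzip += gzip;
  }
  return result;
}
```

The result can then be stored under a key in the baseline JSON and compared by the same regression loop as the other metrics.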
Summary
- `--llm` flag runs summarization benchmarks against OpenAI, Anthropic, and Ollama with auto-detection, and persists results as non-deterministic reference data
- `--save`/`--check`/`--tolerance` flags with versioned JSON baselines; CI runs `bench:check` on every push/PR
- `docs/benchmark-results.md` with Mermaid charts, ASCII comparison bars, badges, provider summary tables, and collapsible detail sections
- Bundle size tracking: each `dist/*.js` file raw + gzip, stored in the baseline, fails CI on regression
- `src/compress.ts` sync/async deduplication via a single generator

What's included (13 commits)
- Inline `.env` loading
- `--save`, `--check`, `--tolerance` baseline management
- Docs split (`docs/benchmarks.md` + `docs/benchmark-results.md`)
- `defaultTokenCounter` rationale docs

Test plan
- `npm run build` compiles
- `npm test` — 333 tests pass
- `npm run lint` — clean
- `npm run bench:save` — saves baseline with all metrics including bundle size
- `npm run bench -- --check` — passes against saved baseline
- `docs/benchmark-results.md` — Bundle Size section, gzip badge, all tables render
- `bench/baselines/current.json` — `bundleSize` key present