
feat(bench): LLM benchmarks, regression baselines, and bundle size tracking #8

Open
SimplyLiz wants to merge 13 commits into develop from feature/llm-benchmark-integration

Conversation

@SimplyLiz (Owner)

Summary

  • LLM benchmark integration — opt-in --llm flag runs summarization benchmarks against OpenAI, Anthropic, and Ollama with auto-detection, persists results as non-deterministic reference data
  • Regression baselines — --save / --check / --tolerance flags with versioned JSON baselines; CI runs bench:check on every push/PR
  • Rich benchmark docs — auto-generated docs/benchmark-results.md with Mermaid charts, ASCII comparison bars, badges, provider summary tables, and collapsible detail sections
  • Bundle size tracking — measures each dist/*.js file raw + gzip, stores in baseline, fails CI on regression
  • Compression refactor — unified sync/async paths via generator pattern, reducing duplication in src/compress.ts

What's included (13 commits)

  • LLM benchmark scaffolding with provider auto-detection and .env loading
  • --save, --check, --tolerance baseline management
  • Benchmark handbook split (docs/benchmarks.md + docs/benchmark-results.md)
  • Badges, pie chart, Mermaid bar charts, ASCII horizontal bars for LLM vs deterministic
  • Provider summary table with fuzzy dedup delta
  • Bundle size per-file tracking with gzip badge
  • defaultTokenCounter rationale docs
  • Sync/async unification in compress pipeline

Test plan

  • npm run build compiles
  • npm test — 333 tests pass
  • npm run lint — clean
  • npm run bench:save — saves baseline with all metrics including bundle size
  • npm run bench -- --check — passes against saved baseline
  • docs/benchmark-results.md — Bundle Size section, gzip badge, all tables render
  • bench/baselines/current.json — bundleSize key present

- Add inline .env parser in bench/run.ts (no dependency, won't override existing vars)
- Probe localhost:11434/api/tags to auto-detect Ollama without env vars
- Add LLM result types and save/load in bench/baseline.ts
- Auto-save LLM results to bench/baselines/llm/<provider>-<model>.json
- Extend doc generator with LLM comparison tables when result files exist
- Add .env.example template with commented-out provider keys
- Update skip message to mention Ollama auto-detection
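The dependency-free .env parsing described above can be sketched as follows (the function names are illustrative, not the PR's actual API in bench/run.ts; the behaviors shown — quote stripping, `export ` prefix handling, never overriding existing variables — are the ones the commits describe):

```typescript
// Minimal .env parser sketch: skips comments, tolerates an `export ` prefix,
// and strips matching single or double quotes from values.
function parseEnv(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (!line || line.startsWith("#")) continue; // blanks and comments
    const stripped = line.startsWith("export ") ? line.slice(7) : line;
    const eq = stripped.indexOf("=");
    if (eq === -1) continue; // not a KEY=value line
    const key = stripped.slice(0, eq).trim();
    let value = stripped.slice(eq + 1).trim();
    if (
      value.length >= 2 &&
      (value[0] === '"' || value[0] === "'") &&
      value[value.length - 1] === value[0]
    ) {
      value = value.slice(1, -1); // strip matching quotes
    }
    vars[key] = value;
  }
  return vars;
}

// Apply parsed vars without clobbering anything already in the environment.
function applyEnv(vars: Record<string, string>): void {
  for (const [key, value] of Object.entries(vars)) {
    if (process.env[key] === undefined) process.env[key] = value;
  }
}
```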

… metrics

LLM benchmarks previously ran automatically when API keys were
detected, silently burning money on every `npm run bench`. Now
requires explicit `--llm` flag (`npm run bench:llm`).

Additions:
- Technical explanation scenario (pure prose, no code fences)
- vsDet expansion metric (LLM ratio / deterministic ratio)
- Token budget + LLM section (deterministic vs llm-escalate)
- bench:llm npm script

Fixes:
- .env parser: strip quotes, handle `export` prefix
- loadAllLlmResults: try/catch per file for malformed JSON
- Ollama: verify model availability via /api/tags response
- Anthropic: guard against empty content array
- LLM benchmark loop: per-scenario try/catch
- Doc generation: scenario count 7→8, add Technical explanation
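The Ollama auto-detection and model-availability check above can be sketched like this (`detectOllamaModel` is a hypothetical name, not the PR's actual code; `/api/tags` is Ollama's endpoint for listing locally pulled models, which respond with a `models[].name` array):

```typescript
// Probe a local Ollama server on the default port and check whether the
// requested model is actually pulled. Returns null when no server answers,
// so the benchmark runner can skip Ollama without any env vars set.
async function detectOllamaModel(model: string): Promise<boolean | null> {
  try {
    const res = await fetch("http://localhost:11434/api/tags", {
      signal: AbortSignal.timeout(1000), // fail fast if nothing is listening
    });
    if (!res.ok) return null;
    const body = (await res.json()) as { models?: Array<{ name: string }> };
    // Verify availability via the tag list instead of assuming the model exists.
    return (body.models ?? []).some(
      (m) => m.name === model || m.name.startsWith(`${model}:`),
    );
  } catch {
    return null; // connection refused or timeout — no Ollama here
  }
}
```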

…ture

- --save: writes current.json + history/v{version}.json, regenerates docs
- --check: compares against current.json, exits non-zero on regression
- --tolerance N: allows N% deviation (0% default, deterministic)
- Baselines reorganized: current.json at root, history/ for versioned
  snapshots, llm/ for non-deterministic reference data
- bench:llm added to package.json for explicit LLM benchmark runs
- Doc generation references correct baseline paths
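The --check comparison with a percentage tolerance can be sketched as follows (illustrative names, not the PR's bench/baseline.ts API; this assumes metrics where an increase is a regression, e.g. compressed size or bundle size):

```typescript
// Compare current metrics against a saved baseline, allowing N% deviation.
// The default tolerance is 0%, i.e. fully deterministic metrics must match.
type MetricMap = Record<string, number>;

function checkAgainstBaseline(
  baseline: MetricMap,
  current: MetricMap,
  tolerancePct = 0,
): string[] {
  const regressions: string[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const now = current[name];
    if (now === undefined) continue; // metric dropped; report separately if needed
    const deltaPct = base === 0 ? 0 : ((now - base) / base) * 100;
    if (deltaPct > tolerancePct) {
      regressions.push(`${name}: ${base} -> ${now} (+${deltaPct.toFixed(1)}%)`);
    }
  }
  return regressions; // the caller exits non-zero when this is non-empty
}
```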

Split docs/benchmarks.md into two files:
- docs/benchmarks.md: hand-written handbook (how to run, scenarios,
  interpreting results, regression testing)
- docs/benchmark-results.md: auto-generated by bench:save with Mermaid
  xychart-beta charts, summary table, and polished data presentation

Rewrite generateBenchmarkDocs() with compression ratio chart, dedup
impact chart, LLM comparison chart, key findings callout, and
conditional sections for LLM data and version history.

…pie chart

Add shields.io badges, unicode progress bars, reduction % and message
count columns to the compression table, a Mermaid pie chart for message
outcomes, and collapsible details sections for LLM provider tables.

Drop progress bar column from compression table — unicode blocks render
with variable width in GitHub's proportional-font tables. Switch LLM
comparison chart from double bar (stacked) to bar+line so both series
are visible side by side.

Interleave "Scenario (Det)" and "Scenario (LLM)" labels on the x-axis
so each scenario gets two side-by-side bars in a single series, avoiding
Mermaid's stacked-bar behavior.

Mermaid xychart can't do grouped bars — stacks or overlaps labels.
Replace with a clean comparison table showing Det vs Best LLM ratio,
delta percentage, and winner per scenario.

…arison

Render comparison as paired horizontal bars inside a fenced code block
(monospace), replacing the broken Mermaid chart. Each scenario shows
Det and LLM bars side by side with ratios and a star for LLM wins.
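Paired monospace bars of this kind can be sketched as below (bar width, the `#` glyph, and the star-on-lower-ratio rule are illustrative choices, not necessarily the PR's exact formatting):

```typescript
// Render one scenario as paired horizontal ASCII bars (Det vs LLM),
// intended for a fenced code block where monospace rendering is guaranteed.
function pairedBars(scenario: string, det: number, llm: number, width = 30): string {
  const max = Math.max(det, llm, 1);
  const bar = (v: number) => "#".repeat(Math.round((v / max) * width));
  const star = llm < det ? " *" : ""; // star marks an LLM win (lower ratio assumed better)
  return [
    scenario,
    `  Det ${bar(det).padEnd(width)} ${det.toFixed(2)}`,
    `  LLM ${bar(llm).padEnd(width)} ${llm.toFixed(2)}${star}`,
  ].join("\n");
}
```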

compressSync and compressAsync were identical (~180 lines each) except
for 2 summarize call sites. Replace both with a single compressGen
generator that yields summarize requests, driven by thin sync/async
runners. Removes 149 lines of duplication, no public API changes.

…CII charts

- Cross-provider summary table with avg ratio, vsDet, budget fits, time
- Fuzzy dedup table gains "vs Base" column highlighting improvements
- ASCII comparison charts now render for all providers, not just best

Measure each dist/*.js file and total after tsc build. Adds
BundleSizeResult type, comparison loop for --check regression
detection, doc section with table, and gzip badge.
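The per-file raw + gzip measurement can be sketched as follows (`measureBundle` and the `BundleEntry` shape are illustrative; the PR's actual type is called BundleSizeResult):

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { gzipSync } from "node:zlib";

type BundleEntry = { file: string; raw: number; gzip: number };

// Measure raw and gzipped byte size of every .js file in the build output.
// Results of this shape can be stored in the baseline and compared on --check.
function measureBundle(distDir: string): BundleEntry[] {
  return readdirSync(distDir)
    .filter((f) => f.endsWith(".js"))
    .map((file) => {
      const buf = readFileSync(join(distDir, file));
      return { file, raw: buf.length, gzip: gzipSync(buf).length };
    });
}
```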
