Feat/prompt eval by MaryChen68 · Pull Request #435 · AIToolsLab/writing-tools

MaryChen68 · 2026-05-29T19:17:54Z

Runs all prompt types against a set of test documents and prints the outputs, so you can eyeball whether the LLM is giving good results.
When LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL are present in .env, every run is also logged to Langfuse so you can compare prompt versions in the UI.
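For reviewers who want a quick way to reason about when logging kicks in, here is a minimal sketch of the check described above. The three variable names come from this description; the helper functions `langfuse_configured` and `missing_langfuse_vars` are hypothetical and not taken from `eval_prompts.py`. The sketch assumes all three variables must be set, and that `.env` has already been loaded into the process environment by whatever mechanism the script uses.

```python
import os

# Variable names as listed in the PR description.
LANGFUSE_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_BASE_URL")


def missing_langfuse_vars(env=None):
    """Return the Langfuse variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in LANGFUSE_VARS if not env.get(name)]


def langfuse_configured(env=None):
    """Assumption: logging is enabled only when all three variables are set."""
    return not missing_langfuse_vars(env)


if __name__ == "__main__":
    if langfuse_configured():
        print("Langfuse logging would be enabled")
    else:
        print("Langfuse logging skipped; missing:", ", ".join(missing_langfuse_vars()))
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.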
Usage:

eyeball mode:
```
uv run python eval_prompts.py
```
compare two models side by side in the terminal:
```
uv run python eval_prompts.py --compare gpt-4o gpt-4o-mini
```
use a different LLM model:
```
uv run python eval_prompts.py --model <model_name>
```
send results to Langfuse as an experiment (the dataset is auto-created if missing):
```
uv run python eval_prompts.py --experiment <name of experiment>
```
Example:
```
uv run python eval_prompts.py --model gpt-5.4 --experiment gpt-5.4
```
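The flags above could be wired up roughly as follows. This is only a sketch of the command-line surface shown in the usage examples; it assumes `argparse`, assumes `--compare` takes exactly two model names (as in the example), and assumes each flag defaults to unset. The actual parser in `eval_prompts.py` is not shown on this page and may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed in the usage section."""
    parser = argparse.ArgumentParser(prog="eval_prompts.py")
    parser.add_argument(
        "--model",
        default=None,
        help="model name to run all prompts against",
    )
    parser.add_argument(
        "--compare",
        nargs=2,
        metavar=("MODEL_A", "MODEL_B"),
        default=None,
        help="two model names to compare side by side",
    )
    parser.add_argument(
        "--experiment",
        default=None,
        help="experiment name to send results to Langfuse under",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With no flags, every option stays `None`, which would correspond to the plain eyeball mode.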


MaryChen68 added 2 commits May 29, 2026 14:56

Add prompt evaluation script for comparing LLM output

5136cf5

Tests a bunch of writing prompts against sample documents to see how the LLM responds, and optionally saves the results to Langfuse for tracking.

e98a306

MaryChen68 requested a review from kcarnold May 29, 2026 19:19

MaryChen68 wants to merge 2 commits into main from feat/prompt-eval
