chore(deps-dev): bump autoevals from 0.0.130 to 0.2.0#1621
dependabot[bot] wants to merge 1 commit into `main`
Conversation
Bumps [autoevals](https://github.com/braintrustdata/autoevals) from 0.0.130 to 0.2.0. - [Release notes](https://github.com/braintrustdata/autoevals/releases) - [Changelog](https://github.com/braintrustdata/autoevals/blob/main/CHANGELOG.md) - [Commits](braintrustdata/autoevals@py-0.0.130...py-0.2.0) --- updated-dependencies: - dependency-name: autoevals dependency-version: 0.2.0 dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
| "langchain>=1,<2", | ||
| "langgraph>=1,<2", | ||
| "autoevals>=0.0.130,<0.1", | ||
| "autoevals>=0.0.130,<0.3", |
🟣 Pre-existing: create_evaluator_from_autoevals() in experiment.py:1046 passes evaluation.score directly to Evaluation(value=...) without a None guard; autoevals 0.2.0 formally declares Score.score: float | None = None (PR #48), making this path more likely to trigger. When score is None, it propagates silently through the unenforced type annotation, then is dropped from averages by the isinstance(evaluation.value, (int, float)) check at experiment.py:562-565, resulting in silent data loss.
Extended reasoning
What the bug is and how it manifests
In langfuse/experiment.py:1046, create_evaluator_from_autoevals() wraps an autoevals evaluator and constructs a Langfuse Evaluation object. It does so with:
```python
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,  # <-- no None check
    comment=...,
    metadata=...,
)
```

In autoevals 0.2.0, the `Score` class declares `score: float | None = None` with the docstring: "If the score is None, the evaluation is considered to be skipped." (introduced in autoevals PR #48, "Updates to track the fact that Scores can be null"). When an LLM-based scorer fails to parse a response or explicitly skips evaluation, it returns `score=None`.
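The None-score contract can be sketched in isolation. The `Score` shape below mirrors the autoevals 0.2.0 declaration quoted above; `flaky_llm_scorer` is a hypothetical stand-in for a failing LLM-based scorer, not actual autoevals code:

```python
from dataclasses import dataclass, field
from typing import Optional

# Stand-in mirroring the autoevals 0.2.0 declaration: score is
# Optional, and None means the evaluation was skipped.
@dataclass
class Score:
    name: str
    score: Optional[float] = None
    metadata: dict = field(default_factory=dict)

def flaky_llm_scorer(output: str) -> Score:
    # Hypothetical: an LLM scorer that failed to parse the model
    # response returns score=None instead of raising.
    return Score(name="Factuality", metadata={"error": "unparseable response"})

result = flaky_llm_scorer("some model output")
print(result.score is None)  # True: no exception, the None flows onward
```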
The specific code path
1. `autoevals_evaluator()` returns a `Score` with `.score = None`.
2. `Evaluation(value=None)` is constructed — Python does not enforce type annotations at runtime, so this succeeds silently (see `experiment.py:185`: `value: Union[int, float, str, bool]` with no validation, just `self.value = value` at line 205).
3. The `Evaluation` object flows into `ExperimentResult.format()` at lines 562–565:

   ```python
   if evaluation.name == eval_name and isinstance(evaluation.value, (int, float)):
       scores.append(evaluation.value)
   ```

   `isinstance(None, (int, float))` is `False`, so the score is silently dropped from averages.
4. Additionally, if `create_score(value=None)` is called via `_create_score_for_scope`, `ScoreBody` (which uses `CreateScoreValue = Union[float, str]`) raises a Pydantic `ValidationError` — but this is caught and only logged in `client.py`'s `except` block, further hiding the failure from the user.
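The silent drop can be reproduced in isolation with minimal stand-ins for `Evaluation` and the averaging filter in `format()` (hypothetical simplifications, not the actual langfuse classes):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Evaluation:
    # Annotation mirrors experiment.py:185; Python does not enforce it,
    # so value=None is accepted without error at construction time.
    name: str
    value: Union[int, float, str, bool, None]

evaluations = [
    Evaluation("Factuality", 1.0),
    Evaluation("Factuality", None),  # skipped scorer, silently accepted
    Evaluation("Factuality", 0.0),
]

# Mirrors the filter at experiment.py:562-565.
scores = [e.value for e in evaluations
          if e.name == "Factuality" and isinstance(e.value, (int, float))]

print(scores)                     # [1.0, 0.0] -- the None item vanished
print(sum(scores) / len(scores))  # 0.5, averaged over 2 items, not 3
```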
Why existing code does not prevent it
Evaluation.__init__ has no runtime validation. The isinstance check in format() was designed to skip string/bool values, not to handle None — there is no warning or logging when a None score is silently excluded.
What the impact would be
Users employing LLM-based autoevals scorers (e.g., Factuality, ClosedQA, etc.) may experience silent omission of scores for items where the LLM evaluation call fails. Average scores reported in ExperimentResult will be computed over fewer items than expected, skewing results upward without any indication that some items were excluded.
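A toy arithmetic illustration of the skew (all numbers hypothetical): if failed items are dropped rather than surfaced, the reported average looks perfect even when a fifth of the evaluations never ran.

```python
# Suppose 10 items: 8 scorer calls succeed with score 1.0, 2 fail (score=None).
scored = [1.0] * 8
failed = [None] * 2

# What the current filter reports: Nones vanish, average over 8 items.
reported = [s for s in scored + failed if isinstance(s, (int, float))]
reported_avg = sum(reported) / len(reported)
print(reported_avg)  # 1.0 -- looks perfect

# If skips were surfaced, the denominator would include all 10 items.
honest_avg = sum(scored) / (len(scored) + len(failed))
print(honest_avg)    # 0.8
```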
How to fix it
Add a None guard in create_evaluator_from_autoevals():
```python
if evaluation.score is None:
    return None  # or raise, or return a special sentinel
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,
    ...
)
```

Alternatively, log a warning and skip score creation explicitly so users are aware when evaluations are skipped.
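The warning-and-skip variant could look like the sketch below, using hypothetical stand-ins for the `Score` and `Evaluation` types (not the actual autoevals or langfuse classes) so the behavior is self-contained:

```python
import logging
from dataclasses import dataclass
from typing import Optional, Union

logging.basicConfig()
logger = logging.getLogger("langfuse.experiment")

# Hypothetical stand-ins for autoevals' Score and langfuse's Evaluation.
@dataclass
class Score:
    name: str
    score: Optional[float] = None

@dataclass
class Evaluation:
    name: str
    value: Union[int, float, str, bool]

def wrap_autoevals(score: Score) -> Optional[Evaluation]:
    """Guarded conversion: warn and return None when the scorer skipped."""
    if score.score is None:
        # Surface the skip instead of letting None flow into averages.
        logger.warning(
            "autoevals scorer %r returned score=None (evaluation skipped); "
            "no score will be recorded for this item.", score.name,
        )
        return None
    return Evaluation(name=score.name, value=score.score)

print(wrap_autoevals(Score("Factuality", 0.9)))   # Evaluation with value=0.9
print(wrap_autoevals(Score("Factuality", None)))  # warns, then None
```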
Step-by-step proof
1. User calls `create_evaluator_from_autoevals(Factuality())` to create a Langfuse evaluator.
2. During an experiment run, the OpenAI call inside `Factuality.eval_async()` fails or returns unparseable output.
3. autoevals 0.2.0 returns `Score(name="Factuality", score=None, metadata=...)` instead of raising.
4. `langfuse_evaluator` constructs `Evaluation(name="Factuality", value=None)` — no exception.
5. `ExperimentResult.format()` iterates evaluations, hits `isinstance(None, (int, float)) == False`, and silently skips the item.
6. The printed average score for "Factuality" is computed over N−k items, where k items silently failed, with no warning to the user.
Pre-existing status
The verifier refutation notes that the phrase "track the fact that Scores can be null" in PR #48 implies null scores may have been possible even in 0.0.130, and that the langfuse wrapper was never updated to handle them. This is a valid point: the bug is pre-existing in the wrapper code, and this PR does not modify experiment.py. However, autoevals 0.2.0 formally types and documents the null-score path, making it more likely to occur in practice; this bump is therefore a reasonable occasion to address it.
Bumps autoevals from 0.0.130 to 0.2.0.
Release notes
Sourced from autoevals's releases.
... (truncated)
Commits
- `a5854ee` chore: Publish python via trusted publishing and unify release process (#183)
- `398ded6` Add pnpm enforcement and config (#182)
- `443f631` Update pnpm version and use frozen lockfile (#181)
- `110e252` chore: Publish JS package via gha trusted publishing (#180)
- `5b4b90c` chore: Pin github actions to commit (#179)
- `c52da64` Bump to gpt5 models (#169)
- `71e61dd` Filter system messages (#177)
- `0d428fb` Trace injection in python to mirror the JS implementation (#175)
- `d99a37c` Add models configuration object to init() (#164)
- `d78f4ab` Fix MDX parsing by escaping curly braces in JSDoc comment (#174)

You can trigger a rebase of this PR by commenting `@dependabot rebase`.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Disclaimer: Experimental PR review
Greptile Summary
This PR bumps the dev-only `autoevals` dependency from `0.0.130` to `0.2.0` via `uv.lock`. The version is already within the `>=0.0.130,<0.3` range declared in `pyproject.toml`, so no manifest change is needed. Since `autoevals` is a `[dependency-groups]` dev dependency, production builds are unaffected.

Confidence Score: 5/5
Safe to merge — dev-only dependency bump with no production impact.
autoevals is a dev-only dependency; production builds and the published package are unaffected. The version 0.2.0 is within the already-declared constraint. The only nuance is that null scores (newly possible in 0.2.0) are not guarded in `create_evaluator_from_autoevals`, but that is a pre-existing gap and not introduced by this PR.

No files require special attention.
Important Files Changed
`>=0.0.130,<0.3` already accommodates 0.2.0.

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[autoevals 0.2.0\ndev dependency] -->|create_evaluator_from_autoevals| B[langfuse/experiment.py]
    B --> C[autoevals_evaluator called\nwith input/output/expected]
    C --> D{evaluation.score}
    D -->|numeric / string / bool| E[Evaluation\nname=evaluation.name\nvalue=evaluation.score]
    D -->|None\nnew in 0.2.0| F[Evaluation value=None\ntype mismatch — not enforced at runtime]
    E --> G[Returned to caller]
    F --> G
```

Reviews (1): Last reviewed commit: "chore(deps-dev): bump autoevals from 0.0..."