chore(deps-dev): bump autoevals from 0.0.130 to 0.2.0#1621
dependabot[bot] wants to merge 1 commit into `main`
Conversation
Bumps [autoevals](https://github.com/braintrustdata/autoevals) from 0.0.130 to 0.2.0. - [Release notes](https://github.com/braintrustdata/autoevals/releases) - [Changelog](https://github.com/braintrustdata/autoevals/blob/main/CHANGELOG.md) - [Commits](braintrustdata/autoevals@py-0.0.130...py-0.2.0) --- updated-dependencies: - dependency-name: autoevals dependency-version: 0.2.0 dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
| "langchain>=1,<2", | ||
| "langgraph>=1,<2", | ||
| "autoevals>=0.0.130,<0.1", | ||
| "autoevals>=0.0.130,<0.3", |
🟣 Pre-existing: create_evaluator_from_autoevals() in experiment.py:1046 passes evaluation.score directly to Evaluation(value=...) without a None guard; autoevals 0.2.0 formally declares Score.score: float | None = None (PR #48), making this path more likely to trigger. When score is None, it propagates silently through the unenforced type annotation, then is dropped from averages by the isinstance(evaluation.value, (int, float)) check at experiment.py:562-565, resulting in silent data loss.
Extended reasoning
What the bug is and how it manifests
In langfuse/experiment.py:1046, create_evaluator_from_autoevals() wraps an autoevals evaluator and constructs a Langfuse Evaluation object. It does so with:
```python
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,  # <-- no None check
    comment=...,
    metadata=...,
)
```

In autoevals 0.2.0, the `Score` class declares `score: float | None = None` with the docstring: "If the score is None, the evaluation is considered to be skipped." (introduced in autoevals PR #48, "Updates to track the fact that Scores can be null"). When an LLM-based scorer fails to parse a response or explicitly skips evaluation, it returns `score=None`.
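The None-score contract can be sketched in isolation. The `Score` shape below mirrors the autoevals 0.2.0 declaration quoted above; `flaky_llm_scorer` is a hypothetical stand-in for a failing LLM-based scorer, not actual autoevals code:

```python
from dataclasses import dataclass, field
from typing import Optional

# Stand-in mirroring the autoevals 0.2.0 declaration: score is
# Optional, and None means the evaluation was skipped.
@dataclass
class Score:
    name: str
    score: Optional[float] = None
    metadata: dict = field(default_factory=dict)

def flaky_llm_scorer(output: str) -> Score:
    # Hypothetical: an LLM scorer that failed to parse the model
    # response returns score=None instead of raising.
    return Score(name="Factuality", metadata={"error": "unparseable response"})

result = flaky_llm_scorer("some model output")
print(result.score is None)  # True: no exception, the None flows onward
```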
The specific code path
1. `autoevals_evaluator()` returns a `Score` with `.score = None`.
2. `Evaluation(value=None)` is constructed — Python does not enforce type annotations at runtime, so this succeeds silently (see `experiment.py:185`: `value: Union[int, float, str, bool]` with no validation, just `self.value = value` at line 205).
3. The `Evaluation` object flows into `ExperimentResult.format()` at lines 562–565:

   ```python
   if evaluation.name == eval_name and isinstance(evaluation.value, (int, float)):
       scores.append(evaluation.value)
   ```

   `isinstance(None, (int, float))` is `False`, so the score is silently dropped from averages.
4. Additionally, if `create_score(value=None)` is called via `_create_score_for_scope`, `ScoreBody` (which uses `CreateScoreValue = Union[float, str]`) raises a Pydantic `ValidationError` — but this is caught and only logged in `client.py`'s `except` block, further hiding the failure from the user.
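The silent drop can be reproduced in isolation with minimal stand-ins for `Evaluation` and the averaging filter in `format()` (hypothetical simplifications, not the actual langfuse classes):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Evaluation:
    # Annotation mirrors experiment.py:185; Python does not enforce it,
    # so value=None is accepted without error at construction time.
    name: str
    value: Union[int, float, str, bool, None]

evaluations = [
    Evaluation("Factuality", 1.0),
    Evaluation("Factuality", None),  # skipped scorer, silently accepted
    Evaluation("Factuality", 0.0),
]

# Mirrors the filter at experiment.py:562-565.
scores = [e.value for e in evaluations
          if e.name == "Factuality" and isinstance(e.value, (int, float))]

print(scores)                     # [1.0, 0.0] -- the None item vanished
print(sum(scores) / len(scores))  # 0.5, averaged over 2 items, not 3
```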
Why existing code does not prevent it
Evaluation.__init__ has no runtime validation. The isinstance check in format() was designed to skip string/bool values, not to handle None — there is no warning or logging when a None score is silently excluded.
What the impact would be
Users employing LLM-based autoevals scorers (e.g., Factuality, ClosedQA, etc.) may experience silent omission of scores for items where the LLM evaluation call fails. Average scores reported in ExperimentResult will be computed over fewer items than expected, skewing results upward without any indication that some items were excluded.
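A toy arithmetic illustration of the skew (all numbers hypothetical): if failed items are dropped rather than surfaced, the reported average looks perfect even when a fifth of the evaluations never ran.

```python
# Suppose 10 items: 8 scorer calls succeed with score 1.0, 2 fail (score=None).
scored = [1.0] * 8
failed = [None] * 2

# What the current filter reports: Nones vanish, average over 8 items.
reported = [s for s in scored + failed if isinstance(s, (int, float))]
reported_avg = sum(reported) / len(reported)
print(reported_avg)  # 1.0 -- looks perfect

# If skips were surfaced, the denominator would include all 10 items.
honest_avg = sum(scored) / (len(scored) + len(failed))
print(honest_avg)    # 0.8
```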
How to fix it
Add a None guard in create_evaluator_from_autoevals():
```python
if evaluation.score is None:
    return None  # or raise, or return a special sentinel
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,
    ...
)
```

Alternatively, log a warning and skip score creation explicitly so users are aware when evaluations are skipped.
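The warning-and-skip variant could look like the sketch below, using hypothetical stand-ins for the `Score` and `Evaluation` types (not the actual autoevals or langfuse classes) so the behavior is self-contained:

```python
import logging
from dataclasses import dataclass
from typing import Optional, Union

logging.basicConfig()
logger = logging.getLogger("langfuse.experiment")

# Hypothetical stand-ins for autoevals' Score and langfuse's Evaluation.
@dataclass
class Score:
    name: str
    score: Optional[float] = None

@dataclass
class Evaluation:
    name: str
    value: Union[int, float, str, bool]

def wrap_autoevals(score: Score) -> Optional[Evaluation]:
    """Guarded conversion: warn and return None when the scorer skipped."""
    if score.score is None:
        # Surface the skip instead of letting None flow into averages.
        logger.warning(
            "autoevals scorer %r returned score=None (evaluation skipped); "
            "no score will be recorded for this item.", score.name,
        )
        return None
    return Evaluation(name=score.name, value=score.score)

print(wrap_autoevals(Score("Factuality", 0.9)))   # Evaluation with value=0.9
print(wrap_autoevals(Score("Factuality", None)))  # warns, then None
```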
Step-by-step proof
1. User calls `create_evaluator_from_autoevals(Factuality())` to create a Langfuse evaluator.
2. During an experiment run, the OpenAI call inside `Factuality.eval_async()` fails or returns unparseable output.
3. autoevals 0.2.0 returns `Score(name="Factuality", score=None, metadata=...)` instead of raising.
4. `langfuse_evaluator` constructs `Evaluation(name="Factuality", value=None)` — no exception.
5. `ExperimentResult.format()` iterates evaluations, hits `isinstance(None, (int, float)) == False`, and silently skips the item.
6. The printed average score for "Factuality" is computed over N−k items, where k items silently failed, with no warning to the user.
Pre-existing status
The verifier refutation notes that the phrase "track the fact that Scores can be null" in PR #48 implies null scores may have been possible even in 0.0.130, and that the langfuse wrapper was never updated to handle them. This is a valid point: the bug is pre-existing in the wrapper code, and this PR does not modify experiment.py. However, autoevals 0.2.0 formally types and documents the null-score path, making it more likely to occur in practice; this bump is therefore a reasonable occasion to address it.
Bumps autoevals from 0.0.130 to 0.2.0.
Release notes
Sourced from autoevals's releases.
... (truncated)
Commits
- `a5854ee` chore: Publish python via trusted publishing and unify release process (#183)
- `398ded6` Add pnpm enforcement and config (#182)
- `443f631` Update pnpm version and use frozen lockfile (#181)
- `110e252` chore: Publish JS package via gha trusted publishing (#180)
- `5b4b90c` chore: Pin github actions to commit (#179)
- `c52da64` Bump to gpt5 models (#169)
- `71e61dd` Filter system messages (#177)
- `0d428fb` Trace injection in python to mirror the JS implementation (#175)
- `d99a37c` Add models configuration object to init() (#164)
- `d78f4ab` Fix MDX parsing by escaping curly braces in JSDoc comment (#174)

You can trigger a rebase of this PR by commenting `@dependabot rebase`.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Disclaimer: Experimental PR review
Greptile Summary
This PR bumps the dev-only `autoevals` dependency from `0.0.130` to `0.2.0` via `uv.lock`. The version is already within the `>=0.0.130,<0.3` range declared in `pyproject.toml`, so no manifest change is needed. Since `autoevals` is a `[dependency-groups]` dev dependency, production builds are unaffected.

Confidence Score: 5/5
Safe to merge — dev-only dependency bump with no production impact.
autoevals is a dev-only dependency; production builds and the published package are unaffected. The version 0.2.0 is within the already-declared constraint. The only nuance is that null scores (newly possible in 0.2.0) are not guarded in `create_evaluator_from_autoevals`, but that is a pre-existing gap and not introduced by this PR.

No files require special attention.
Important Files Changed
`>=0.0.130,<0.3` already accommodates 0.2.0.

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[autoevals 0.2.0\ndev dependency] -->|create_evaluator_from_autoevals| B[langfuse/experiment.py]
    B --> C[autoevals_evaluator called\nwith input/output/expected]
    C --> D{evaluation.score}
    D -->|numeric / string / bool| E[Evaluation\nname=evaluation.name\nvalue=evaluation.score]
    D -->|None\nnew in 0.2.0| F[Evaluation value=None\ntype mismatch — not enforced at runtime]
    E --> G[Returned to caller]
    F --> G
```

Reviews (1): Last reviewed commit: "chore(deps-dev): bump autoevals from 0.0..."