chore(deps-dev): bump autoevals from 0.0.130 to 0.2.0 #1621
dependabot[bot] wants to merge 1 commit into main
Conversation
```diff
     "langchain>=1,<2",
     "langgraph>=1,<2",
-    "autoevals>=0.0.130,<0.1",
+    "autoevals>=0.0.130,<0.3",
```
🟣 Pre-existing: `create_evaluator_from_autoevals()` in `experiment.py:1046` passes `evaluation.score` directly to `Evaluation(value=...)` without a `None` guard. autoevals 0.2.0 formally declares `Score.score: float | None = None` (PR #48), making this path more likely to trigger. When `score` is `None`, it propagates silently through the unenforced type annotation and is then dropped from averages by the `isinstance(evaluation.value, (int, float))` check at `experiment.py:562-565`, resulting in silent data loss.
Extended reasoning
What the bug is and how it manifests
In `langfuse/experiment.py:1046`, `create_evaluator_from_autoevals()` wraps an autoevals evaluator and constructs a Langfuse `Evaluation` object. It does so with:

```python
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,  # <-- no None check
    comment=...,
    metadata=...,
)
```

In autoevals 0.2.0, the `Score` class declares `score: float | None = None` with the docstring: "If the score is None, the evaluation is considered to be skipped." (Introduced in autoevals PR #48, "Updates to track the fact that Scores can be null".) When an LLM-based scorer fails to parse a response or explicitly skips evaluation, it returns `score=None`.
The specific code path
1. `autoevals_evaluator()` returns a `Score` with `.score = None`.
2. `Evaluation(value=None)` is constructed; Python does not enforce type annotations at runtime, so this succeeds silently (see `experiment.py:185`: `value: Union[int, float, str, bool]` with no validation, just `self.value = value` at line 205).
3. The `Evaluation` object flows into `ExperimentResult.format()` at lines 562–565:

   ```python
   if evaluation.name == eval_name and isinstance(evaluation.value, (int, float)):
       scores.append(evaluation.value)
   ```

   `isinstance(None, (int, float))` is `False`, so the score is silently dropped from averages.
4. Additionally, if `create_score(value=None)` is called via `_create_score_for_scope`, `ScoreBody` (which uses `CreateScoreValue = Union[float, str]`) raises a Pydantic `ValidationError`; but this is caught and only logged in `client.py`'s `except` block, further hiding the failure from the user.
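The silent-drop path above can be reproduced with plain Python stand-ins (hypothetical minimal classes assuming only the behavior described here, not the real langfuse types):

```python
from typing import Union

class Evaluation:
    """Stand-in mirroring the unvalidated __init__ described above."""
    def __init__(self, name: str, value: Union[int, float, str, bool], comment: str = ""):
        self.name = name
        self.value = value  # annotation is not enforced at runtime: None slips through

evaluations = [
    Evaluation("Factuality", 0.8),
    Evaluation("Factuality", None),  # a skipped autoevals score, accepted silently
    Evaluation("Factuality", 1.0),
]

# The format()-style aggregation then drops the None with no warning:
scores = [
    e.value
    for e in evaluations
    if e.name == "Factuality" and isinstance(e.value, (int, float))
]
average = sum(scores) / len(scores)  # computed over 2 of the 3 items
```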
Why existing code does not prevent it
`Evaluation.__init__` has no runtime validation. The `isinstance` check in `format()` was designed to skip string/bool values, not to handle `None`; there is no warning or logging when a `None` score is silently excluded.
What the impact would be
Users employing LLM-based autoevals scorers (e.g., `Factuality`, `ClosedQA`) may see scores silently omitted for items where the LLM evaluation call fails. Average scores reported in `ExperimentResult` will be computed over fewer items than expected, potentially skewing results upward without any indication that some items were excluded.
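A quick numeric illustration of the skew (the scores are made up; the `None`s stand for failed LLM calls):

```python
# 10 experiment items; 3 LLM evaluation calls fail and come back as score=None
raw_scores = [0.9, 0.8, None, 1.0, None, 0.7, 0.9, None, 0.8, 1.0]

kept = [s for s in raw_scores if isinstance(s, (int, float))]
reported = sum(kept) / len(kept)  # averaged over only the 7 surviving items

# For contrast, counting the skipped items as 0.0 (one possible "failure" reading)
# pulls the average down noticeably:
pessimistic = sum(s if s is not None else 0.0 for s in raw_scores) / len(raw_scores)
```

The reported average looks healthy while a third of the items were never scored at all, and nothing in the output distinguishes the two situations.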
How to fix it
Add a None guard in create_evaluator_from_autoevals():
if evaluation.score is None:
return None # or raise, or return a special sentinel
return Evaluation(
name=evaluation.name,
value=evaluation.score,
...
)Alternatively, log a warning and skip score creation explicitly so users are aware when evaluations are skipped.
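A minimal sketch of the warn-and-skip variant (the function name, dict return shape, and logger name are assumptions for illustration; the real helper lives in `langfuse/experiment.py` and returns an `Evaluation`):

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("langfuse.experiment")

def evaluation_from_autoevals(evaluation):
    """Convert an autoevals Score-like object into a plain dict, guarding None.

    Returns None (after logging a warning) when the scorer skipped the item,
    so the caller can drop it explicitly instead of letting it vanish later.
    """
    if evaluation.score is None:
        logger.warning(
            "autoevals scorer %r returned score=None (evaluation skipped); "
            "no score will be recorded for this item.",
            evaluation.name,
        )
        return None
    return {"name": evaluation.name, "value": evaluation.score}

# Duck-typed stand-ins for autoevals Score objects:
dropped = evaluation_from_autoevals(SimpleNamespace(name="Factuality", score=None))
recorded = evaluation_from_autoevals(SimpleNamespace(name="Factuality", score=0.5))
```

Returning `None` keeps the caller in control of how skipped items affect the aggregate, while the warning makes the skip visible in logs.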
Step-by-step proof
1. User calls `create_evaluator_from_autoevals(Factuality())` to create a Langfuse evaluator.
2. During an experiment run, the OpenAI call inside `Factuality.eval_async()` fails or returns unparseable output.
3. autoevals 0.2.0 returns `Score(name="Factuality", score=None, metadata=...)` instead of raising.
4. `langfuse_evaluator` constructs `Evaluation(name="Factuality", value=None)`; no exception is raised.
5. `ExperimentResult.format()` iterates evaluations, hits `isinstance(None, (int, float)) == False`, and silently skips the item.
6. The printed average score for "Factuality" is computed over N-k items, where k items silently failed, with no warning to the user.
Pre-existing status
The verifier refutation notes that the phrase "track the fact that Scores can be null" in PR #48 implies null scores may have been possible even in 0.0.130, and that the langfuse wrapper was never updated to handle them. This is a valid point: the bug is pre-existing in the wrapper code, and this PR does not modify experiment.py. However, autoevals 0.2.0 formally types and documents the null-score path, making it more likely to occur in practice and making this bump a reasonable occasion to address it.
Force-pushed from eeaf18d to de9b89f
Bumps [autoevals](https://github.com/braintrustdata/autoevals) from 0.0.130 to 0.2.0.
- [Release notes](https://github.com/braintrustdata/autoevals/releases)
- [Changelog](https://github.com/braintrustdata/autoevals/blob/main/CHANGELOG.md)
- [Commits](braintrustdata/autoevals@py-0.0.130...py-0.2.0)

---
updated-dependencies:
- dependency-name: autoevals
  dependency-version: 0.2.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Force-pushed from de9b89f to 07727c0
Bumps autoevals from 0.0.130 to 0.2.0.
Release notes
Sourced from autoevals's releases.
... (truncated)
Commits
- `a5854ee` chore: Publish python via trusted publishing and unify release process (#183)
- `398ded6` Add pnpm enforcement and config (#182)
- `443f631` Update pnpm version and use frozen lockfile (#181)
- `110e252` chore: Publish JS package via gha trusted publishing (#180)
- `5b4b90c` chore: Pin github actions to commit (#179)
- `c52da64` Bump to gpt5 models (#169)
- `71e61dd` Filter system messages (#177)
- `0d428fb` Trace injection in python to mirror the JS implementation (#175)
- `d99a37c` Add models configuration object to init() (#164)
- `d78f4ab` Fix MDX parsing by escaping curly braces in JSDoc comment (#174)