diff --git a/sites/docs/src/content/ai/evals.md b/sites/docs/src/content/ai/evals.md
new file mode 100644
index 0000000000..35f66c6a19
--- /dev/null
+++ b/sites/docs/src/content/ai/evals.md
@@ -0,0 +1,30 @@
+---
+title: AI Evaluations
+sidenav: ai
+description: >
+  Learn about Dart and Flutter's evaluation frameworks for measuring AI tooling
+  reliability.
+---
+
+:::experimental
+Evaluation tooling and benchmarks are experimental and likely to change.
+:::
+
+To explore the evaluation strategy,
+view the open-source dataset and scoring rubrics,
+or get involved with community benchmark datasets,
+visit the [Flutter Evals repository](https://github.com/flutter/evals).
+
+Evaluating the capabilities and reliability of AI agents requires testing
+approaches that model actual developer tasks.
+Because LLMs are non-deterministic,
+standard unit testing is insufficient for verifying agentic behaviors like
+codebase navigation, plan execution, and code synthesis.
+
+To build developer confidence in AI tooling,
+Dart and Flutter use an evaluation system ("evals")
+to test critical user journeys (CUJs).
+Evals measure both deterministic code correctness
+(compilation, lints, automated tests) and qualitative performance
+(reasoning, safety, and conciseness) using automated model judges
+and expert human grading.
diff --git a/sites/docs/src/data/sidenav/ai.yml b/sites/docs/src/data/sidenav/ai.yml
index d7643f3bc1..f077ff7bf8 100644
--- a/sites/docs/src/data/sidenav/ai.yml
+++ b/sites/docs/src/data/sidenav/ai.yml
@@ -19,6 +19,8 @@
       permalink: /ai/gemini-cli-extension
     - title: Developer experience
       permalink: /ai/best-practices/developer-experience
+    - title: "AI Evaluations (experimental)"
+      permalink: /ai/evals
 
 - title: Build AI-powered apps
   expanded: true