26 changes: 26 additions & 0 deletions apps/web/src/content/docs/docs/evaluation/sdk.mdx
@@ -356,6 +356,32 @@ const { results, summary } = await evaluate({
console.log(`${summary.passed}/${summary.total} passed`);
```

For a strict OR, use an inline `assert` handler that scores `1` when any condition matches:

```typescript
import { evaluate } from '@agentv/sdk';

const { summary } = await evaluate({
  tests: [
    {
      id: 'capital',
      input: 'What is the capital of France?',
      expectedOutput: 'Paris',
      assert: [
        ({ output }) => ({
          name: 'capital-or-phrase',
          score: ((output ?? '').includes('Paris') || /capital of france/i.test(output ?? '')) ? 1 : 0,
        }),
      ],
    },
  ],
  task: async (input) => `Agent: ${input}`,
  threshold: 0.8,
});

console.log(`${summary.passed}/${summary.total} passed`);
```

Auto-discovers the `default` target from `.agentv/targets.yaml` and `.env` credentials.

### File-Based via `specFile`
81 changes: 80 additions & 1 deletion apps/web/src/content/docs/docs/graders/composite.mdx
@@ -32,7 +32,7 @@ assertions:
Each sub-grader runs independently, then the aggregator combines their results.
Use `assertions` for composite members. `graders` is still accepted for backward compatibility.

-If you only need weighted-average aggregation, a plain test-level `assertions` list already computes a weighted mean across graders. Use `composite` when you need a custom aggregation strategy (`threshold`, `code_grader`, `llm_grader`) or nested grader groups.
+If you only need weighted-average aggregation, a plain test-level `assertions` list already computes a weighted mean across graders. Use `composite` when you need a custom aggregation strategy (`threshold`, `code-grader`, `llm-grader`) or nested grader groups.

## Aggregator Types

@@ -57,6 +57,85 @@ The score is calculated as:
final_score = sum(score_i * weight_i) / sum(weight_i)
```
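
As a worked instance of the formula above, here is a minimal TypeScript sketch. The `ChildResult` type and the `weightedAverage` helper are hypothetical names used for illustration, not part of the AgentV API, and returning `0` when the weights sum to zero is an assumption.

```typescript
// Hypothetical shape for illustration only; not an AgentV type.
interface ChildResult {
  score: number;
  weight: number;
}

// final_score = sum(score_i * weight_i) / sum(weight_i)
function weightedAverage(children: ChildResult[]): number {
  const totalWeight = children.reduce((sum, c) => sum + c.weight, 0);
  // Assumption: with no weight to divide by, report a score of 0.
  if (totalWeight === 0) return 0;
  const weightedSum = children.reduce((sum, c) => sum + c.score * c.weight, 0);
  return weightedSum / totalWeight;
}

// One passing child with weight 3 and one failing child with weight 1.
const example = weightedAverage([
  { score: 1, weight: 3 },
  { score: 0, weight: 1 },
]);
console.log(example); // 0.75
```

The heavier child dominates: `(1 * 3 + 0 * 1) / (3 + 1) = 0.75`.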

## Composition Patterns

### AND Logic

Use a `threshold` aggregator with `1.0` so all child graders must pass:

```yaml
assertions:
  - name: all_must_pass
    type: composite
    aggregator:
      type: threshold
      threshold: 1.0
    assertions:
      - name: mentions-capital
        type: contains
        value: capital
      - name: mentions-paris
        type: contains
        value: Paris
```

### OR Logic (Approximate)

`weighted_average` can work for “any should pass” when your child scores are binary (`0`/`1`):

```yaml
assertions:
  - name: any_match
    type: composite
    aggregator:
      type: weighted_average
    assertions:
      - type: contains
        value: Paris
      - type: icontains
        value: "the capital of france is paris"
```

Because this is an average, the final score is the fraction of passing children (`1/2` here when one assertion passes). To pass on any single hit with binary children, set the parent test threshold to `1 / N` (`0.5` for two children), or use the custom aggregator described under OR Logic (Strict).
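
To make the `1 / N` arithmetic concrete, the sketch below averages equally weighted child scores and compares the result to a `1 / N` threshold. `averageScore` and `passesAnyOf` are hypothetical helpers, and the `>=` comparison is an assumption about how the threshold is applied.

```typescript
// Hypothetical helper: unweighted mean of the child scores.
function averageScore(childScores: number[]): number {
  if (childScores.length === 0) return 0;
  const sum = childScores.reduce((a, b) => a + b, 0);
  return sum / childScores.length;
}

// Hypothetical helper: assumes a test passes when score >= threshold,
// with the threshold set to 1 / N for N children.
function passesAnyOf(childScores: number[]): boolean {
  if (childScores.length === 0) return false;
  const threshold = 1 / childScores.length;
  return averageScore(childScores) >= threshold;
}

console.log(averageScore([1, 0])); // 0.5
console.log(passesAnyOf([1, 0])); // true
console.log(passesAnyOf([0, 0])); // false
console.log(passesAnyOf([0.5, 0.5])); // true
```

The last call shows why this pattern is only approximate: two children at `0.5` each also reach the `0.5` threshold, even though neither scored `1`.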

### OR Logic (Strict)

For a strict OR, add a custom `code-grader` aggregator that returns a score of `1` when any child grader passes:

```yaml
assertions:
  - name: strict_or
    type: composite
    aggregator:
      type: code-grader
      path: ./scripts/or-aggregator.js
    assertions:
      - name: mentions-paris
        type: contains
        value: Paris
      - name: mentions-capital
        type: contains
        value: capital
```

```javascript
// examples/features/composite/scripts/or-aggregator.js
const fs = require('node:fs');

// Read the JSON payload from stdin (file descriptor 0).
const payload = JSON.parse(fs.readFileSync(0, 'utf8'));
// Fall back to an empty object so a missing `results` key yields "fail" instead of throwing.
const results = Object.values(payload.results ?? {});
// A child without a verdict is treated as a failure.
const anyPassed = results.some((r) => (r.verdict ?? 'fail') === 'pass');

console.log(
  JSON.stringify({
    score: anyPassed ? 1 : 0,
    verdict: anyPassed ? 'pass' : 'fail',
    assertions: [{ text: `Any-or gate: ${anyPassed ? 'passed' : 'failed'}`, passed: anyPassed }],
  }),
);
```
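
The same OR gate can be written as a pure function, which makes it easy to check without piping JSON through stdin. This is a sketch only: the `AggregatorPayload` shape is inferred from the script above rather than taken from a confirmed AgentV type, and `aggregateOr` is a hypothetical name.

```typescript
// Assumed payload shape, inferred from the script above; not a confirmed AgentV type.
type Verdict = 'pass' | 'fail';

interface ChildResult {
  verdict?: Verdict;
  score?: number;
}

interface AggregatorPayload {
  results: Record<string, ChildResult>;
}

interface OrResult {
  score: number;
  verdict: Verdict;
}

// Hypothetical helper: passes when at least one child verdict is 'pass'.
function aggregateOr(payload: AggregatorPayload): OrResult {
  const anyPassed = Object.values(payload.results).some((r) => r.verdict === 'pass');
  return anyPassed ? { score: 1, verdict: 'pass' } : { score: 0, verdict: 'fail' };
}

const sample: AggregatorPayload = {
  results: {
    'mentions-paris': { verdict: 'pass', score: 1 },
    'mentions-capital': { verdict: 'fail', score: 0 },
  },
};

console.log(JSON.stringify(aggregateOr(sample))); // {"score":1,"verdict":"pass"}
```

Unlike the averaged variant, one passing child yields a final score of `1` regardless of how many children fail.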

### Code Grader Aggregator

Run a custom command to decide the final score based on all grader results:
3 changes: 2 additions & 1 deletion examples/features/composite/README.md
@@ -6,7 +6,7 @@ Demonstrates composite grader patterns for combining multiple evaluation criteri

- Combining multiple graders in a single test case
- Weighted scoring across graders
-- AND/OR logic patterns
+- AND/OR logic patterns (documented in the docs page)
- Hierarchical evaluation strategies

## Running
@@ -19,3 +19,4 @@ bun agentv eval examples/features/composite/evals/dataset.eval.yaml
## Key Files

- `evals/dataset.eval.yaml` - Test cases with composite grader patterns
- `apps/web/src/content/docs/docs/graders/composite.mdx` - Detailed AND/OR and strict-OR composition guidance