5 changes: 5 additions & 0 deletions config/_default/menus/main.en.yaml
@@ -5277,6 +5277,11 @@ menu:
parent: llm_obs_external_evaluations
identifier: llm_obs_deepeval_evaluations
weight: 40301
- name: Pydantic Evaluations
url: llm_observability/evaluations/pydantic_evaluations
parent: llm_obs_external_evaluations
identifier: llm_obs_pydantic_evaluations
weight: 40302
- name: Annotation Queues
url: llm_observability/evaluations/annotation_queues
parent: llm_obs_evaluations
@@ -1,6 +1,10 @@
---
title: DeepEval Evaluations
description: Use DeepEval evaluations with LLM Observability Experiments.
further_reading:
- link: "/llm_observability/evaluations/external_evaluations"
tag: "Documentation"
text: "Submit Evaluations"
---

## Overview
136 changes: 136 additions & 0 deletions content/en/llm_observability/evaluations/pydantic_evaluations.md
@@ -0,0 +1,136 @@
---
title: Pydantic Evaluations
description: Use Pydantic evaluations with LLM Observability Experiments.
further_reading:
- link: "/llm_observability/evaluations/external_evaluations"
tag: "Documentation"
text: "Submit Evaluations"
---

## Overview

Pydantic Evals is an open source framework that provides ready-to-use evaluators and supports fully customizable LLM evaluations. For more information, see [Pydantic's documentation][3].

You can use LLM Observability to run Pydantic evaluations and scalar Pydantic report evaluations in [Experiments][1]. Pydantic evaluator results appear as evaluator results tied to each instance in an [LLM Observability dataset][5], while Pydantic report evaluations appear as a single scalar result tied to the dataset as a whole.

## Setup

1. Set up an [LLM Observability Experiment][2] and an [LLM Observability Dataset][4].
2. Provide a Pydantic evaluator to the `evaluators` parameter of an LLMObs `Experiment`, as demonstrated in the following code sample. Optionally, provide a Pydantic report evaluator to the `summary_evaluators` parameter. **Note**: Only Pydantic report evaluators that return a `ScalarResult` are supported.
   For a working example, see [Datadog's Pydantic demo in GitHub][6].

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

from pydantic_evals.evaluators import (
    EqualsExpected,
    EvaluationReason,
    Evaluator,
    EvaluatorContext,
    EvaluatorOutput,
    ReportEvaluator,
    ReportEvaluatorContext,
)
from pydantic_evals.reporting.analyses import ScalarResult

from ddtrace.llmobs import LLMObs


LLMObs.enable(
    api_key="<YOUR_API_KEY>",  # defaults to DD_API_KEY environment variable
    app_key="<YOUR_APP_KEY>",  # defaults to DD_APP_KEY environment variable
    site="datadoghq.com",  # defaults to DD_SITE environment variable
    project_name="<YOUR_PROJECT>",  # defaults to DD_LLMOBS_PROJECT_NAME environment variable, or "default-project" if the environment variable is not set
)


# This can be any Pydantic evaluator
@dataclass
class ComprehensiveCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> EvaluatorOutput:
        format_valid = self._check_format(ctx.output)

        return {
            'valid_format': EvaluationReason(
                value=format_valid,
                reason='Valid JSON format' if format_valid else 'Invalid JSON format',
            ),
            'quality_score': self._score_quality(ctx.output),
            'category': self._classify(ctx.output),
        }

    def _check_format(self, output: str) -> bool:
        return output.startswith('{') and output.endswith('}')

    def _score_quality(self, output: str) -> float:
        return len(output) / 100.0

    def _classify(self, output: str) -> str:
        return 'short' if len(output) < 50 else 'long'


# This can be any Pydantic ReportEvaluator that returns a ScalarResult
class TotalCasesEvaluator(ReportEvaluator):
    def evaluate(self, ctx: ReportEvaluatorContext) -> ScalarResult:
        return ScalarResult(
            title='Total',
            value=len(ctx.report.cases),
            unit='cases',
        )


dataset = LLMObs.create_dataset(
    dataset_name="capitals-of-the-world",
    project_name="capitals-project",  # optional, defaults to project_name used in LLMObs.enable
    description="Questions about world capitals",
    records=[
        {
            "input_data": {
                "question": "What is the capital of China?"
            },  # required, JSON or string
            "expected_output": "Beijing",  # optional, JSON or string
            "metadata": {"difficulty": "easy"},  # optional, JSON
        },
        {
            "input_data": {
                "question": "Which city serves as the capital of South Africa?"
            },
            "expected_output": "Pretoria",
            "metadata": {"difficulty": "medium"},
        },
    ],
)


def task(input_data: Dict[str, Any], config: Optional[Dict[str, Any]] = None, metadata: Optional[Dict[str, Any]] = None) -> str:
    question = input_data['question']
    # Your LLM or processing logic here
    return "Beijing" if "China" in question else "Unknown"


experiment = LLMObs.experiment(
    name="<EXPERIMENT_NAME>",
    task=task,
    dataset=dataset,
    evaluators=[EqualsExpected(), ComprehensiveCheck()],
    summary_evaluators=[TotalCasesEvaluator()],
    description="<EXPERIMENT_DESCRIPTION>",
)

results = experiment.run(jobs=4, raise_errors=True)

print(f"View experiment: {experiment.url}")
```
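The helper methods in `ComprehensiveCheck` are plain Python, so you can sanity-check the value each evaluation field would produce before wiring the evaluator into an experiment. The following standalone sketch mirrors that logic with no Pydantic or Datadog dependency (the sample outputs are illustrative, not part of the SDK):

```python
# Standalone sketch of the ComprehensiveCheck field logic (illustrative only).

def check_format(output: str) -> bool:
    # Cheap structural check: does the output look like a JSON object?
    return output.startswith('{') and output.endswith('}')

def score_quality(output: str) -> float:
    # Toy quality proxy: longer outputs score higher.
    return len(output) / 100.0

def classify(output: str) -> str:
    return 'short' if len(output) < 50 else 'long'

for sample in ['{"capital": "Beijing"}', 'Beijing']:
    print(sample, check_format(sample), score_quality(sample), classify(sample))
```

Each field in the returned mapping becomes its own evaluator result in Datadog, which is why the sketch exercises the three helpers independently.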

### Usage
After you run an experiment with Pydantic evaluations, you can view the per-instance Pydantic evaluation results in the corresponding experiment run in Datadog. The experiment below ran two Pydantic evaluators (the custom `ComprehensiveCheck` and the built-in `EqualsExpected`) and one custom Pydantic report evaluator (`TotalCasesEvaluator`):

{{< img src="llm_observability/pydantic-experiment-result.png" alt="An LLM Observability experiment with a Pydantic evaluator." style="width:100%;" >}}

## Further reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: /llm_observability/experiments
[2]: /llm_observability/experiments/setup#create-an-experiment
[3]: https://ai.pydantic.dev/evals/
[4]: /llm_observability/experiments/setup#create-a-dataset
[5]: /llm_observability/experiments/datasets
[6]: https://github.com/DataDog/llm-observability/blob/main/experiments/eval-integrations/2-pydantic-demo.py
@@ -49,7 +49,7 @@ The typical flow:

## Building evaluators

There are two ways to define an evaluator using LLM Observability: class-based and function-based. In addition to these evaluators, LLM Observability has integrations with open source evaluation frameworks, such as [DeepEval][6], that can be used in LLM Observability Experiments.
There are two ways to define an evaluator using LLM Observability: class-based and function-based. In addition to these evaluators, LLM Observability has integrations with open source evaluation frameworks, such as [DeepEval][6] and [Pydantic][8], that can be used in LLM Observability Experiments.

| | Class-based | Function-based |
|---|---|---|
@@ -702,3 +702,4 @@ When submitting evaluations for [OpenTelemetry-instrumented spans][3], include t
[5]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations
[6]: /llm_observability/evaluations/deepeval_evaluations/
[7]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations#configure-the-prompt
[8]: /llm_observability/evaluations/pydantic_evaluations