
Commit d5fc2ff

Add doc for reading experiment results locally
1 parent d13f312 commit d5fc2ff

4 files changed: +173 -0 lines changed

pipeline/preprocessors/link_map.py

Lines changed: 5 additions & 0 deletions
@@ -193,6 +193,11 @@ class LinkMap(TypedDict):
     "on_llm_new_token": "langchain_core/callbacks/#langchain_core.callbacks.base.AsyncCallbackHandler.on_llm_new_token",
     # Rate limiters
     "InMemoryRateLimiter": "langchain_core/rate_limiters/#langchain_core.rate_limiters.InMemoryRateLimiter",
+    # LangSmith SDK
+    "Client": "langsmith/observability/sdk/client/#langsmith.client.Client",
+    "Client.evaluate": "langsmith/observability/sdk/client/#langsmith.client.Client.evaluate",
+    "Client.aevaluate": "langsmith/observability/sdk/client/#langsmith.client.Client.aevaluate",
+    "Client.get_experiment_results": "langsmith/observability/sdk/client/#langsmith.client.Client.get_experiment_results",
     # LangGraph
     "get_stream_writer": "langgraph/config/#langgraph.config.get_stream_writer",
     "StateGraph": "langgraph/graphs/#langgraph.graph.state.StateGraph",

src/docs.json

Lines changed: 1 addition & 0 deletions
@@ -1065,6 +1065,7 @@
       "langsmith/repetition",
       "langsmith/rate-limiting",
       "langsmith/local",
+      "langsmith/read-local-experiment-results",
       "langsmith/langchain-runnable",
       "langsmith/evaluate-graph",
       "langsmith/evaluate-existing-experiment",

src/langsmith/local.mdx

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,10 @@ You can do this by using the LangSmith Python SDK and passing `upload_results=Fa
 
 This will run your application and evaluators exactly as it always does and return the same output, but nothing will be recorded to LangSmith. This includes not just the experiment results but also the application and evaluator traces.
 
+<Note>
+If you want to upload results to LangSmith but also need to process them in your script (for quality gates, custom aggregations, etc.), refer to [Read experiment results locally](/langsmith/read-local-experiment-results).
+</Note>
+
 ## Example
 
 Let's take a look at an example:
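For reference, the pattern this page documents and that the added note contrasts with: passing `upload_results=False` runs everything but records nothing to LangSmith. The sketch below assumes the standard `Client.evaluate()` parameters; the dataset name and functions are placeholders, not part of this diff.

```python
from langsmith import Client

client = Client()

def target(inputs):
    # Placeholder application logic
    return {"output": "MY OUTPUT"}

def evaluator(run, example):
    # Placeholder evaluator
    return {"key": "exact_match", "score": 1}

# Runs the app and evaluators as usual, but records nothing to LangSmith.
results = client.evaluate(
    target,
    data="MY_DATASET_NAME",  # placeholder dataset name
    evaluators=[evaluator],
    upload_results=False,
)
```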
src/langsmith/read-local-experiment-results.mdx (new file)

Lines changed: 163 additions & 0 deletions
---
title: How to read experiment results locally
sidebarTitle: Read experiment results locally
---

When running [evaluations](/langsmith/evaluation-concepts), you may want to process results programmatically in your script rather than viewing them in the [LangSmith UI](https://smith.langchain.com). This is useful for scenarios like:

- **CI/CD pipelines**: Implement quality gates that fail builds if evaluation scores drop below a threshold.
- **Local debugging**: Inspect and analyze results without additional API calls.
- **Custom aggregations**: Calculate metrics and statistics using your own logic.
- **Integration testing**: Use evaluation results to gate merges or deployments.

This guide shows how to read and process [experiment](/langsmith/evaluation-concepts#experiment) results directly from the @[`Client.evaluate()`][Client.evaluate] response.

<Note>
This page focuses on processing results programmatically while still uploading them to LangSmith.

If you want to run evaluations locally **without** recording anything to LangSmith (for quick testing or validation), refer to [Run an evaluation locally](/langsmith/local), which uses `upload_results=False`.
</Note>
## Iterate over evaluation results

The @[`evaluate()`][Client.evaluate] function returns an iterator when called with `blocking=False`. This allows you to process results as they're produced:

```python
import random

from langsmith import Client

client = Client()

def target(inputs):
    """Your application or LLM chain."""
    return {"output": "MY OUTPUT"}

def evaluator(run, example):
    """Your evaluator function."""
    return {"key": "randomness", "score": random.randint(0, 1)}

# Run evaluation with blocking=False to get an iterator
streamed_results = client.evaluate(
    target,
    data="MY_DATASET_NAME",
    evaluators=[evaluator],
    blocking=False,
)

# Collect results as they stream in
aggregated_results = []
for result in streamed_results:
    aggregated_results.append(result)

# Print in a separate loop so your output isn't interleaved with logs from evaluate()
for result in aggregated_results:
    print("Input:", result["run"].inputs)
    print("Output:", result["run"].outputs)
    print("Evaluation Results:", result["evaluation_results"]["results"])
    print("--------------------------------")
```

This produces output like:

```
Input: {'input': 'MY INPUT'}
Output: {'output': 'MY OUTPUT'}
Evaluation Results: [EvaluationResult(key='randomness', score=1, value=None, comment=None, correction=None, evaluator_info={}, feedback_config=None, source_run_id=UUID('7ebb4900-91c0-40b0-bb10-f2f6a451fd3c'), target_run_id=None, extra=None)]
--------------------------------
```
## Understand the result structure

Each result in the iterator contains the following fields (a short access sketch follows this list):

- `result["run"]`: The execution of your target function.
  - `result["run"].inputs`: The inputs from your [dataset](/langsmith/evaluation-concepts#datasets) example.
  - `result["run"].outputs`: The outputs produced by your target function.
  - `result["run"].id`: The unique ID for this run.

- `result["evaluation_results"]["results"]`: A list of `EvaluationResult` objects, one per evaluator.
  - `key`: The metric name (from your evaluator's return value).
  - `score`: The numeric score (typically 0-1 or boolean).
  - `comment`: Optional explanatory text.
  - `source_run_id`: The ID of the evaluator run.

- `result["example"]`: The dataset example that was evaluated.
  - `result["example"].inputs`: The input values.
  - `result["example"].outputs`: The reference outputs (if any).
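As a quick illustration of those fields, here is a minimal sketch that flattens one result into a plain dictionary. The `summarize_result` helper and its output shape are illustrative, not part of the SDK.

```python
def summarize_result(result):
    """Flatten a single evaluate() result into a plain dict (hypothetical helper)."""
    return {
        "run_id": str(result["run"].id),
        "inputs": result["run"].inputs,
        "outputs": result["run"].outputs,
        "reference": result["example"].outputs,
        # One entry per evaluator: {metric name: score}
        "scores": {
            er.key: er.score
            for er in result["evaluation_results"]["results"]
        },
    }

# Usage with the iterator from the previous example:
# for result in streamed_results:
#     print(summarize_result(result))
```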
## Example: Implement a quality gate

This example shows how to use evaluation results to pass or fail a CI/CD build automatically based on a quality threshold. The script iterates through results, calculates an average accuracy score, and exits with a non-zero status code if the accuracy falls below 85%. This ensures that only code changes meeting the quality bar get deployed.

```python
import sys

from langsmith import Client

client = Client()

def my_application(inputs):
    # Your application logic
    return {"response": "..."}

def accuracy_evaluator(run, example):
    # Your evaluation logic
    is_correct = run.outputs["response"] == example.outputs["expected"]
    return {"key": "accuracy", "score": 1 if is_correct else 0}

# Run evaluation
results = client.evaluate(
    my_application,
    data="my_test_dataset",
    evaluators=[accuracy_evaluator],
    blocking=False,
)

# Calculate aggregate metrics
total_score = 0
count = 0

for result in results:
    eval_result = result["evaluation_results"]["results"][0]
    total_score += eval_result.score
    count += 1

average_accuracy = total_score / count

print(f"Average accuracy: {average_accuracy:.2%}")

# Fail the build if accuracy is too low
if average_accuracy < 0.85:
    print("❌ Evaluation failed! Accuracy below 85% threshold.")
    sys.exit(1)

print("✅ Evaluation passed!")
```
## Example: Collect results for analysis

Sometimes you may want to collect all results before processing them. This is useful when you need to perform operations over the full result set, such as calculating percentiles, sorting by score, or generating summary reports. Collecting results first also keeps your output from being interleaved with the logging from `evaluate()`.

```python
# Collect all results first
all_results = []
for result in client.evaluate(target, data=dataset, evaluators=[evaluator], blocking=False):
    all_results.append(result)

# Then process them separately
# (This avoids mixing your print statements with evaluation logs)
for result in all_results:
    print("Input:", result["run"].inputs)
    print("Output:", result["run"].outputs)

    # Access individual evaluation results
    for eval_result in result["evaluation_results"]["results"]:
        print(f"  {eval_result.key}: {eval_result.score}")
```
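Building on the paragraph above, here is a minimal sketch of the kind of whole-set analysis that collecting results enables: per-metric summary statistics with the standard library and sorting runs by score. It assumes the `all_results` list from the previous block and that each evaluator returns a numeric score.

```python
import statistics
from collections import defaultdict

# Group scores by metric name across all results
scores_by_metric = defaultdict(list)
for result in all_results:
    for eval_result in result["evaluation_results"]["results"]:
        if eval_result.score is not None:
            scores_by_metric[eval_result.key].append(eval_result.score)

# Per-metric summary: mean and median
for metric, scores in scores_by_metric.items():
    print(f"{metric}: mean={statistics.mean(scores):.2f}, median={statistics.median(scores):.2f}")

# Lowest-scoring runs for the first evaluator, e.g. to inspect failures
worst = sorted(
    all_results,
    key=lambda r: r["evaluation_results"]["results"][0].score or 0,
)[:5]
for result in worst:
    print("Low-score run:", result["run"].id)
```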
For more information on running evaluations without uploading results, refer to [Run an evaluation locally](/langsmith/local).

## Related

- [Evaluate your LLM application](/langsmith/evaluate-llm-application)
- [Run an evaluation locally](/langsmith/local)
- [Fetch performance metrics from an experiment](/langsmith/fetch-perf-metrics-experiment)
