From dd393d833f48563e6afe10428c697d216005cf4d Mon Sep 17 00:00:00 2001 From: Waqas Javed <7674577+w-javed@users.noreply.github.com> Date: Fri, 3 Oct 2025 15:33:31 -0700 Subject: [PATCH 01/12] linkfix --- .../samples/evaluator_catalog/README.md | 134 ++++++++++++++++++ 1 file changed, 134 insertions(+) create mode 100644 sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md diff --git a/sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md b/sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md new file mode 100644 index 000000000000..13911e134c90 --- /dev/null +++ b/sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md @@ -0,0 +1,134 @@ +

# How to publish a new built-in evaluator in the Evaluator Catalog

This guide helps our partners bring their evaluators into the Microsoft-provided Evaluator Catalog in the Next Gen UI.

## Context

We are building an Evaluator Catalog that will store the built-in evaluators provided by Microsoft as well as evaluators provided by 1P/3P customers.

We are also building an Evaluators CRUD API and SDK experience that external customers can use to create custom evaluators. The NextGen UI will leverage these new APIs to list evaluators in the Evaluation section.

Custom evaluators are also stored in the Evaluator Catalog, but at Ignite their scope will be limited to the project level. Post Ignite, we'll allow customers to share their evaluators across projects.

The Evaluator Catalog is backed by the Generic Asset Service, which provides versioning as well as scalable, multi-region storage for all assets in CosmosDB.

Types of built-in evaluators
There are four types of evaluators supported as built-in evaluators:

1. Code based - contains a Python file.
2. Code + prompt based - contains a Python file and a Prompty file.
3. Prompt based - contains only a Prompty file.
4. Service based - references an evaluator in the RAI Service that calls fine-tuned models provided by the Data Science Team.

## Step 1: Run your evaluator with the Evaluation SDK

Create the built-in evaluator and use the azure-ai-evaluation SDK to run it locally; a minimal local run is sketched at the end of Step 2 below.
The list of existing evaluators can be found [here](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators).

## Step 2: Provide your evaluator

All built-in evaluators are stored in the azureml-assets repo. Provide your evaluator files by creating a PR in that repo, following these steps:

1. Add a new folder named after the evaluator.

2. Include the following files:

* asset.yaml
* spec.yaml
* an 'evaluator' folder.

The 'evaluator' folder contains two files:
1. A Python file named after the evaluator, with an '_' prefix.
2. A Prompty file named after the evaluator, with a .prompty extension.

Example: the Coherence evaluator contains two files:
_coherence.py
coherence.prompty

Please look at the existing built-in evaluators for reference:
* Evaluator Catalog repo: [/assets/evaluators/builtin](https://github.com/Azure/azureml-assets/tree/main/assets/evaluators/builtin)
* Sample PR: [PR 1816050](https://msdata.visualstudio.com/Vienna/_git/azureml-asset/pullrequest/1816050)

3. Copy asset.yaml from the sample; no change is required.

4. Follow the guidance in the next section to create spec.yaml.
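Before moving on to spec.yaml, here is the minimal local run referenced in Step 1, using the Bleu score evaluator that the example spec in the next section describes. Treat the exact import path, constructor arguments, and result keys as assumptions to verify against your installed azure-ai-evaluation version and the evaluator list linked in Step 1.

```python
# Minimal sketch of a Step 1 local run; verify names against the installed SDK version.
from azure.ai.evaluation import BleuScoreEvaluator

# Built-in evaluators are plain callables: construct once, then invoke per data row.
bleu = BleuScoreEvaluator()
result = bleu(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.",
)

# The returned dict carries the metric(s) this evaluator emits (here, a BLEU-based score).
print(result)
```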
## Asset Content - spec.yaml

| Asset Property | API Property | Example | Description |
| - | - | - | - |
| type | type | evaluator | Always 'evaluator'; it identifies the type of the asset. |
| name | name | test.f1_score | Name of the evaluator (always used in the URL). |
| version | version | 1 | Auto-incremented version number, starting at 1. |
| displayName | display name | F1 Score | The name of the evaluator shown in the UI. |
| description | description | | Description of the evaluator. |
| evaluatorType | evaluator_type | "builtin" | For built-in evaluators the value is "builtin"; for custom evaluators it is "custom". The API only supports 'custom'. |
| evaluatorSubType | definition.type | "code" | The kind of evaluator this is. For type #1 and #2 evaluators use "code", for type #3 use "prompt", and for type #4 use "service". |
| categories | categories | ["Quality"] | The categories of the evaluator, as an array. Allowed values are Quality, Safety, and Agents; multiple values are allowed. |
| initParameterSchema | init_parameters | | The JSON schema (Draft 2020-12) for the evaluator's input parameters. This includes keywords such as type, properties, and required. |
| dataMappingSchema | data_schema | | The JSON schema (Draft 2020-12) for the evaluator's input data. This includes keywords such as type, properties, and required. |
| outputSchema | metrics | | List of output metrics produced by this evaluator. |
| path | Not exposed in the API | ./evaluator | Fixed value. |

Example:

```yml

type: "evaluator"
name: "test.bleu_score"
version: 1
displayName: "Bleu-Score-Evaluator"
description: "| | |\n| -- | -- |\n| Score range | Float [0-1]: higher means better quality. |\n| What is this metric? | BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text. |\n| How does it work? | The BLEU score calculates the geometric mean of the precision of n-grams between the model-generated text and the reference text, with an added brevity penalty for shorter generated text. The precision is computed for unigrams, bigrams, trigrams, etc., depending on the desired BLEU score level. The more n-grams that are shared between the generated and reference texts, the higher the BLEU score. |\n| When to use it? | The recommended scenario is Natural Language Processing (NLP) tasks. It's widely used in text summarization and text generation use cases. |\n| What does it need as input? | Response, Ground Truth |\n"
evaluatorType: "builtin"
evaluatorSubType: "code"
categories: ["quality"]
initParameterSchema:
  type: "object"
  properties:
    threshold:
      type: "number"
      minimum: 0
      maximum: 1
      multipleOf: 0.1
  required: ["threshold"]
dataMappingSchema:
  type: "object"
  properties:
    ground_truth:
      type: "string"
    response:
      type: "string"
  required: ["ground_truth", "response"]
outputSchema:
  bleu:
    type: "continuous"
    desirable_direction: "increase"
    min_value: 0
    max_value: 1
path: ./evaluator
```

## Step 3: Test in the RAI Service ACA code

Once the PR is reviewed and merged, the Evaluation Team will run your evaluator files in ACA to make sure no errors are found. You also need to provide jsonl dataset files for testing (see the sketch below).
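To illustrate the jsonl dataset files mentioned in Step 3, the sketch below writes two rows whose fields match the `dataMappingSchema` of the example spec.yaml (ground_truth and response). The file name and row contents are placeholders for illustration; a real dataset should exercise whatever fields your evaluator's schema declares.

```python
# Illustrative only: build a tiny jsonl test dataset whose rows match the
# dataMappingSchema of the example spec.yaml above (ground_truth + response).
import json

rows = [
    {"ground_truth": "The capital of Japan is Tokyo.", "response": "Tokyo is the capital of Japan."},
    {"ground_truth": "Water boils at 100 degrees Celsius at sea level.", "response": "At sea level, water boils at 100 C."},
]

# jsonl means one standalone JSON object per line.
with open("bleu_score_test_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```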
## Step 4: Publish on Dev Registry (Azureml-dev)
The Evaluation Team will kick off the CI pipeline to publish the evaluator to the Evaluator Catalog in the azureml-dev (dev) registry.

## Step 5: Test in the INT environment
The team will verify the following:

1. Verify that the new evaluator is available in the Evaluator REST APIs.
2. Verify that it is rendered correctly in the NextGen UI.
3. Verify that both Evaluation APIs (Eval Run and OpenAI Eval) can reference the evaluator from the Evaluator Catalog and run it in ACA.

## Step 6: Publish on Prod Registry (Azureml)
The Evaluation Team will kick off the CI pipeline again to publish the evaluator to the Evaluator Catalog in the azureml (prod) registry.

## Step 7: Test in the Prod environment
The team will verify the following items:

1. Verify that the new evaluator is available in the Evaluator REST APIs.
2. Verify that it is rendered correctly in the NextGen UI.
3. Verify that both Evaluation APIs (Eval Run and OpenAI Eval) can reference the evaluator from the Evaluator Catalog and run it in ACA.

From 8e5cb87624b78ef0ec0f01a1c5cc33df066ab7bf Mon Sep 17 00:00:00 2001 From: Sydney Lister Date: Wed, 5 Nov 2025 15:27:45 -0500 Subject: [PATCH 02/12] Update changelog and version --- sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | 13 +++++++++++++ .../azure/ai/evaluation/_version.py | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index e6cf6e4427d1..67916186104c 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -1,5 +1,18 @@ # Release History

+## 1.13.1 (Unreleased)
+
+### Features Added
+
+- Improved RedTeam coverage across risk sub-categories to ensure comprehensive security testing
+- Made RedTeam's `AttackStrategy.Tense` seed prompts dynamic to allow use of this strategy with additional risk categories
+- Refactored error handling and result semantics in the RedTeam evaluation system to improve clarity and align with Attack Success Rate (ASR) conventions (passed=False means attack success)
+
+### Bugs Fixed
+
+- Fixed RedTeam evaluation error related to context handling for context-dependent risk categories
+- Fixed RedTeam prompt application for model targets during Indirect Jailbreak XPIA (Cross-Platform Indirect Attack)
+
 ## 1.13.0 (2025-10-30)

### Features Added

diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py index 1f50f4bb4803..f9acb65d93f5 100644 --- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py +++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py @@ -3,4 +3,4 @@ # --------------------------------------------------------- # represents upcoming version -VERSION = "1.13.0" +VERSION = "1.13.1"

From ecf44a88734c6eb6f17e85bb8bf1d2ea5cc2f24e Mon Sep 17 00:00:00 2001 From: Sydney Lister Date: Wed, 5 Nov 2025 20:39:47 -0500 Subject: [PATCH 03/12] disable depends --- sdk/evaluation/azure-ai-evaluation/pyproject.toml | 1 + 1 file changed, 1 insertion(+) diff --git a/sdk/evaluation/azure-ai-evaluation/pyproject.toml b/sdk/evaluation/azure-ai-evaluation/pyproject.toml index 59c0eba3e7f8..dcdfae675396 100644 --- a/sdk/evaluation/azure-ai-evaluation/pyproject.toml +++ b/sdk/evaluation/azure-ai-evaluation/pyproject.toml @@ -5,6 +5,7 @@ pylint = false black = true verifytypes = false whl_no_aio = false +depends = false

[tool.isort] profile = "black"

From 
1d6fe0e9758ed7d300d350a9632b7806e9360161 Mon Sep 17 00:00:00 2001 From: Sydney Lister Date: Wed, 5 Nov 2025 21:03:01 -0500 Subject: [PATCH 04/12] add changelog date re-enable depends --- sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | 2 +- sdk/evaluation/azure-ai-evaluation/pyproject.toml | 1 - 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index 67916186104c..4824a6589d50 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -1,6 +1,6 @@ # Release History -## 1.13.1 (Unreleased) +## 1.13.1 (2025-11-05) ### Features Added diff --git a/sdk/evaluation/azure-ai-evaluation/pyproject.toml b/sdk/evaluation/azure-ai-evaluation/pyproject.toml index dcdfae675396..59c0eba3e7f8 100644 --- a/sdk/evaluation/azure-ai-evaluation/pyproject.toml +++ b/sdk/evaluation/azure-ai-evaluation/pyproject.toml @@ -5,7 +5,6 @@ pylint = false black = true verifytypes = false whl_no_aio = false -depends = false [tool.isort] profile = "black" From cb904f22c9775ce992105f0f9c2de876613d1bd7 Mon Sep 17 00:00:00 2001 From: Sydney Lister Date: Wed, 5 Nov 2025 21:05:55 -0500 Subject: [PATCH 05/12] re-disable depends --- sdk/evaluation/azure-ai-evaluation/pyproject.toml | 1 + 1 file changed, 1 insertion(+) diff --git a/sdk/evaluation/azure-ai-evaluation/pyproject.toml b/sdk/evaluation/azure-ai-evaluation/pyproject.toml index 59c0eba3e7f8..dcdfae675396 100644 --- a/sdk/evaluation/azure-ai-evaluation/pyproject.toml +++ b/sdk/evaluation/azure-ai-evaluation/pyproject.toml @@ -5,6 +5,7 @@ pylint = false black = true verifytypes = false whl_no_aio = false +depends = false [tool.isort] profile = "black" From ed132403d038876647d43f3811bed82221a2c943 Mon Sep 17 00:00:00 2001 From: Sydney Lister Date: Fri, 7 Nov 2025 11:58:41 -0500 Subject: [PATCH 06/12] Hotfix azure-ai-evaluation 1.13.2 --- .../azure-ai-evaluation/CHANGELOG.md | 6 ++ .../azure/ai/evaluation/_version.py | 2 +- .../azure/ai/evaluation/red_team/_red_team.py | 6 +- .../evaluation/red_team/_result_processor.py | 90 +++++++++++++++++++ .../azure-ai-evaluation/pyproject.toml | 1 - 5 files changed, 101 insertions(+), 4 deletions(-) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index 4824a6589d50..e06eeb5dcbb3 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -1,5 +1,11 @@ # Release History +## 1.13.2 (2025-11-07) + +### Bugs Fixed + +- Added App Insights redaction for agent safety run telemetry so adversarial prompts are not stored in collected logs. 
+ ## 1.13.1 (2025-11-05) ### Features Added diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py index f9acb65d93f5..33ba8b031100 100644 --- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py +++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py @@ -3,4 +3,4 @@ # --------------------------------------------------------- # represents upcoming version -VERSION = "1.13.1" +VERSION = "1.13.2" diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py index a658a7ce9951..1081d4e4ddac 100644 --- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py +++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py @@ -1640,9 +1640,11 @@ async def _finalize_results( # Extract AOAI summary for passing to MLflow logging aoai_summary = red_team_result.scan_result.get("AOAI_Compatible_Summary") if self._app_insights_configuration: - emit_eval_result_events_to_app_insights( - self._app_insights_configuration, aoai_summary["output_items"]["data"] + # Get redacted results from the result processor for App Insights logging + redacted_results = self.result_processor.get_app_insights_redacted_results( + aoai_summary["output_items"]["data"] ) + emit_eval_result_events_to_app_insights(self._app_insights_configuration, redacted_results) # Log results to MLFlow if not skipping upload if not skip_upload: self.logger.info("Logging results to AI Foundry") diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py index dd9a922fc0f8..6aa03ea2a76e 100644 --- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py +++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py @@ -7,6 +7,7 @@ This module handles the processing, aggregation, and formatting of red team evaluation results. """ +import copy import hashlib import json import math @@ -18,6 +19,8 @@ import pandas as pd +from azure.ai.evaluation._common.constants import EvaluationMetrics + # Local imports from ._red_team_result import ( RedTeamResult, @@ -1616,3 +1619,90 @@ def _build_results_payload( } return run_payload + + def get_app_insights_redacted_results(self, results: List[Dict]) -> List[Dict]: + """ + Creates a redacted copy of results specifically for App Insights logging. + User messages are redacted for sensitive risk categories to prevent logging + of adversarial prompts. 
+ + Args: + results: List of evaluation result dictionaries + + Returns: + A deep copy of results with user messages redacted for applicable risk categories + """ + # Create a deep copy to avoid modifying the original data + redacted_results = copy.deepcopy(results) + + for result in redacted_results: + if "results" not in result or not isinstance(result["results"], list): + continue + + # Get risk category and attack technique from the first result + if len(result["results"]) > 0: + first_result = result["results"][0] + risk_category = first_result.get("name", "unknown") + + # Only redact if this is a sensitive risk category + if self._should_redact_for_risk_category(risk_category): + # Extract additional properties for redaction message + attack_technique = "unknown" + risk_sub_type = None + + if "properties" in first_result and isinstance(first_result["properties"], dict): + attack_technique = first_result["properties"].get("attack_technique", "unknown") + risk_sub_type = first_result["properties"].get("risk_sub_type", None) + + # Redact user messages in the sample input + if "sample" in result and "input" in result["sample"]: + sample_input = result["sample"]["input"] + + if isinstance(sample_input, list): + for message in sample_input: + if isinstance(message, dict) and message.get("role") == "user": + message["content"] = self._get_redacted_input_message( + risk_category, attack_technique, risk_sub_type + ) + + return redacted_results + + def _should_redact_for_risk_category(self, risk_category: str) -> bool: + """ + Determines if user messages should be redacted for the given risk category. + + Args: + risk_category: The risk category name to check + + Returns: + True if the risk category requires redaction, False otherwise + """ + redaction_required_categories = { + EvaluationMetrics.PROHIBITED_ACTIONS, + EvaluationMetrics.TASK_ADHERENCE, + EvaluationMetrics.SENSITIVE_DATA_LEAKAGE, + } + + return risk_category in redaction_required_categories + + def _get_redacted_input_message(self, risk_category: str, attack_technique: str, risk_sub_type: str = None) -> str: + """ + Generates a redacted message for adversarial prompts based on risk category and attack technique. 
+ + Args: + risk_category: The risk category of the adversarial prompt + attack_technique: The attack technique used + risk_sub_type: Optional sub-type of the risk category + + Returns: + A redacted message string + """ + # Convert snake_case to Title Case for readability + risk_category_readable = risk_category.replace("_", " ").replace("-", " ").title() + attack_technique_readable = attack_technique.replace("_", " ").replace("-", " ").title() + + if risk_sub_type: + risk_sub_type_readable = risk_sub_type.replace("_", " ").replace("-", " ").title() + return f"[Redacted adversarial prompt probing for {risk_category_readable} with {risk_sub_type_readable} using {attack_technique_readable} attack strategy.]" + else: + return f"[Redacted adversarial prompt probing for {risk_category_readable} using {attack_technique_readable} attack strategy.]" diff --git a/sdk/evaluation/azure-ai-evaluation/pyproject.toml b/sdk/evaluation/azure-ai-evaluation/pyproject.toml index dcdfae675396..59c0eba3e7f8 100644 --- a/sdk/evaluation/azure-ai-evaluation/pyproject.toml +++ b/sdk/evaluation/azure-ai-evaluation/pyproject.toml @@ -5,7 +5,6 @@ pylint = false black = true verifytypes = false whl_no_aio = false -depends = false [tool.isort] profile = "black" From e12b9e4814e48c6c13545c8e52324d4603ba882e Mon Sep 17 00:00:00 2001 From: Waqas Javed <7674577+w-javed@users.noreply.github.com> Date: Sat, 8 Nov 2025 15:43:51 -0800 Subject: [PATCH 07/12] version change --- sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index 57c713194b70..c82de7b8e9d8 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -1,5 +1,12 @@ # Release History +## 1.13.3 (2025-11-08) + +### Bugs Fixed + +- Added hardcoded "scenario": "redteam" field to the query_response dictionary for all red team conversation evaluations +- Extract the scenario field from data and include it in the evaluation properties, following the same pattern as category and taxonomy fields + ## 1.13.2 (2025-11-07) ### Bugs Fixed From 07074c1e839ddf137274d45fca3449b181f962a3 Mon Sep 17 00:00:00 2001 From: Waqas Javed <7674577+w-javed@users.noreply.github.com> Date: Sat, 8 Nov 2025 15:48:31 -0800 Subject: [PATCH 08/12] version change --- sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index c82de7b8e9d8..a234f9fb9367 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -4,8 +4,9 @@ ### Bugs Fixed -- Added hardcoded "scenario": "redteam" field to the query_response dictionary for all red team conversation evaluations -- Extract the scenario field from data and include it in the evaluation properties, following the same pattern as category and taxonomy fields +- Added redteam scenario for red team evaluations and extracted to properties, following the same pattern as category and taxonomy fields +- Added _is_primary_metric function and integrated primary metric filtering into _calculate_aoai_evaluation_summary +- Updated documentation for _EvaluatorMetricMapping and reordered rouge_score metrics ## 1.13.2 (2025-11-07) From ce5ea2d28445645baa392c410000ec817e6bf9b5 Mon Sep 17 00:00:00 2001 From: Sydney Lister Date: Mon, 10 Nov 2025 
22:24:55 -0500 Subject: [PATCH 09/12] Update changelog for 1.13.5 release --- sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index 5ec3319ef162..bab53a929cdb 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -1,17 +1,11 @@ # Release History -## 1.13.5 (Unreleased) - -### Features Added - -### Breaking Changes +## 1.13.5 (2025-11-10) ### Bugs Fixed - **TaskAdherenceEvaluator:** treat tool definitions as optional so evaluations with only query/response inputs no longer raise “Either 'conversation' or individual inputs must be provided.” -### Other Changes - ## 1.13.4 (2025-11-10) ### Bugs Fixed From 5fc885a64c59303793bdc3aade987db4f9420d7a Mon Sep 17 00:00:00 2001 From: Waqas Javed <7674577+w-javed@users.noreply.github.com> Date: Wed, 12 Nov 2025 20:45:50 -0800 Subject: [PATCH 10/12] Eval-SDK-Hotfix-1-13-6 --- sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md index 5ec3319ef162..7aad8ed0975c 100644 --- a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md +++ b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md @@ -1,5 +1,13 @@ # Release History +## 1.13.6 (2025-11-12) + +### Bugs Fixed + +- Added detection and retry handling for network errors wrapped in generic exceptions with "Error sending prompt with conversation ID" message +- Fix results for ungrounded_attributes +- score_mode grader improvements + ## 1.13.5 (Unreleased) ### Features Added From fe12602506125ee58f0f32b90f48284f5a1c922c Mon Sep 17 00:00:00 2001 From: Waqas Javed <7674577+w-javed@users.noreply.github.com> Date: Thu, 13 Nov 2025 10:52:08 -0800 Subject: [PATCH 11/12] fix --- .../samples/evaluator_catalog/README.md | 134 ------------------ 1 file changed, 134 deletions(-) delete mode 100644 sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md diff --git a/sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md b/sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md deleted file mode 100644 index 13911e134c90..000000000000 --- a/sdk/evaluation/azure-ai-evaluation/samples/evaluator_catalog/README.md +++ /dev/null @@ -1,134 +0,0 @@ - -# How to publish new built-in evaluator in Evaluator Catalog. - -This guide helps our partners to bring their evaluators into Microsoft provided Evaluator Catalog in Next Gen UI. - -## Context - -We are building an Evaluator Catalog, that will allow us to store built-in evaluators (provided by Microsoft), as well as 1P/3P customer's provided evaluators. - -We are also building Evaluators CRUD API and SDK experience which can be used by our external customer to create custom evaluators. NextGen UI will leverage these new APIs to list evaluators in Evaluation Section. - -These custom evaluators are also stored in the Evaluator Catalog, but the scope these evaluator will be at project level at Ignite. Post Ignite, we'll allow customers to share their evaluators among different projects. - -This evaluator catalog is backed by Generic Asset Service (that provides versioning support as well as scalable and multi-region support to store all your assets in CosmosDB). - -Types of Built_in Evaluators -There are 3 types of evaluators we support as Built-In Evaluators. - -1. 
Code Based - It contains Python file -2. Code + Prompt Based - It contains Python file & Prompty file -3. Prompt Based - It contains only Prompty file. -4. Service Based - It references the evaluator in RAI Service that calls fine tuned models provided by Data Science Team. - -## Step 1: Run Your evaluators with Evaluation SDK. - -Create builtin evaluator and use azure-ai-evaluation SDK to run locally. -List of evaluators can be found at [here](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators) - -## Step 2: Provide your evaluator - -We are storing all the builtin evaluators in Azureml-asset Repo. Please provide your evaluators files by creating a PR in this repo. Please follow the steps. - -1. Add a new folder with name as the Evaluator name. - -2. Please include following files. - -* asset.yaml -* spec.yaml -* 'evaluator' folder. - -This 'evaluator' folder contains two files. -1. Python file name should be same as evaluator name with '_' prefix. -2. Prompty file name should be same as evaluator name with .prompty extension. - -Example: Coherence evaluator contains 2 files. -_coherence.py -coherence.prompty - -Please look at existing built-in evaluators for reference. -* Evaluator Catalog Repo : [/assets/evaluators/builtin](https://github.com/Azure/azureml-assets/tree/main/assets/evaluators/builtin) -* Sample PR: [PR 1816050](https://msdata.visualstudio.com/Vienna/_git/azureml-asset/pullrequest/1816050) - -3. Please copy asset.yaml from sample. No change is required. - -4. Please follow steps given below to create spec.yaml. - -## Asset Content - spec.yaml - -| Asset Property | API Property | Example | Description | -| - | - | - | - | -| type | type | evaluator | It is always 'evaluator'. It identifies type of the asset. | -| name | name | test.f1_score| Name of the evaluator, alway in URL | -| version | version | 1 | It is auto incremented version number, starts with 1 | -| displayName: | display name | F1 Score | It is the name of the evaluator shown in UI | -| description: | description | | This is description of the evaluator. | -| evaluatorType: | evaluator_type | "builtin"| For Built-in evaluators, value is "builtin". For custom evaluators, value is "custom". API only supports 'custom'| -| evaluatorSubType | definition.type | "code" | It represents what type of evaluator It is. For #1 & #2 type evaluators, please add "code". For #3 type evaluator, please provide "prompt". For #4 type evaluator, please provide "service" | -| categories | categories | ["Quality"] | The categories of the evaluator. It's an array. Allowed values are Quality, Safety, Agents. Multiple values are allowed | -| initParameterSchema | init_parameters | | The JSON schema (Draft 2020-12) for the evaluator's input parameters. This includes parameters like type, properties, required. | -| dataMappingSchema | data_schema | | The JSON schema (Draft 2020-12) for the evaluator's input data. This includes parameters like type, properties, required. | -| outputSchema | metrics | | List of output metrics produced by this evaluator | -| path | Not expose in API | ./evaluator | Fixed. | - -Example: - -```yml - -type: "evaluator" -name: "test.bleu_score" -version: 1 -displayName: "Bleu-Score-Evaluator" -description: "| | |\n| -- | -- |\n| Score range | Float [0-1]: higher means better quality. |\n| What is this metric? | BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. 
It measures how closely the generated text matches the reference text. |\n| How does it work? | The BLEU score calculates the geometric mean of the precision of n-grams between the model-generated text and the reference text, with an added brevity penalty for shorter generated text. The precision is computed for unigrams, bigrams, trigrams, etc., depending on the desired BLEU score level. The more n-grams that are shared between the generated and reference texts, the higher the BLEU score. |\n| When to use it? | The recommended scenario is Natural Language Processing (NLP) tasks. It's widely used in text summarization and text generation use cases. |\n| What does it need as input? | Response, Ground Truth |\n" -evaluatorType: "builtin" -evaluatorSubType: "code" -categories: ["quality"] -initParameterSchema: - type: "object" - properties: - threshold: - type: "number" - minimum: 0 - maximum: 1 - multipleOf: 0.1 - required: ["threshold"] -dataMappingSchema: - type: "object" - properties: - ground_truth: - type: "string" - response: - type: "string" - required: ["ground_truth", "response"] -outputSchema: - bleu: - type: "continuous" - desirable_direction: "increase" - min_value: 0 - max_value: 1 -path: ./evaluator -``` - -## Step 3: Test in RAI Service ACA Code. - -Once PR is reviewed and merged, Evaluation Team will use your evaluator files to run them in ACA to make sure no errors found. You also need to provide jsonl dataset files for testing. - -## Step 4: Publish on Dev Registry (Azureml-dev) -Evaluation Team will kick off the CI Pipeline to publish evaluator in the Evaluator Catalog in azureml-dev (dev) registry. - -## Step 5: Test is INT Environment -Team will verify following: - -1. Verify if new evaluator is available in Evaluator REST APIs. -2. Verify if there are rendered correctly in NextGen UI. -3. Verify if Evaluation API (Eval Run and Open AI Eval) both are able to reference these evaluators from Evaluator Catalog and run in ACA. - -## Step 6: Publish on Prod Registry (Azureml) -Evaluation Team will be able to kick off the CI Pipeline again to publish evaluator in the Evaluator Catalog in azureml (prod) registry. - -## Step 7: Test is Prod Environment -Team will verify following items: - -1. Verify if new evaluator is available in Evaluator REST APIs. -2. Verify if there are rendered correctly in NextGen UI. -3. Verify if Evaluation API (Eval Run and Open AI Eval) both are able to reference these evaluators from Evaluator Catalog and run in ACA. From c1af25e08576c8d93443f3e0e309004f355373cf Mon Sep 17 00:00:00 2001 From: Waqas Javed <7674577+w-javed@users.noreply.github.com> Date: Thu, 13 Nov 2025 13:26:04 -0800 Subject: [PATCH 12/12] add version --- .../azure-ai-evaluation/azure/ai/evaluation/_version.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py index 34f6c88bb7c6..bae6572dac32 100644 --- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py +++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py @@ -3,4 +3,4 @@ # --------------------------------------------------------- # represents upcoming version -VERSION = "1.13.5" +VERSION = "1.13.6"