
Add notebooks demonstrating standalone ATIF evaluation via Python API#1752

Merged
rapids-bot[bot] merged 11 commits into NVIDIA:develop from yczhang-nv:yuchen-eval-python-api-notebook
Mar 7, 2026

Conversation

@yczhang-nv
Contributor

@yczhang-nv yczhang-nv commented Mar 5, 2026

Description

  • Add eval_atif_standalone.ipynb — demonstrates using EvaluationHarness with builder-constructed RAGAS evaluators against ATIF trajectories, no YAML config needed
  • Add eval_atif_custom_evaluator.ipynb — demonstrates fully standalone evaluation with inline custom evaluators (exact match + tool count), requiring only nvidia-nat-eval with no LLMs, no API keys, and no evaluator plugins
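The second notebook centers on a pair of tiny evaluators. As a rough illustration only (plain Python, not the `nvidia-nat-eval` API — function names, the `normalize_case` flag, and the trajectory dict shape here are hypothetical stand-ins), the scoring logic behind an exact-match and a tool-count evaluator looks like:

```python
# Illustrative sketch of the two inline evaluators' scoring logic
# (ExactMatchEvaluator, ToolCountEvaluator). Assumed, simplified
# trajectory shape: {"steps": [{"tool_calls": [...]}, ...]}.

def exact_match_score(expected: str, actual: str, normalize_case: bool = True) -> float:
    """Return 1.0 if the answers match exactly (optionally case-folded), else 0.0."""
    if normalize_case:
        expected, actual = expected.lower(), actual.lower()
    return 1.0 if expected.strip() == actual.strip() else 0.0

def count_tool_calls(trajectory: dict) -> int:
    """Count tool calls across all steps of a trajectory."""
    return sum(len(step.get("tool_calls", [])) for step in trajectory.get("steps", []))

trajectory = {
    "steps": [
        {"tool_calls": [{"name": "search"}]},
        {"tool_calls": [{"name": "search"}, {"name": "calculator"}]},
        {"tool_calls": []},
    ]
}
print(exact_match_score("Paris", "paris"))  # 1.0
print(count_tool_calls(trajectory))         # 3
```

The real notebook wraps equivalent logic in evaluator classes that `EvaluationHarness` can call directly, with no registry or YAML involved.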

Closes

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Documentation
    • Added two runnable example notebooks demonstrating standalone ATIF evaluation: one showing a programmatic workflow (plugin registration, builder-driven evaluators, harnessed runs, and resource cleanup) and one demonstrating direct use of custom evaluators with local development guidance.
  • New Features
    • Included example custom evaluators for exact-match scoring and tool-call counting, plus examples showing programmatic configuration and execution of evaluations without YAML.

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv self-assigned this Mar 5, 2026
@yczhang-nv yczhang-nv added the "DO NOT MERGE" (PR should not be merged; see PR for details) label Mar 5, 2026
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coderabbitai

coderabbitai bot commented Mar 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds two new example Jupyter notebooks demonstrating standalone ATIF evaluation: one shows programmatic evaluator construction, plugin registration and registry-based resolution; the other shows inline custom evaluators (ExactMatchEvaluator, ToolCountEvaluator) used directly with EvaluationHarness.

Changes

Cohort / File(s) Summary
Standalone ATIF notebook
examples/notebooks/eval_atif_standalone.ipynb
New notebook demonstrating a programmatic ATIF evaluation workflow: dependency installation, ATIF trajectory definition and Pydantic parsing, construction of AtifEvalSample objects, programmatic evaluator Config construction (no YAML), plugin registration/discovery, evaluator resolution via a WorkflowEvalBuilder/registry, and running EvaluationHarness.
Custom evaluator notebook
examples/notebooks/eval_atif_custom_evaluator.ipynb
New notebook adding two inline ATIF-native evaluator implementations (ExactMatchEvaluator, ToolCountEvaluator) and demonstrating direct use of EvaluationHarness with these evaluators (no registry/YAML), including sample trajectories, evaluation invocation, and result printing.
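The "ATIF trajectory definition and Pydantic parsing" step from the standalone notebook can be sketched generically. The models below are hypothetical stand-ins, not the real ATIF schema shipped by the toolkit; the point is the `model_validate()` pattern the notebook uses to turn a raw trajectory dict into a validated object:

```python
# Minimal Pydantic v2 sketch (hypothetical schema, not the real ATIF models):
# parse a raw trajectory dict; malformed input raises ValidationError.
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    name: str

class Step(BaseModel):
    tool_calls: list[ToolCall] = []

class Trajectory(BaseModel):
    session_id: str
    steps: list[Step] = []

raw = {"session_id": "demo-1", "steps": [{"tool_calls": [{"name": "search"}]}]}
traj = Trajectory.model_validate(raw)
print(traj.session_id, len(traj.steps))  # demo-1 1

try:
    Trajectory.model_validate({"steps": []})  # missing required session_id
except ValidationError:
    print("validation failed as expected")
```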

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User as "User / Notebook"
    participant Builder as "WorkflowEvalBuilder\n(type registry)"
    participant Harness as "EvaluationHarness"
    participant Eval as "Evaluator\n(ExactMatch / ToolCount / plugin)"
    participant ATIF as "ATIF Models\n(AtifEvalSample)"

    rect rgba(200,230,255,0.5)
    User->>ATIF: define ATIF trajectories\nand build AtifEvalSample(s)
    end

    rect rgba(200,255,200,0.5)
    User->>Builder: register/discover plugins\nor provide evaluators directly
    Builder->>Harness: resolve evaluator instances
    end

    rect rgba(255,230,200,0.5)
    User->>Harness: invoke run(evaluator(s), AtifEvalSampleList)
    Harness->>Eval: call evaluate_atif_fn(atif_samples)
    Eval-->>Harness: return EvalOutput (items, average)
    Harness-->>User: aggregated evaluation results
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title clearly and accurately summarizes the main changes: two notebooks demonstrating standalone ATIF evaluation using the Python API. It is concise (69 characters), descriptive, and uses imperative mood.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; check skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.

…on-api-notebook

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv added the "doc" (Improvements or additions to documentation) and "non-breaking" (Non-breaking change) labels and removed the "DO NOT MERGE" (PR should not be merged; see PR for details) label Mar 6, 2026
@yczhang-nv yczhang-nv marked this pull request as ready for review March 6, 2026 23:45
@yczhang-nv yczhang-nv requested a review from a team as a code owner March 6, 2026 23:45
@yczhang-nv yczhang-nv marked this pull request as draft March 6, 2026 23:45
@yczhang-nv yczhang-nv changed the title from "[DO NOT MERGE] Experimental notebook for nat eval Python API" to "Experimental notebook for EvaluationHarness to run with ATIF-native trajectory" Mar 6, 2026
@coderabbitai coderabbitai bot added the "DO NOT MERGE" (PR should not be merged; see PR for details) label Mar 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/notebooks/eval_atif_standalone.ipynb (4)

7-23: Consider documenting or validating NVIDIA_API_KEY usage.

The requirements mention that NVIDIA_API_KEY is needed (line 23), but the code never explicitly references or validates this environment variable. While the NIM client likely reads it automatically, users might benefit from:

  1. A code cell that validates the environment variable is set before running evaluations
  2. Documentation explaining that the NIM client reads this automatically from the environment
🔍 Example validation cell

Add after the installation cell:

import os

if "NVIDIA_API_KEY" not in os.environ:
    raise EnvironmentError(
        "NVIDIA_API_KEY environment variable is required. "
        "Set it with: export NVIDIA_API_KEY='your-key-here'"
    )
print("✓ NVIDIA_API_KEY is set")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 7 - 23, Add an
explicit runtime check and brief note about NVIDIA_API_KEY in the notebook:
insert a small validation cell after the installation cell that imports os,
verifies "NVIDIA_API_KEY" exists in os.environ, and raises a clear
EnvironmentError with instructions to export the key if missing; also add one
short sentence near the "Requirements" section stating that the NIM client reads
the key from the environment and that users must set NVIDIA_API_KEY before
running evaluations so they see the validation message early.

231-239: Consider adding error handling for validation failures.

The model_validate() call can raise pydantic.ValidationError if the ATIF data is malformed. For a demonstration notebook this is acceptable, but production usage should wrap this in a try-except block to provide clear error messages.

🛡️ Example with error handling
 from nat.data_models.atif import ATIFTrajectory
+from pydantic import ValidationError

-trajectory = ATIFTrajectory.model_validate(atif_trajectory_data)
+try:
+    trajectory = ATIFTrajectory.model_validate(atif_trajectory_data)
+except ValidationError as e:
+    print(f"Failed to validate ATIF trajectory: {e}")
+    raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 231 - 239, Wrap
the ATIFTrajectory.model_validate(atif_trajectory_data) call in a try-except
that catches pydantic.ValidationError; on failure print or log a clear error
message including the exception details and the raw atif_trajectory_data (or a
summary) and either exit/raise to stop further processing, then only proceed to
use trajectory (e.g. printing trajectory.schema_version, session_id, agent,
steps, final_metrics) if validation succeeded.

39-47: Consider documenting notebook execution prerequisites.

The installation command uses relative paths (../../packages/) and the --no-deps flag, which assumes:

  1. The notebook is executed from its current directory location
  2. Dependencies are already installed or will be resolved by transitive dependencies

Consider adding a note in the preceding markdown cell about the expected execution context (e.g., "Run this notebook from the repository root or ensure the relative paths resolve correctly").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 39 - 47, The
markdown before the pip install cell should state the expected execution context
and dependency assumptions: add a short note above the pip command referencing
the install cell (the line starting with "!uv pip install -q --no-deps -e
../../packages/nvidia_nat_eval -e ../../packages/nvidia_nat_ragas") that tells
users to run the notebook from the repository root (or adjust paths), explains
that --no-deps assumes dependencies are already installed or provided
transitively, and provide the alternative released install command ("!uv pip
install nvidia-nat-eval nvidia-nat-ragas") for users who are not developing
locally.

394-416: Consider using async context manager idiomatically.

The code manually calls __aenter__() and later __aexit__() (in cell 9), which works but isn't the idiomatic pattern. Consider using async with to ensure cleanup happens automatically even if an exception occurs.

♻️ Idiomatic async context manager usage

Replace cells 7, 8, and 9 with a single cell:

from nat.plugins.eval.runtime.builder import WorkflowEvalBuilder
from nat.plugins.eval.evaluator.atif_evaluator import AtifEvaluator
from nat.plugins.eval.runtime.eval_harness import EvaluationHarness

async with WorkflowEvalBuilder(
    general_config=config.general,
    eval_general_config=config.eval.general,
) as eval_builder:
    await eval_builder.populate_builder(config, skip_workflow=True)
    
    # Collect ATIF evaluators
    atif_evaluators = {}
    for name in config.eval.evaluators:
        evaluator_info = eval_builder.get_evaluator(name)
        if isinstance(evaluator_info, AtifEvaluator):
            atif_evaluators[name] = evaluator_info
            print(f"  [ATIF] {name}: {evaluator_info.description}")
    
    print(f"\nATIF-native evaluators ready: {list(atif_evaluators.keys())}")
    
    # Run evaluation
    harness = EvaluationHarness()
    results = await harness.evaluate(
        evaluators=atif_evaluators,
        atif_samples=atif_samples,
    )
    
    print("=" * 60)
    print("Evaluation Results")
    print("=" * 60)
    for evaluator_name, eval_output in results.items():
        print(f"\n--- {evaluator_name} ---")
        print(f"  Average Score: {eval_output.average_score}")
        for item in eval_output.eval_output_items:
            print(f"  Item {item.id}: score={item.score}")
            if item.reasoning:
                reasoning_str = str(item.reasoning)
                print(f"    reasoning: {reasoning_str[:200]}{'...' if len(reasoning_str) > 200 else ''}")

print("Builder resources released.")

This ensures cleanup happens automatically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 394 - 416, The
code currently calls eval_builder.__aenter__() and later __aexit__() manually;
change this to the idiomatic async context manager: use "async with
WorkflowEvalBuilder(general_config=config.general,
eval_general_config=config.eval.general) as eval_builder:" then move the call to
await eval_builder.populate_builder(config, skip_workflow=True) and all logic
that collects AtifEvaluator instances (check isinstance(evaluator_info,
AtifEvaluator)) inside that block, removing the explicit __aenter__/__aexit__
calls; also run the evaluation inside the same block (e.g., create
EvaluationHarness and await harness.evaluate(evaluators=atif_evaluators,
atif_samples=atif_samples)) so cleanup happens automatically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/notebooks/eval_atif_standalone.ipynb`:
- Around line 7-473: Replace all occurrences of the standalone acronym "NAT" in
markdown/documentation text with "NeMo Agent Toolkit" (e.g., change "without
building or running a NAT workflow", "NAT eval can consume", "NAT ships ATIF
Pydantic models", "To use NAT's built-in evaluators") while leaving package
names and code identifiers that include "nat" unchanged and enclosed in
backticks (e.g., `nvidia-nat-eval`, `nvidia-nat-ragas`); update the notebook
title and all markdown cells where the plain token "NAT" appears, but do not
modify any code strings, variable names, or backticked references.
- Around line 1-24: Insert an Apache License 2.0 markdown cell as the very first
cell in the notebook (before the existing "# Standalone ATIF Evaluation with
NeMo Agent Toolkit" title) containing the standard Apache License, Version 2.0
header and a copyright line updated to 2026; ensure the new cell is a markdown
cell so the title and rest of the notebook remain unchanged and that the license
text appears at the top of the notebook.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ec77ff3c-5873-4f39-adc1-bc6204af07a2

📥 Commits

Reviewing files that changed from the base of the PR and between 69c21a4 and 57c7d6c.

📒 Files selected for processing (1)
  • examples/notebooks/eval_atif_standalone.ipynb

@yczhang-nv yczhang-nv removed the "DO NOT MERGE" (PR should not be merged; see PR for details) label Mar 7, 2026
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv changed the title from "Experimental notebook for EvaluationHarness to run with ATIF-native trajectory" to "Add notebooks demonstrating standalone ATIF evaluation via Python API" Mar 7, 2026
@yczhang-nv yczhang-nv marked this pull request as ready for review March 7, 2026 00:33

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
examples/notebooks/eval_atif_custom_evaluator.ipynb (3)

213-215: Consider removing _count_tool_calls from ExactMatchEvaluator.

This method seems out of place in an "exact match" evaluator. While it adds trajectory_tool_call_count to the reasoning dict (line 234), this metric is tangential to exact-match comparison and duplicates functionality already in ToolCountEvaluator. Consider removing it to keep the evaluator focused on its core responsibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_custom_evaluator.ipynb` around lines 213 - 215,
The ExactMatchEvaluator includes a helper _count_tool_calls and adds
trajectory_tool_call_count to the reasoning dict, which duplicates
ToolCountEvaluator and is out of scope for exact-match logic; remove the
_count_tool_calls method from ExactMatchEvaluator and stop adding
trajectory_tool_call_count to the reasoning/result dict (remove the code that
computes/assigns trajectory_tool_call_count inside ExactMatchEvaluator), relying
on ToolCountEvaluator to provide that metric instead.

246-252: Add type hint for sample parameter.

The sample parameter lacks a type annotation. Per coding guidelines, Python methods should use type hints for all parameters.

📝 Suggested fix
-    def _count_tool_calls(self, sample) -> tuple[int, list[str]]:
+    def _count_tool_calls(self, sample: AtifEvalSample) -> tuple[int, list[str]]:

Apply the same fix to ExactMatchEvaluator._count_tool_calls at line 213:

-    def _count_tool_calls(self, sample) -> int:
+    def _count_tool_calls(self, sample: AtifEvalSample) -> int:

As per coding guidelines: "Python methods should use type hints for all parameters and return values."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_custom_evaluator.ipynb` around lines 246 - 252,
The method _count_tool_calls is missing a type annotation for its sample
parameter; add an explicit type hint (e.g., sample: EvalSample or the concrete
sample class used in this notebook/repo, or at minimum sample: Any from typing)
and keep the return annotation tuple[int, list[str]] unchanged; apply the same
change to ExactMatchEvaluator._count_tool_calls as well so both functions
declare the sample parameter type consistently.

7-26: Replace "NAT" acronym with "NeMo Agent Toolkit".

Lines 11 and 22 use "NAT" as an acronym. Per coding guidelines, always spell out "NeMo Agent Toolkit" in documentation unless referring to package names in backticks (e.g., nvidia-nat-eval).

📝 Suggested fix
-        "no NAT workflow, and no evaluator plugins required.\n",
+        "no NeMo Agent Toolkit workflow, and no evaluator plugins required.\n",
-        "- `nvidia-nat-eval` works as a standalone package — no API keys, no LLMs, no plugins\n",
+        "- `nvidia-nat-eval` works as a standalone package — no API keys, no LLMs, no evaluator plugins\n",

As per coding guidelines: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_custom_evaluator.ipynb` around lines 7 - 26,
Replace the bare "NAT" acronym with "NeMo Agent Toolkit" in the notebook text
(e.g., change phrases like "no NAT workflow" and any other plain-text "NAT"
mentions) while leaving package names in backticks (e.g., `nvidia-nat-eval`)
unchanged; update any surrounding wording to keep grammar correct after
substitution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/notebooks/eval_atif_custom_evaluator.ipynb`:
- Line 300: The call to ExactMatchEvaluator uses the wrong constructor parameter
name `case_sensitive`; change it to use the declared parameter `normalize_case`
and preserve the intended semantics (the original used case_sensitive=False to
request case-insensitive matching), so replace that argument with
normalize_case=True in the ExactMatchEvaluator instantiation to avoid the
TypeError.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: edd55269-8cc0-4e76-ac6c-a2348988b73e

📥 Commits

Reviewing files that changed from the base of the PR and between 57c7d6c and cc8f691.

📒 Files selected for processing (1)
  • examples/notebooks/eval_atif_custom_evaluator.ipynb

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/notebooks/eval_atif_custom_evaluator.ipynb`:
- Line 11: Update the notebook markdown prose to replace the standalone acronym
"NAT" with the full phrase "NeMo Agent Toolkit" in the affected markdown cells
(e.g., the string containing "no NAT workflow, and no evaluator plugins
required." and the cell around line 323), while leaving any package or code
identifiers that include "nat" (such as `nvidia-nat-eval`) untouched and
properly backticked; ensure you search for other standalone "NAT" occurrences in
the notebook and apply the same replacement to match the documentation
guideline.
- Around line 241-247: The _count_tool_calls method is missing a type annotation
for its sample parameter; update the signature of _count_tool_calls to annotate
sample as AtifEvalSample (the notebook already imports AtifEvalSample and uses
atif_samples: AtifEvalSampleList), e.g., change def _count_tool_calls(self,
sample) -> tuple[int, list[str]] to include sample: AtifEvalSample so static
type checkers and linters recognize the type while leaving the rest of the
method (steps, tool_calls iteration, return) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5952bc86-12e3-4cbf-aeef-9bb879056a2a

📥 Commits

Reviewing files that changed from the base of the PR and between cc8f691 and 26b2dae.

📒 Files selected for processing (1)
  • examples/notebooks/eval_atif_custom_evaluator.ipynb

Contributor

@AnuradhaKaruppiah AnuradhaKaruppiah left a comment


Looks great!

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
examples/notebooks/eval_atif_custom_evaluator.ipynb (1)

198-200: ⚠️ Potential issue | 🟡 Minor

Import and annotate AtifEvalSample in _count_tool_calls.

sample is still untyped here, which violates the repo rule for Python method signatures.

Suggested fix
-from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSampleList
+from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSample
+from nat.plugins.eval.evaluator.atif_evaluator import AtifEvalSampleList
@@
-    def _count_tool_calls(self, sample) -> tuple[int, list[str]]:
+    def _count_tool_calls(self, sample: AtifEvalSample) -> tuple[int, list[str]]:

As per coding guidelines: Python methods should use type hints for all parameters and return values.

Also applies to: 237-246

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_custom_evaluator.ipynb` around lines 198 - 200,
The _count_tool_calls function currently accepts an untyped parameter named
sample; add a proper type annotation using AtifEvalSample to satisfy the repo's
typing rules: import AtifEvalSample from
nat.plugins.eval.evaluator.atif_evaluator (alongside AtifEvalSampleList) and
change the function signature for _count_tool_calls to accept sample:
AtifEvalSample and annotate the return type as appropriate; apply the same
parameter typing fixes to the other occurrences referenced (around the 237-246
block) so all method signatures are fully typed.
🧹 Nitpick comments (1)
examples/notebooks/eval_atif_standalone.ipynb (1)

305-312: Keep WorkflowEvalBuilder inside a single managed scope.

Calling __aenter__() in one cell and __aexit__() in a later cell leaks resources whenever an intermediate cell fails or is skipped. A single async with block, or at least try/finally, is safer for a demo notebook.

Safer pattern
from nat.plugins.eval.evaluator.atif_evaluator import AtifEvaluator
from nat.plugins.eval.runtime.builder import WorkflowEvalBuilder
from nat.plugins.eval.runtime.eval_harness import EvaluationHarness

async with WorkflowEvalBuilder(
    general_config=config.general,
    eval_general_config=config.eval.general,
) as eval_builder:
    await eval_builder.populate_builder(config, skip_workflow=True)

    atif_evaluators = {}
    for name in config.eval.evaluators:
        evaluator_info = eval_builder.get_evaluator(name)
        if isinstance(evaluator_info, AtifEvaluator):
            atif_evaluators[name] = evaluator_info

    harness = EvaluationHarness()
    results = await harness.evaluate(
        evaluators=atif_evaluators,
        atif_samples=atif_samples,
    )

Also applies to: 380-381

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 305 - 312, The
notebook currently calls WorkflowEvalBuilder.__aenter__() in one cell and
__aexit__() later, which can leak resources; change to use an async context
manager (async with WorkflowEvalBuilder(...) as eval_builder) or wrap the
lifecycle in try/finally so enter/exit are paired in the same scope. Inside that
single managed block, call eval_builder.populate_builder(...) and then build
atif_evaluators via eval_builder.get_evaluator(...) and run
EvaluationHarness.evaluate(...); ensure no manual calls to __aenter__/__aexit__
remain (referenced symbols: WorkflowEvalBuilder, __aenter__, __aexit__,
populate_builder, get_evaluator, EvaluationHarness.evaluate).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/notebooks/eval_atif_standalone.ipynb`:
- Around line 37-45: The editable pip install line currently uses the
"--no-deps" flag which prevents transitive dependencies (like nat-core) from
being installed and breaks later imports; update the notebook cell containing
the command string "!uv pip install -q --no-deps -e
../../packages/nvidia_nat_eval -e ../../packages/nvidia_nat_ragas" to remove the
"--no-deps" flag so dependencies are installed (keep the "-q" and "-e" editable
options intact).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: adcab0e4-1781-4493-912d-515c05db22a9

📥 Commits

Reviewing files that changed from the base of the PR and between 26b2dae and 0b4f3e2.

📒 Files selected for processing (2)
  • examples/notebooks/eval_atif_custom_evaluator.ipynb
  • examples/notebooks/eval_atif_standalone.ipynb

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
examples/notebooks/eval_atif_standalone.ipynb (2)

37-45: ⚠️ Potential issue | 🟠 Major

Remove --no-deps from the editable install command.

Line 42 contradicts Line 39: --no-deps skips the transitive packages that provide the later nat.* imports, so this notebook will not bootstrap correctly from a clean environment.

Suggested fix
-!uv pip install -q --no-deps -e ../../packages/nvidia_nat_eval -e ../../packages/nvidia_nat_ragas
+!uv pip install -q -e ../../packages/nvidia_nat_eval -e ../../packages/nvidia_nat_ragas
#!/bin/bash
python <<'PY'
import json
import pathlib
import tomllib

nb_path = pathlib.Path("examples/notebooks/eval_atif_standalone.ipynb")
nb = json.loads(nb_path.read_text())

for idx, cell in enumerate(nb["cells"]):
    src = "".join(cell.get("source", []))
    if "!uv pip install" in src or "from nat." in src:
        print(f"\n=== cell {idx} ===")
        print(src)

for rel in ("packages/nvidia_nat_eval/pyproject.toml", "packages/nvidia_nat_ragas/pyproject.toml"):
    path = pathlib.Path(rel)
    if path.exists():
        data = tomllib.loads(path.read_text())
        print(f"\n=== {rel} dependencies ===")
        for dep in data.get("project", {}).get("dependencies", []):
            print(dep)
PY

Expected result: the install cell shows --no-deps, later cells import nat.*, and the package metadata lists the runtime dependencies that the current command suppresses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 37 - 45, The
editable pip install line in the notebook cell (the command starting with "!uv
pip install -q --no-deps -e ../../packages/nvidia_nat_eval -e
../../packages/nvidia_nat_ragas") suppresses transitive runtime dependencies and
breaks later "nat.*" imports; remove the "--no-deps" flag from that install
invocation so the packages' declared dependencies (from nvidia_nat_eval and
nvidia_nat_ragas) are installed when bootstrapping the notebook.

1-24: ⚠️ Potential issue | 🟠 Major

Add the SPDX Apache-2.0 header as the first notebook cell.

The notebook still starts directly with the title cell, so the required file-level license header and updated 2026 copyright notice are missing. Please insert a markdown cell before the title with the standard SPDX Apache-2.0 header.

As per coding guidelines: All code should be licensed under the Apache License 2.0, should contain an Apache License 2.0 header comment at the top of each file, and changed files must have up-to-date copyright years.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/notebooks/eval_atif_standalone.ipynb` around lines 1 - 24, Insert a
new first notebook cell (a markdown cell before the existing title cell in the
cells array) containing the SPDX Apache-2.0 header and updated 2026 copyright
notice so the notebook begins with the required file-level license header;
ensure the header is formatted as a markdown cell and placed before the existing
cell that contains the title "# Standalone ATIF Evaluation with NeMo Agent
Toolkit".
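The cell insertion this prompt describes can be sketched programmatically. The header wording below is an assumption; the real fix should use the repository's standard Apache-2.0 license text:

```python
# Hedged sketch: prepend an SPDX Apache-2.0 markdown cell to a notebook's
# cell list, skipping the insert if a header is already present.
HEADER_SOURCE = [
    "SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION "
    "& AFFILIATES. All rights reserved.\n",
    "SPDX-License-Identifier: Apache-2.0",
]

def prepend_spdx_header(nb: dict) -> dict:
    """Insert the header cell first unless the notebook already has one."""
    first = "".join(nb["cells"][0].get("source", [])) if nb.get("cells") else ""
    if "SPDX-License-Identifier" not in first:
        nb.setdefault("cells", []).insert(
            0, {"cell_type": "markdown", "metadata": {}, "source": HEADER_SOURCE})
    return nb

nb = {"cells": [{"cell_type": "markdown", "metadata": {},
                 "source": ["# Standalone ATIF Evaluation with NeMo Agent Toolkit"]}]}
nb = prepend_spdx_header(nb)
print(nb["cells"][0]["source"][1])  # → SPDX-License-Identifier: Apache-2.0
```

The idempotence check matters in a notebook workflow, where a cell may be re-run after the header already exists.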
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@examples/notebooks/eval_atif_standalone.ipynb`:
- Around line 37-45: The editable pip install line in the notebook cell (the
command starting with "!uv pip install -q --no-deps -e
../../packages/nvidia_nat_eval -e ../../packages/nvidia_nat_ragas") suppresses
transitive runtime dependencies and breaks later "nat.*" imports; remove the
"--no-deps" flag from that install invocation so the packages' declared
dependencies (from nvidia_nat_eval and nvidia_nat_ragas) are installed when
bootstrapping the notebook.
- Around line 1-24: Insert a new first notebook cell (a markdown cell before the
existing title cell in the cells array) containing the SPDX Apache-2.0 header
and updated 2026 copyright notice so the notebook begins with the required
file-level license header; ensure the header is formatted as a markdown cell and
placed before the existing cell that contains the title "# Standalone ATIF
Evaluation with NeMo Agent Toolkit".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9c651d9c-b2e2-421d-b690-965681506a7d

📥 Commits

Reviewing files that changed from the base of the PR and between 0b4f3e2 and f1f1389.

📒 Files selected for processing (1)
  • examples/notebooks/eval_atif_standalone.ipynb

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv
Contributor Author

/merge

1 similar comment
@yczhang-nv
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f33952b into NVIDIA:develop Mar 7, 2026
22 of 23 checks passed
@yczhang-nv yczhang-nv deleted the yuchen-eval-python-api-notebook branch March 7, 2026 07:48

Labels

doc Improvements or additions to documentation non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants