chore: add data-designer skill evals #718
Review: PR #718 —
Greptile Summary
Adds
| Filename | Overview |
|---|---|
| skills/data-designer/evals/evals.json | New eval file with 4 positive autopilot evals and 2 negative routing evals; one positive eval explicitly names the skill in the question (bypassing routing), and three autopilot evals omit the "agent context" step check required by autopilot.md. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
Q[Incoming question] --> R{Skill router}
R -->|natural language: person reviews| DD1[data-designer-autopilot-person-reviews]
R -->|natural language: IoT telemetry| DD2[data-designer-autopilot-sampler-params]
R -->|explicit: 'in autopilot'| DD3[data-designer-autopilot-llm-judge-scores]
R -->|explicit: 'Use the data-designer skill'| DD4[data-designer-autopilot-support-tickets]
R -->|PostgreSQL admin| NEG1[data-designer-negative-database-admin\nexpected_skill: null]
R -->|React UI| NEG2[data-designer-negative-react-component\nexpected_skill: null]
DD1 --> AW1[Autopilot workflow\n+ person-sampling reference]
DD2 --> AW2[Autopilot workflow\n+ sampler type/params checks]
DD3 --> AW3[Autopilot workflow\n+ LLM judge score .score checks]
DD4 --> AW4[Autopilot workflow\n+ validate + preview]
AW1 & AW2 & AW3 & AW4 --> ES[expected_skill: data-designer]
NEG1 & NEG2 --> EN[expected_skill: null]
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
skills/data-designer/evals/evals.json:4
**Explicit skill name in question bypasses routing verification**
The question for `data-designer-autopilot-support-tickets` opens with "Use the data-designer skill to create…", which is a direct invocation rather than natural language. The other positive evals ("Create a synthetic e-commerce product review dataset…", "Generate a synthetic IoT sensor telemetry dataset…") rely on natural phrasing, so the router must select the skill on its own. With the skill named explicitly, the `expected_skill` assertion will always pass regardless of whether the router would choose `data-designer` for a generic support-ticket request, so this case cannot catch routing regressions for that scenario.
### Issue 2 of 2
skills/data-designer/evals/evals.json:24-36
**"Agent context" step not verified in three of four autopilot evals**
`autopilot.md` step 2 requires running `data-designer agent context` ("Learn" step) before writing the script — this is where the agent inspects column schemas and avoids guessing types. Only `data-designer-autopilot-support-tickets` includes `"The agent ran data-designer agent context before writing the script"` in `expected_behavior`. The remaining three autopilot evals (`data-designer-autopilot-person-reviews`, `data-designer-autopilot-llm-judge-scores`, and `data-designer-autopilot-sampler-params`) do not assert this step was taken, so an agent that skips the Learn step would still pass those evals as long as its output happens to be correct.
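Both issues could be caught mechanically before review. The sketch below is a minimal, hypothetical lint and not part of this PR. It assumes `evals.json` loads as a list of objects with `question`, `expected_skill`, and `expected_behavior` fields (the names used in this review) plus a `name` field, which is an assumption. It also treats every positive eval as an autopilot eval, which holds for the cases described here.

```python
import re

# Assumed shape: each eval is a dict with "question", "expected_skill",
# and "expected_behavior" keys (names taken from the review text).
# The "name" key is hypothetical; adjust to the real schema.
SKILL = "data-designer"
CONTEXT_CMD = "data-designer agent context"


def names_skill_explicitly(question: str, skill: str = SKILL) -> bool:
    """Return True if the question invokes the skill by name,
    e.g. 'Use the data-designer skill to ...'."""
    pattern = rf"\b{re.escape(skill)}\s+skill\b"
    return re.search(pattern, question, re.IGNORECASE) is not None


def asserts_agent_context(expected_behavior: list[str]) -> bool:
    """Return True if any expected-behavior entry mentions the Learn-step command."""
    return any(CONTEXT_CMD in item for item in expected_behavior)


def lint_evals(evals: list[dict]) -> list[str]:
    """Collect findings for positive evals; negative evals are skipped."""
    findings = []
    for ev in evals:
        if ev.get("expected_skill") != SKILL:
            continue  # negative routing evals have expected_skill set to null
        name = ev.get("name", "<unnamed>")
        if names_skill_explicitly(ev.get("question", "")):
            findings.append(f"{name}: question names the skill explicitly")
        if not asserts_agent_context(ev.get("expected_behavior", [])):
            findings.append(f"{name}: no agent-context assertion")
    return findings
```

To run it against the file, load the JSON with `json.load` and pass the resulting list to `lint_evals`; an empty result means neither issue was detected under the assumptions above.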
Reviews (6): Last reviewed commit: "test: add data-designer skill evals"
0a5e916 to b6cd817
/nvskills-ci
@johnnygreco - I think this is failing because it's missing the DCO sign-off. Run `git rebase --signoff origin/main && git push --force-with-lease`.
4b9fc72 to d0f0a40
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
d0f0a40 to 467a900
/nvskills-ci
📋 Summary
Adds targeted eval coverage for the `data-designer` skill so Autopilot routing and skill-specific behaviors are easier to verify. The cases focus on Data Designer workflow use, person sampling, LLM judge score access, sampler params, and unrelated negative prompts.
🔗 Related Issue
N/A
🔄 Changes
`skills/data-designer/evals/evals.json` with focused positive evals for Autopilot dataset generation scenarios.
🧪 Testing
`make test` passes (not run; eval JSON only)
`python3 -m json.tool skills/data-designer/evals/evals.json` passes
✅ Checklist