

Set up evaluations for an AI skill from scratch — designs test scenarios, writes evals.json, and runs the first benchmark. Use when no evals exist yet and the user wants to evaluate, test, benchmark, or review a skill. Triggers on "evaluate my skill", "test my skill", "set up evals", "how good is my skill", "benchmark this skill", "create evals for", or any request to assess skill quality when there is no existing evals/evals.json file.
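The description above has the skill write an evals.json file, but does not fix its schema. As a purely hypothetical sketch (every field name here is an assumption, not a documented format), a minimal file might look like:

```json
{
  "cases": [
    {
      "name": "basic-usage",
      "prompt": "Summarize this README in two sentences.",
      "expected": "A two-sentence summary covering install and usage."
    },
    {
      "name": "edge-case-empty-input",
      "prompt": "Summarize this empty file.",
      "expected": "A graceful note that there is nothing to summarize."
    }
  ]
}
```

Whatever the real schema is, each test scenario typically pairs an input with a judgment criterion, so the first benchmark run has something concrete to score against.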

Run and iterate on existing skill evaluations. Use when evals/evals.json already exists and the user wants to run evals, re-evaluate after skill changes, check results, compare iterations, add/modify eval cases, or gate CI with thresholds. Triggers on "run evals", "re-eval", "how did it do", "check results", "compare iterations", "run benchmarks", or any eval-related request when evals already exist.
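Gating CI with thresholds, as mentioned above, can be sketched roughly as follows. The results structure and the 0.7 cutoff are assumptions for illustration; the actual evals.json schema is not specified in these descriptions:

```python
# Hypothetical eval results; the real evals.json schema is an assumption here.
results = {
    "cases": [
        {"name": "basic-usage", "passed": True},
        {"name": "edge-case", "passed": True},
        {"name": "formatting", "passed": True},
        {"name": "regression", "passed": False},
    ]
}

def pass_rate(results: dict) -> float:
    """Fraction of eval cases that passed."""
    cases = results["cases"]
    return sum(c["passed"] for c in cases) / len(cases)

THRESHOLD = 0.7  # assumed CI gate; tune per skill

rate = pass_rate(results)
gate_ok = rate >= THRESHOLD
print(f"pass rate: {rate:.0%} -> {'pass' if gate_ok else 'fail'}")
```

A CI job would exit nonzero when `gate_ok` is false, blocking merges that regress the skill below the chosen threshold.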