snapeval

snapeval is a harness-agnostic evaluation runner for agentskills.io skills. It benchmarks each skill against a no-skill baseline.

snapeval runs every eval case with and without your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.

snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
  #1 formal greeting for Eleanor
    Skill: 100% | Baseline: 33% | 5.2s
  #2 casual greeting for Marcus
    Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
  #3 pirate greeting for Zoe
    Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
  Skill pass rate:    100.0%
  Baseline pass rate: 55.6%
  Improvement:        +44.4%

How it works

  1. You write a SKILL.md and an evals.json with test cases and assertions
  2. snapeval runs each eval twice — once with your skill loaded, once without (baseline)
  3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic)
  4. A benchmark shows where your skill adds value vs. where the raw AI already handles it
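
The delta in step 4 can be sketched as follows. This is a minimal illustration, not snapeval's actual implementation: the `EvalResult` shape, the function names, and the choice to pool assertions across evals are all assumptions. The sample figures assume three assertions per eval, which is consistent with the greeter output above.

```typescript
// Hypothetical shape: passed/total assertion counts for one eval in one variant.
interface EvalResult {
  id: number;
  passed: number;
  total: number;
}

// Pooled pass rate in [0, 1]: passed assertions divided by all assertions.
function passRate(results: EvalResult[]): number {
  const total = results.reduce((sum, r) => sum + r.total, 0);
  if (total === 0) return 0; // nothing to grade
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  return passed / total;
}

// Improvement is the skill pass rate minus the baseline pass rate.
function benchmarkDelta(withSkill: EvalResult[], baseline: EvalResult[]) {
  const skill = passRate(withSkill);
  const base = passRate(baseline);
  return { skill, baseline: base, improvement: skill - base };
}

// Assumed counts behind the greeter run: 3 assertions per eval.
const withSkill: EvalResult[] = [
  { id: 1, passed: 3, total: 3 },
  { id: 2, passed: 3, total: 3 },
  { id: 3, passed: 3, total: 3 },
];
const baseline: EvalResult[] = [
  { id: 1, passed: 1, total: 3 },
  { id: 2, passed: 2, total: 3 },
  { id: 3, passed: 2, total: 3 },
];

const delta = benchmarkDelta(withSkill, baseline);
const pct = (x: number): string => (x * 100).toFixed(1) + "%";
// pct(delta.skill) is "100.0%", pct(delta.baseline) is "55.6%",
// and pct(delta.improvement) is "44.4%".
```

With equal assertion counts per eval, pooling and averaging per-eval rates give the same result, so the sketch does not depend on which one snapeval uses.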

Quick start

As a Copilot plugin

copilot plugin install matantsach/snapeval

Then, in Copilot CLI, say "evaluate my skill". The snapeval skill handles the rest.

Standalone CLI

git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval <skill-dir>

Eval format

my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/         ← optional deterministic checks
        └── validate.sh

evals.json:

{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}

| Field | Required | Description |
|---|---|---|
| id | yes | Unique numeric identifier |
| prompt | yes | The user prompt sent to the harness |
| expected_output | yes | Human description of the expected behavior |
| label | no | Human-readable name shown in terminal output |
| slug | no | Filesystem-safe name for the eval directory |
| assertions | no | List of assertions to grade (LLM semantic or script: prefixed) |
| files | no | Input files to attach to the prompt |
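
For readers who generate or lint evals.json programmatically, the fields above could be modeled roughly like this. The field names and required/optional split come from the table; the concrete types (for example, `files` as a list of path strings), the `validateEvals` helper, and its error messages are assumptions rather than snapeval's own schema.

```typescript
// Types are a sketch: names follow the field table, value types are assumed.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  label?: string;
  slug?: string;
  assertions?: string[];
  files?: string[];
}

interface EvalsFile {
  skill_name: string;
  evals: EvalCase[];
}

// Hypothetical helper: returns a list of problems; empty means the
// required fields are present and ids are unique.
function validateEvals(data: unknown): string[] {
  if (typeof data !== "object" || data === null) {
    return ["evals.json must contain a JSON object"];
  }
  const evals = (data as { evals?: unknown }).evals;
  if (!Array.isArray(evals)) {
    return ["evals must be an array"];
  }
  const problems: string[] = [];
  const seenIds = new Set<number>();
  evals.forEach((raw: unknown, index: number) => {
    const e = (typeof raw === "object" && raw !== null ? raw : {}) as Partial<EvalCase>;
    if (typeof e.id !== "number") {
      problems.push(`evals[${index}].id must be a number`);
    } else if (seenIds.has(e.id)) {
      problems.push(`evals[${index}].id ${e.id} is duplicated`);
    } else {
      seenIds.add(e.id);
    }
    if (typeof e.prompt !== "string") {
      problems.push(`evals[${index}].prompt must be a string`);
    }
    if (typeof e.expected_output !== "string") {
      problems.push(`evals[${index}].expected_output must be a string`);
    }
  });
  return problems;
}
```

Calling `validateEvals(JSON.parse(text))` on the greeter example returns an empty list.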

Assertions

Semantic — graded by an LLM. Write specific, verifiable statements:

"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"

Script — prefix with script:. Scripts live in evals/scripts/, receive the output directory as $1, and pass on exit code 0:

"script:validate-json-structure.sh"

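The split between the two assertion kinds can be sketched as a small parser. The `script:` prefix, the evals/scripts/ location, and the exit-code-0 rule come from the description above; the `Assertion` type and the function names are hypothetical and may not match how snapeval routes assertions internally.

```typescript
// Hypothetical tagged union for the two assertion kinds.
type Assertion =
  | { kind: "semantic"; statement: string }
  | { kind: "script"; scriptPath: string };

const SCRIPT_PREFIX = "script:";

// Strings starting with "script:" name a file under the scripts directory;
// everything else is treated as a statement for the LLM judge.
function parseAssertion(raw: string, scriptsDir: string = "evals/scripts"): Assertion {
  if (raw.startsWith(SCRIPT_PREFIX)) {
    const name = raw.slice(SCRIPT_PREFIX.length).trim();
    return { kind: "script", scriptPath: `${scriptsDir}/${name}` };
  }
  return { kind: "semantic", statement: raw };
}

// A script assertion passes only when the script exits with code 0.
function scriptPassed(exitCode: number | null): boolean {
  return exitCode === 0;
}
```

A runner built on this sketch would invoke `scriptPath` with the output directory as its first argument and feed the exit code to `scriptPassed`.
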
CLI reference

eval

Run evals, grade assertions, compute benchmark.

npx snapeval eval [skill-dir] [options]

| Flag | Description | Default |
|---|---|---|
| --harness <name> | Harness adapter | copilot-sdk |
| --inference <name> | Inference adapter for grading | auto |
| --workspace <path> | Output directory | ../{skill_name}-workspace |
| --runs <n> | Harness invocations per eval for statistical averaging | 1 |
| --concurrency <n> | Parallel eval cases (1-10) | 1 |
| --only <ids> | Run specific eval IDs (e.g. --only 1,3,5) | all |
| --threshold <rate> | Minimum pass rate 0-1 for exit code 0 | none |
| --old-skill <path> | Compare against old skill version | none |
| --feedback | Write feedback.json template for human review | off |

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Threshold not met (eval ran but pass rate below --threshold) |
| 2 | Config/input error (bad JSON, missing fields, invalid flags) |
| 3 | File not found (missing skill dir, evals.json, or script) |
| 4 | Runtime error (harness failure, grading failure, timeout) |
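
A wrapper script that needs to react to these codes could encode them as constants. The numeric values come from the table above; the `ExitCode` names and the `exitCodeForRun` helper are illustrative assumptions, and the sketch covers only the threshold decision, not the error paths.

```typescript
// Values from the exit-code table; the constant names are made up for this sketch.
const ExitCode = {
  Success: 0,
  ThresholdNotMet: 1,
  ConfigError: 2,
  FileNotFound: 3,
  RuntimeError: 4,
} as const;

// passRate and threshold are fractions in [0, 1].
// threshold is undefined when --threshold was not passed.
function exitCodeForRun(passRate: number, threshold?: number): number {
  if (threshold === undefined) return ExitCode.Success;
  // Code 1 applies only when the pass rate is below the threshold.
  return passRate >= threshold ? ExitCode.Success : ExitCode.ThresholdNotMet;
}
```

Under this sketch, `exitCodeForRun(0.75, 0.8)` yields 1, while a run without a threshold always yields 0 once the evals complete.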

Output artifacts

Each run creates an iteration directory:

workspace/
└── iteration-1/
    ├── benchmark.json       ← aggregate stats with delta
    ├── SKILL.md.snapshot    ← copy of skill used
    └── eval-{slug}/
        ├── with_skill/
        │   ├── outputs/output.txt
        │   ├── timing.json
        │   ├── grading.json
        │   └── transcript.log
        └── without_skill/
            ├── outputs/output.txt
            ├── timing.json
            └── grading.json

benchmark.json includes metadata: eval_count, eval_ids, skill_name, runs_per_eval, timestamp.

CI integration

name: Skill Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3

snapeval exits with code 1 when the pass rate falls below the threshold, which fails the job and blocks the PR.

Configuration

Create snapeval.config.json in your skill or project root:

{
  "harness": "copilot-sdk",
  "inference": "auto",
  "workspace": "../{skill_name}-workspace",
  "runs": 1,
  "concurrency": 1
}

Resolution order: defaults → project config → skill-dir config → CLI flags.
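
One way to read the resolution order is as a layered merge in which each later source overrides the keys it defines. The sketch below assumes exactly that; the `SnapevalConfig` interface and the `resolveConfig` and `expandWorkspace` helpers are hypothetical, the defaults are taken from the example config above, and substituting `{skill_name}` with the skill's name is an assumption about how the placeholder is used.

```typescript
// Keys mirror the example snapeval.config.json; the interface is an assumption.
interface SnapevalConfig {
  harness: string;
  inference: string;
  workspace: string;
  runs: number;
  concurrency: number;
}

const DEFAULTS: SnapevalConfig = {
  harness: "copilot-sdk",
  inference: "auto",
  workspace: "../{skill_name}-workspace",
  runs: 1,
  concurrency: 1,
};

// Layers are applied left to right, so pass them in resolution order:
// project config, skill-dir config, CLI flags.
function resolveConfig(...layers: Partial<SnapevalConfig>[]): SnapevalConfig {
  return layers.reduce<SnapevalConfig>(
    (merged, layer) => ({ ...merged, ...layer }),
    DEFAULTS,
  );
}

// Assumed behavior: the {skill_name} placeholder is replaced with the skill's name.
function expandWorkspace(template: string, skillName: string): string {
  return template.replace("{skill_name}", skillName);
}

const projectConfig: Partial<SnapevalConfig> = { runs: 3 };
const skillDirConfig: Partial<SnapevalConfig> = { concurrency: 4 };
const cliFlags: Partial<SnapevalConfig> = { runs: 5 };
const resolved = resolveConfig(projectConfig, skillDirConfig, cliFlags);
// resolved.runs is 5 (CLI flag wins), resolved.concurrency is 4,
// and resolved.harness stays "copilot-sdk" from the defaults.
```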

Harness adapters

| Adapter | Description | Default |
|---|---|---|
| copilot-sdk | Programmatic via @github/copilot-sdk with native skill loading | yes |
| copilot-cli | Shells out to copilot CLI binary | no |

The SDK harness loads skills natively via skillDirectories, captures full transcripts, and extracts real token counts from assistant.usage events.

Inference adapters

| Adapter | Description |
|---|---|
| auto | Uses @github/copilot-sdk by default, falls back to GitHub Models API |
| copilot-sdk | @github/copilot-sdk programmatic |
| github-models | GitHub Models API (requires GITHUB_TOKEN) |

Contributing

See CONTRIBUTING.md.

License

MIT