Reproduction Validation Skill
Methodology for validating codebase reproduction and ensuring experimental integrity.
Methodology for validating codebase reproduction and ensuring experimental integrity.
Overview
Ensures that:
- Baseline reproduction metrics match paper-reported values
- Code modifications are verified before full training
- Experimental results are genuine (not fabricated)
Reproduction Quality Tiers
| Deviation from Paper | Tier | Action |
|---|---|---|
| <5% | Excellent | Proceed to experiments |
| 5-15% | Acceptable | Note deviation, likely GPU/seed variation |
| >15% | Investigate | Check config, data, environment |
Common Reproduction Failure Modes
Based on Raff (2019) study of 255 papers:
Environment Issues
- CUDA version mismatch: Paper uses CUDA 11.x, repo requires 12.x
- Fix: install matching PyTorch build for your CUDA version
- Undocumented preprocessing: Data needs preprocessing not mentioned in README
- Fix: check
data/directory for scripts, read Issues for hints
- Fix: check
- Missing dependencies: requirements.txt is incomplete
- Fix: run code, install missing packages iteratively
- Hardcoded paths:
/home/author/data/in source code- Fix: Grep for hardcoded paths, replace with config values
Training Issues
- Unreported hyperparameters: Paper doesn't mention weight decay, warmup steps, etc.
- Fix: check codebase config files/defaults, reference similar papers
- Unreported data augmentation: Significant augmentation not mentioned in paper
- Fix: read data loading code carefully
- Random seed sensitivity: Results vary ±2-3% across seeds
- Fix: run 3× with different seeds, report mean ± std
- Pretrained weight version: Different checkpoint version gives different baseline
- Fix: verify exact checkpoint URL/hash from repo
Data Issues
- Dataset version drift: Dataset updated since paper was published
- Fix: check for version tags, download specific version
- Train/val split mismatch: Different split ratios or random seeds
- Fix: use exact split files from repo (often in
data/splits/)
- Fix: use exact split files from repo (often in
- Preprocessing pipeline: Normalization, resizing, cropping differences
- Fix: compare preprocessing code vs paper description
Three-Layer Defense System
Layer 1: Micro-Experiment Verification
After every code modification, before any training:
✓ Forward pass: output shape correct, no NaN/Inf
✓ Gradient flow: all new parameters have non-zero gradients
✓ Micro-training: 100 steps, loss decreases
✓ Before/after: same batch produces different outputs
Cost: <5 minutes. Catches: ~90% of code-level errors.
Layer 2: Reproduction Comparison
After baseline training completes:
✓ Compare metric values vs paper Table X
✓ Check evaluation protocol matches (same split, same metric)
✓ Record deviation and assess tier (excellent/acceptable/investigate)
Layer 3: Anti-Fabrication (Mechanical Enforcement)
✓ results.json is ONLY written by training scripts (hook-protected)
✓ sanity-check.json is ONLY written by validation scripts (hook-protected)
✓ LLM can Read these files but NOT Write/Edit them
✓ If results.json doesn't exist → report "experiment incomplete"
This is CODE-LEVEL enforcement via PreToolUse hook, not a prompt-level request.
Reproducibility Checklist
Before reporting any experimental result, verify:
- Environment is locked (environment-lock.json exists)
- Random seeds are fixed and recorded
- Config files used for this run are saved
- Training log exists and shows no errors
- Evaluation uses same protocol as baseline
- Results come from training script output (not manually entered)
- Deviation from paper baseline is documented
Fairness Constraints
For any experiment comparison:
- Same data augmentation for our method AND baseline
- Same hyperparameter search budget
- Report: Method | Metric | Params | FLOPs | Train Time
- Baseline re-run with same augmentation/tricks if we add any
Related
- Agent:
experiment-runner— executes experiments with anti-fabrication constraints - Agent:
env-setup— builds reproducible environments - Reference:
micro-experiment.md— detailed verification protocol - Reference:
error-taxonomy.md— error classification and recovery