Conference Scanning Skill
Efficiently scan 500+ conference papers to identify low-cost research opportunities.
Overview
Used by /research --scout "CVPR 2025" --budget=8h to identify papers that:
- Fit within a given GPU budget
- Have open-source code
- Are relevant to the user's research domain
- Have high extension potential
Scanning Strategy
Funnel Approach — QA-Based Analysis
Stage 1: Paper List Fetch (all papers, ~8 min for CVF venues)
→ CVF venues (CVPR/ICCV/ECCV): CVF Open Access (primary) + S2 (abstracts)
→ Non-CVF venues: Semantic Scholar Bulk API (primary)
→ raw-papers.json with title, abstract, authors, pdf_url, paper_id
Stage 2: Cost-First Screening (all papers, ~seconds)
→ Load keyword lists from ~/.claude/research-pipeline/keywords/
→ KILL only if heavy_score >= 2 AND light_score == 0
→ Rank by: cost_score + domain_bonus + code_bonus
→ Top 300 → screened.json
Stage 3: PDF Download to Global Cache (200-400 papers, Master Agent)
→ Check .research/paper-cache/ — skip already-cached papers
→ Batch download: curl -sL → pdftotext → .research/paper-cache/txt/{paper_id}.txt
→ Remove PDFs after conversion (keep only TXT)
→ Update paper-cache/index.json
Stage 4: QA-Based Analysis (dispatched to opportunity-scorer, model: sonnet)
→ Read paper text from paper-cache/txt/{paper_id}.txt
→ Answer 6 questions: method, training, hardware, feasibility, code, value
→ Output: papers-analyzed-batch{N}.json (single file per batch)
Stage 5: Merge (Master Agent, Python)
→ Merge batches → papers-analyzed.json (atomic write)
→ Sort by: feasibility_verdict + research_value
→ Delete batch files
Stage 5.5: Mechanical Feasibility (Python, zero LLM cost)
→ python3 research_utils.py verify_feasibility papers-analyzed.json
→ Adds: mechanical_verdict, mechanical_flags, verdict_disagrees
→ Catches: compute gap (32×H100→1×4090), VRAM overflow (multi-model), time overflow
Stage 5.6: Second-Pass Verification (Sonnet, high-value/flagged papers only)
→ Trigger: research_value≥8 OR verdict_disagrees OR mechanical_flags non-empty
→ Re-read paper with mechanical arithmetic results as constraints
→ Updates feasibility_verdict (preserves original_verdict for audit)
Stage 6: Report Generation (Master Agent)
→ Read papers-analyzed.json → report.md
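The Stage 5.6 trigger is a plain boolean OR over three fields, so it can be checked mechanically before any paper is sent back to Sonnet. A minimal sketch, assuming the field names from Stage 5.5; the helper name `needs_second_pass` and the flag value `vram_overflow` are hypothetical:

```python
def needs_second_pass(paper: dict) -> bool:
    """Return True when a paper meets any Stage 5.6 trigger condition."""
    high_value = (paper.get("research_value") or 0) >= 8
    disagrees = bool(paper.get("verdict_disagrees"))
    flagged = bool(paper.get("mechanical_flags"))  # non-empty list
    return high_value or disagrees or flagged


# Illustrative records only; the flag name is an assumption.
papers = [
    {"paper_id": "a", "research_value": 9, "verdict_disagrees": False, "mechanical_flags": []},
    {"paper_id": "b", "research_value": 5, "verdict_disagrees": True, "mechanical_flags": []},
    {"paper_id": "c", "research_value": 6, "verdict_disagrees": False, "mechanical_flags": ["vram_overflow"]},
    {"paper_id": "d", "research_value": 7, "verdict_disagrees": False, "mechanical_flags": []},
]
second_pass_ids = [p["paper_id"] for p in papers if needs_second_pass(p)]
print(second_pass_ids)  # prints ['a', 'b', 'c']
```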
Dispatch Prompt Templates
Stage 1: Paper List Fetch (Inline Script — No Worker Dispatch)
Venue-aware strategy — Master Agent runs this directly:
CVF venues (CVPR/ICCV/ECCV):
Phase A: wget CVF Open Access listing page → parse HTML (title, authors, PDF URL)
Expected: CVPR ~2800+, ICCV ~2000+, ECCV ~2400+
If < 100 papers → CVF not published yet, fall back to S2
Phase B: S2 Bulk API (paginated) → title-match to enrich with abstracts
GET /paper/search/bulk?query=&venue={venue}&year={year}
&fields=title,abstract,externalIds,openAccessPdf
Match rate: ~65% (S2 indexing lags behind CVF)
Phase C: For unmatched papers → parallel scrape CVF paper pages (5 workers)
Each page has <div id="abstract">...</div>
~6 min for ~1000 pages
Non-CVF venues (ICLR/NeurIPS/ICML):
Primary: Semantic Scholar Bulk API (covers most papers with abstracts)
TODO: OpenReview API v2 enrichment
Output: $SCOUT_DIR/raw-papers.json
Fields: title, authors, abstract, pdf_url, paper_id, arxiv_id, doi, s2_paper_id
See research.md Step 1 for the complete inline Python script.
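Phase B's title matching could look roughly like the sketch below. It is not the script from research.md. The normalization rule (ASCII-fold, lowercase, collapse punctuation) and the helper names are assumptions, and `paperId` is the identifier field returned by the Semantic Scholar API.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Fold accents, lowercase, and collapse punctuation so near-identical titles compare equal."""
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def enrich_with_abstracts(cvf_papers: list, s2_papers: list) -> list:
    """Copy S2 abstracts onto CVF entries in place; return the entries left unmatched."""
    s2_by_title = {normalize_title(p["title"]): p for p in s2_papers if p.get("title")}
    unmatched = []
    for paper in cvf_papers:
        hit = s2_by_title.get(normalize_title(paper["title"]))
        if hit and hit.get("abstract"):
            paper["abstract"] = hit["abstract"]
            paper["s2_paper_id"] = hit.get("paperId", "")
        else:
            unmatched.append(paper)  # candidates for the Phase C page scrape
    return unmatched


# Made-up records for illustration.
cvf = [{"title": "Fast-ViT: Efficient Vision Transformers"}, {"title": "Unindexed Paper"}]
s2 = [{"paperId": "abc123", "title": "Fast-ViT: efficient vision transformers.", "abstract": "We propose ..."}]
unmatched = enrich_with_abstracts(cvf, s2)
print(len(unmatched))  # prints 1
```

Exact matching on normalized titles is deliberately strict: a missed match only costs one extra Phase C scrape, whereas a fuzzy false match would attach the wrong abstract.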
Stage 2: Cost-First Keyword Screening Script
# Cost-first screening (Master Agent runs this inline via python3 -c)
# Key design decisions:
# - KILL only when confirmed heavy AND no lightweight signal (heavy>=2 & light==0)
# - Domain match is a BONUS (0-3), NOT a gate — low-cost papers survive without domain match
# - Deterministic: same input → same output, no LLM judgment
# - Top 300 fixed cutoff for reproducibility
import json, os
kw_dir = os.path.expanduser('~/.claude/research-pipeline/keywords')
with open(f'{kw_dir}/cv-domains.json') as f:
    domains = json.load(f)['domains']
with open(f'{kw_dir}/lightweight-signals.json') as f:
    lightweight = json.load(f)
with open(f'{kw_dir}/heavy-signals.json') as f:
    heavy = json.load(f)
# The shell expands SCOUT_DIR before Python runs (the script is passed to python3 -c)
with open('$SCOUT_DIR/raw-papers.json') as f:
    papers = json.load(f)
results = []
for p in papers:
    # Quotes are backslash-escaped because the script is passed to python3 -c;
    # or '' guards against null title/abstract values
    text = f\"{p.get('title') or ''} {p.get('abstract') or ''}\".lower()
    # Lightweight signals (positive cost indicators)
    light_score, light_matches = 0.0, []
    for name, sig in lightweight['positive_signals'].items():
        if any(kw.lower() in text for kw in sig['keywords']):
            light_score += sig['weight']
            light_matches.append(name)
    # Heavy signals (negative cost indicators)
    heavy_score, heavy_matches = 0, []
    for name, sig in heavy['negative_signals'].items():
        if any(kw.lower() in text for kw in sig['keywords']):
            heavy_score += 1
            heavy_matches.append(name)
    # KILL condition: confirmed heavy AND no lightweight signal
    if heavy_score >= 2 and light_score == 0:
        continue
    # Cost score: reward lightweight signals, bonus for zero heavy
    cost_score = min(light_score * 3, 9.0) + (1.0 if heavy_score == 0 else 0)
    # Domain bonus (0-3): additive, NOT a gate
    domain_matches = []
    for dname, dinfo in domains.items():
        if any(kw.lower() in text for kw in dinfo.get('include', [])):
            domain_matches.append(dname)
    domain_bonus = min(len(domain_matches), 3)
    # Code availability bonus (0 or 2)
    code_bonus = 2 if ('github.com' in text or 'code available' in text or 'code is available' in text) else 0
    composite = cost_score + domain_bonus + code_bonus
    results.append({
        'title': p.get('title',''), 'authors': p.get('authors',''),
        'abstract': p.get('abstract',''), 'pdf_url': p.get('pdf_url',''),
        'paper_id': p.get('paper_id', p.get('forum','')),
        'arxiv_id': p.get('arxiv_id',''), 's2_paper_id': p.get('s2_paper_id',''),
        'cost_score': round(cost_score, 1), 'domain_bonus': domain_bonus,
        'domain_matches': domain_matches, 'code_bonus': code_bonus,
        'composite_score': round(composite, 1),
        'light_matches': light_matches, 'heavy_matches': heavy_matches
    })
results.sort(key=lambda x: x['composite_score'], reverse=True)
top = results[:300]
import sys
sys.path.insert(0, os.path.expanduser('~/.claude/scripts/lib'))
from research_utils import atomic_json_write
atomic_json_write(top, '$SCOUT_DIR/screened.json')
killed = len(papers) - len(results)
print(f'Screening: {len(papers)} total → {killed} killed (heavy>=2 & light==0) → {len(results)} survived → top 300 saved')
print(f'With lightweight signals: {sum(1 for r in top if r[\"light_matches\"])}/{len(top)}')
print(f'No domain match: {sum(1 for r in top if r[\"domain_bonus\"]==0)}/{len(top)}')
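To make the scoring rules concrete, here is a small worked example that applies the same formulas to in-memory keyword lists. The dictionary shapes mirror what the script reads, but every signal name, keyword, and weight below is a hypothetical stand-in for the real files under ~/.claude/research-pipeline/keywords/.

```python
# Hypothetical keyword data, shaped like the JSON files the script loads.
lightweight = {"positive_signals": {
    "training_free": {"keywords": ["training-free", "zero-shot"], "weight": 1.5},
    "peft": {"keywords": ["lora", "adapter"], "weight": 1.0},
}}
heavy = {"negative_signals": {
    "pretraining": {"keywords": ["pretraining from scratch"]},
    "large_cluster": {"keywords": ["256 gpus", "tpu pod"]},
}}
domains = {"segmentation": {"include": ["segmentation"]}}


def composite_score(title: str, abstract: str):
    """Return the composite score, or None when the paper hits the KILL condition."""
    text = f"{title} {abstract}".lower()
    light = sum(s["weight"] for s in lightweight["positive_signals"].values()
                if any(k in text for k in s["keywords"]))
    heavy_n = sum(1 for s in heavy["negative_signals"].values()
                  if any(k in text for k in s["keywords"]))
    if heavy_n >= 2 and light == 0:
        return None
    cost = min(light * 3, 9.0) + (1.0 if heavy_n == 0 else 0)
    domain_bonus = min(sum(1 for d in domains.values()
                           if any(k in text for k in d.get("include", []))), 3)
    code_bonus = 2 if ("github.com" in text or "code available" in text
                       or "code is available" in text) else 0
    return round(cost + domain_bonus + code_bonus, 1)


cheap = composite_score("Training-Free Segmentation with LoRA Adapters",
                        "Code is available at github.com/example/repo.")
killed = composite_score("Pretraining from scratch on a TPU pod", "We use 256 GPUs.")
rescued = composite_score("Zero-shot transfer after pretraining from scratch on a TPU pod", "")
print(cheap, killed, rescued)  # prints 11.5 None 4.5
```

The third paper shows why the KILL rule requires light_score == 0: a single lightweight signal keeps a heavy-looking paper in the pool, just with a low rank. Note also that matching is by substring, so a short keyword such as "lora" also matches inside "exploration"; longer or more specific keywords reduce such false hits.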
Stage 4: opportunity-scorer Dispatch
Copy-paste ready template:
Task tool → subagent_type: "general-purpose", model: "sonnet"
name: "opportunity-scorer"
Prompt: "First, Read ~/.claude/agents/opportunity-scorer.md and follow those instructions exactly.
You are analyzing papers from {venue} {year} for research opportunities.
HARDWARE: {hardware_description} (e.g., 1x RTX 4090 24GB)
BUDGET: {budget_hours}h total time
INPUT:
- Read $SCOUT_DIR/screened.json for paper metadata
- Process ONLY these paper_ids: [{comma-separated list}]
- Paper text pre-cached at: $PAPER_CACHE/txt/{paper_id}.txt
For EACH paper_id:
1. Read $PAPER_CACHE/txt/{paper_id}.txt (pre-downloaded — do NOT download PDFs)
2. Answer 6 questions:
Q1: Method summary (2-3 sentences)
Q2: Training requirement? GPU setup? (Quote paper)
Q3: Hardware & Compute Profile — RAW extraction:
paper_gpu_type, paper_gpu_count, paper_training_hours,
largest_model_params_b, num_models_simultaneous,
peak_vram_reported_gb, reported_gpu_setup
Q4: Feasibility on {hardware} within {budget_hours}h?
MANDATORY ARITHMETIC: ratio = gpu_count × SPEED[gpu] / SPEED[4090=0.55]
Multi-model VRAM: Σ(params×2×1.2)
(LLM gen ≈ 100-300× forward, diffusion ≈ 20-50× forward)
Q5: Code URL?
Q6: Research value 0-10?
OUTPUT — write ONE file: $SCOUT_DIR/papers-analyzed-batch{N}.json
Format: [{paper_id, title, method_summary, requires_training,
paper_gpu_type, paper_gpu_count, paper_training_hours, reported_gpu_setup,
largest_model_params_b, num_models_simultaneous, peak_vram_reported_gb,
feasibility_verdict, feasibility_reasoning, estimated_hours,
code_url, research_value, research_value_reasoning, pdf_status, schema_version: '3.1.0'}]
Return ONLY a brief summary (NOT full JSON).
CRITICAL: Extract paper_gpu_count and num_models_simultaneous only from the cached paper text under paper_cache.
NEVER fabricate. NEVER download PDFs. Write ONE file only."
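The mandatory arithmetic in Q4 can be sketched in a few lines. Only the 4090 speed factor (0.55) comes from the template; the H100 value, the function names, and the reading of `params×2×1.2` as "billions of parameters × 2 bytes × 20% overhead, in GB" are assumptions made for illustration.

```python
# Relative GPU throughput. Only the 4090 value (0.55) appears in the prompt
# template; the H100 value is an illustrative assumption.
SPEED = {"4090": 0.55, "H100": 1.0}


def compute_ratio(gpu_type: str, gpu_count: int, target_gpu: str = "4090") -> float:
    """ratio = gpu_count x SPEED[gpu] / SPEED[target]."""
    return gpu_count * SPEED[gpu_type] / SPEED[target_gpu]


def estimated_hours(paper_training_hours: float, gpu_type: str, gpu_count: int) -> float:
    """Scale the paper's reported wall-clock hours to a single target GPU."""
    return paper_training_hours * compute_ratio(gpu_type, gpu_count)


def multi_model_vram_gb(params_b: list) -> float:
    """Sum of params x 2 x 1.2 over all models held in memory at once."""
    return sum(p * 2 * 1.2 for p in params_b)


ratio = compute_ratio("H100", 32)        # the 32xH100 -> 1x4090 case
hours = estimated_hours(10, "H100", 32)  # a 10-hour run in the paper
vram = multi_model_vram_gb([7, 7])       # two 7B models loaded together
print(round(ratio, 1), round(hours, 1), round(vram, 1))  # prints 58.2 581.8 33.6
```

Under these assumed numbers, a 10-hour run on 32 H100s becomes roughly 580 hours on one 4090, and two 7B models need about 33.6 GB, which exceeds a 24 GB card. Both results would make the paper not feasible under an 8h budget.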
GPU Cost Estimation from Papers
Papers typically report GPU info in "Implementation Details" or "Experiments" section:
Common patterns to look for:
- "We train on N× [GPU model] for T hours"
- "Training takes T hours on [GPU]"
- "All experiments are conducted on [GPU] with batch size B"
- "Total training cost: N GPU-hours"
If GPU info is not found: The opportunity-scorer sets the missing fields to null and sets feasibility_verdict: "insufficient_info".
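Downstream code has to tolerate those nulls. The sketch below shows one way to derive total GPU-hours from the raw fields and fall back to "insufficient_info" when they are missing. The helper names, and the choice to treat a missing GPU count or missing training hours as "not found", are assumptions rather than the scorer's documented behavior.

```python
def gpu_hours(paper_gpu_count, paper_training_hours):
    """Total GPU-hours (N GPUs x T hours), or None if either input is missing."""
    if paper_gpu_count is None or paper_training_hours is None:
        return None
    return paper_gpu_count * paper_training_hours


def with_fallback_verdict(record: dict) -> dict:
    """Return a copy whose verdict is 'insufficient_info' when GPU-hours cannot be computed."""
    out = dict(record)
    total = gpu_hours(out.get("paper_gpu_count"), out.get("paper_training_hours"))
    if total is None:
        out["feasibility_verdict"] = "insufficient_info"
    return out


known = gpu_hours(8, 12)  # "We train on 8x GPUs for 12 hours"
unknown = with_fallback_verdict(
    {"paper_id": "x", "paper_gpu_count": None, "paper_training_hours": None}
)
print(known, unknown["feasibility_verdict"])  # prints 96 insufficient_info
```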
QA-Based Scoring
Instead of a weighted formula, the opportunity-scorer (Sonnet) reads each paper and directly answers:
| Question | Output Fields | Why QA > Formula |
|---|---|---|
| Q1: What method? | method_summary | LLM summarizes better than keyword extraction |
| Q2: Training needed? | requires_training, reported_gpu_setup | Reasoning needed, not pattern matching |
| Q3: Hardware profile? | paper_gpu_type, paper_gpu_count, paper_training_hours, largest_model_params_b, num_models_simultaneous, peak_vram_reported_gb | RAW extraction — LLM reads varied formats, Python does arithmetic |
| Q4: Feasible? | feasibility_verdict, estimated_hours | Mandatory arithmetic first (v1.4), then LLM judgment with numbers |
| Q5: Code? | code_url | WebSearch + verification |
| Q6: Worth it? | research_value | Holistic judgment across dimensions |
Output Format
Data Files (in .research/scouts/{venue}/)
raw-papers.json ← Stage 1: raw API data
screened.json ← Stage 2: keyword-filtered candidates
papers-analyzed.json ← Stage 4/5/5.5/5.6: QA results + mechanical verification
report.md ← Stage 6: human-readable report
metadata.json ← Pipeline metadata (version, model, timestamps)
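The pipeline writes these files through `atomic_json_write` from research_utils, whose implementation is not shown here. A common way to get an atomic JSON write is to write to a temporary file in the same directory and then rename it over the target. The function below is a sketch of that general technique under a hypothetical name, not the actual helper.

```python
import json
import os
import tempfile


def atomic_json_write_sketch(data, path: str) -> None:
    """Write JSON to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
            f.flush()
            os.fsync(f.fileno())      # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)    # atomic when both paths are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)       # never leave a partial temp file behind
        raise


# Round-trip check in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "papers-analyzed.json")
    atomic_json_write_sketch([{"paper_id": "x"}], target)
    with open(target, encoding="utf-8") as f:
        loaded = json.load(f)
    leftovers = [n for n in os.listdir(d) if n.endswith(".tmp")]
print(loaded, leftovers)
```

With this pattern a reader sees either the old file or the complete new one, never a half-written papers-analyzed.json, which matters when a merge is interrupted.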
Report Format (report.md)
# Scout Report: {venue} {year}
## Summary
- Total papers: N
- After keyword screening: M
- After QA analysis: K
- Feasible: J
- Tight: L
- Not feasible: P
- Insufficient info: Q
## Top 10 Opportunities
| Rank | Title | Verdict | Est. Hours | Training? | GPU Setup | Code | Value |
|------|-------|---------|-----------|-----------|-----------|------|-------|
## Topic Clusters
### Cluster 1: {topic_name}
- Paper A: verdict, reasoning, code status, value
- Paper B: ...
## Budget Analysis
- Feasible papers: N (with reasoning)
- Tight papers: M (what makes them tight)
- Not feasible: K (why not)
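The Summary block of the report can be derived directly from papers-analyzed.json. A minimal sketch follows; the verdict strings other than "insufficient_info" (`feasible`, `tight`, `not_feasible`) are assumed spellings inferred from the report labels, and `summary_lines` is a hypothetical helper.

```python
from collections import Counter

# Verdict key -> report label. Only "insufficient_info" is confirmed by this
# document; the other keys are assumed spellings.
VERDICT_LABELS = {
    "feasible": "Feasible",
    "tight": "Tight",
    "not_feasible": "Not feasible",
    "insufficient_info": "Insufficient info",
}


def summary_lines(total: int, screened: int, analyzed: list) -> list:
    """Build the bullet lines of the report's Summary section."""
    counts = Counter(p.get("feasibility_verdict") for p in analyzed)
    lines = [
        f"- Total papers: {total}",
        f"- After keyword screening: {screened}",
        f"- After QA analysis: {len(analyzed)}",
    ]
    lines += [f"- {label}: {counts.get(key, 0)}" for key, label in VERDICT_LABELS.items()]
    return lines


# Made-up inputs for illustration.
analyzed = [
    {"feasibility_verdict": "feasible"},
    {"feasibility_verdict": "feasible"},
    {"feasibility_verdict": "tight"},
    {"feasibility_verdict": "insufficient_info"},
]
lines = summary_lines(2800, 300, analyzed)
print("\n".join(lines))
```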
Data Sources
| Source | Use For | Rate Limit |
|---|---|---|
| Semantic Scholar Bulk API | Paper lists + abstracts | 4500/5min (no key) or 1 RPS (with key) |
| OpenReview API v2 | Paper lists (all top venues) | ~500ms interval |
| CVF Open Access | CVPR/ICCV/ECCV PDFs | 2s interval (polite crawl) |
| arXiv | Preprint PDFs | 20/min |
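The interval-style limits in the table can be enforced with a small throttle that sleeps just long enough between requests. This is a sketch, not the pipeline's own code: the `Throttle` class and the `INTERVALS_S` mapping are hypothetical, and the arXiv value converts 20/min to one request every 3 seconds.

```python
import time

# Minimum seconds between requests, derived from the table above.
INTERVALS_S = {"cvf": 2.0, "openreview": 0.5, "arxiv": 60 / 20}


class Throttle:
    """Block in wait() until at least min_interval_s has passed since the previous call."""

    def __init__(self, min_interval_s: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() immediately before each request to that source.
cvf_throttle = Throttle(INTERVALS_S["cvf"])
cvf_throttle.wait()  # first call returns immediately
```

The clock and sleep functions are injectable so the behavior can be tested without real delays.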
Related
- Agent: opportunity-scorer, which executes the QA-based evaluation pipeline
- Keywords: ~/.claude/research-pipeline/keywords/ (domain + cost signal keyword lists)
- Command: /research --scout (entry point)