Idle Harness

Idle Harness is a GAN-inspired multi-agent system that autonomously builds full-stack web apps from a single prompt.

What is Idle Harness?

Idle Harness is an autonomous multi-agent coding system inspired by GAN (Generative Adversarial Network) architecture. It takes a short natural-language prompt and automatically generates a complete full-stack web application — frontend, backend, database, and styling — without human intervention.

The system orchestrates three specialized AI agents (Planner, Generator, and Evaluator) that collaborate through a structured build-evaluate-iterate loop. Like a GAN's generator-discriminator dynamic, the Generator builds the application while the Evaluator tests it as a real user would — without ever reading the source code. This adversarial relationship drives quality: the Generator can't cut corners because the Evaluator will catch it.

Built on Anthropic's harness design for long-running apps and powered by the Claude Agent SDK.

Quick Start

git clone https://github.com/jhlee0409/idle-harness.git
cd idle-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Interactive setup — configures auth automatically
python orchestrator.py --setup

# Build an app
python orchestrator.py "A tarot reading web app with card-draw animations and AI interpretations"

# After build — start the app
python orchestrator.py serve

That's it. If anything is missing, the harness detects it and offers to fix it automatically.

Setup

Idle Harness includes a built-in interactive setup that checks every dependency and configures the ones it can fix automatically.

Option A: Auto-detect on first run

Just run the harness. If dependencies are missing, it offers to fix them:

$ python orchestrator.py "my app idea"

Preflight checks
  ✓ claude_agent_sdk
  ✓ node (v20.11.0)
  ✓ npm (10.2.4)
  ✓ git (2.43.0)
  ✗ auth — No auth configured
  ✓ MCP: playwright (SDK-managed via npx)

  Fix 1 issue(s) automatically? [Y/n]: y

  Choose auth method:
    [o] OAuth login (uses subscription quota)
    [a] API key (pay per use, no quota limit)
  Choose: o
  → Running: claude login
  ✓ OAuth authenticated

All issues fixed

Option B: Explicit setup

python orchestrator.py --setup

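Runs all checks and applies every available automatic fix without asking for confirmation. Dependencies that are not auto-fixable (node, npm, git, Claude CLI) still require a manual install.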

Option C: CI / Non-interactive

CI=1 python orchestrator.py "my app idea"

Fails hard with exit code 1 if any dependency is missing. No interactive prompts.
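
The three options differ only in what happens after a failed check. A minimal sketch of that decision, assuming hypothetical names (`Check`, `run_preflight`, `next_action`) that are not taken from `orchestrator.py`; only the `CI` variable and the exit-on-failure behavior come from the description above, and treating any non-empty `CI` value as CI mode is an assumption:

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    ok: bool
    auto_fixable: bool


def run_preflight(required_tools=("node", "npm", "git")):
    """Probe each CLI tool on PATH (hypothetical helper, not the harness's own)."""
    return [
        Check(name=tool, ok=shutil.which(tool) is not None, auto_fixable=False)
        for tool in required_tools
    ]


def next_action(checks, env=None):
    """Decide what to do after preflight: 'run', 'exit', or 'prompt'."""
    env = os.environ if env is None else env
    if all(check.ok for check in checks):
        return "run"
    if env.get("CI"):  # assumption: any non-empty CI value means non-interactive
        return "exit"  # the caller would then call sys.exit(1)
    return "prompt"    # interactive path: offer to fix the failed checks
```

Passing `env` explicitly keeps the decision testable without touching the real environment.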

Other commands

# Start the last-built app (opens browser automatically)
python orchestrator.py serve

# Clean runtime artifacts for a fresh run
python orchestrator.py clean

# Clean everything including generated apps
python orchestrator.py clean --all

What gets checked

| Check | Auto-fixable | How |
|---|---|---|
| claude_agent_sdk | Yes | pip install claude-agent-sdk |
| node, npm, git | No | Prints install link |
| Claude CLI | No | Prints install link |
| Auth (OAuth or API key) | Yes | claude login or API key input |
| Playwright MCP | Automatic | SDK launches via npx; no user config needed |

Authentication options

| Method | How to configure | Cost model |
|---|---|---|
| OAuth | claude login (interactive setup handles this) | Uses subscription quota (Pro/Max plan) |
| API key | Set ANTHROPIC_API_KEY env var | Pay per token, no quota limit |

How It Works

User Prompt (1-4 sentences)
    ↓
┌─────────┐     ┌───────────┐     ┌───────────┐
│ Planner │ ──→ │ Generator │ ←─→ │ Evaluator │
│         │     │           │     │           │
│ Spec    │     │ React+TS  │     │ Browser   │
│ Design  │     │ Vite      │     │ Testing   │
│ Language│     │ FastAPI   │     │ Screenshot│
│         │     │ SQLite    │     │ Grading   │
└─────────┘     └───────────┘     └───────────┘
                      ↕
              Build → Evaluate → Feedback Loop (max 3 rounds)

  1. Plan: the Planner reads the frontend design skill, then expands the prompt into a full product spec with a visual design language.
  2. Negotiate: the Generator and Evaluator agree on sprint contracts with testable criteria (up to 3 negotiation rounds by default).
  3. Build: the Generator implements the full-stack app (React + TypeScript frontend, FastAPI backend), then writes and runs tests. A continuous session preserves context across retries.
  4. Evaluate: the Evaluator tests the running app via Playwright and grades it on product depth, functionality, visual design, and code quality.
  5. Iterate: on FAIL, the feedback goes back to the Generator for another attempt (up to 3 rounds).
  6. Integrate: after all sprints, a final cross-sprint evaluation verifies that the complete application works together.
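
The build, evaluate, and iterate steps can be sketched as a bounded retry loop. This is an illustration under stated assumptions, not the code in `orchestrator.py`: `Evaluation`, `build_until_pass`, and the stub agents are hypothetical names, and the real agents are Claude sessions rather than plain functions. The evaluator callback receives only the app URL, which mirrors the rule that it never sees source code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    passed: bool
    feedback: str


def build_until_pass(
    build: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Evaluation],
    max_build_attempts: int = 3,
) -> tuple[bool, int]:
    """Run build -> evaluate rounds until PASS or the attempt limit.

    `build` receives the previous round's feedback (None on the first round)
    and returns the URL of the running app. `evaluate` sees only that URL.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_build_attempts + 1):
        app_url = build(feedback)
        result = evaluate(app_url)
        if result.passed:
            return True, attempt
        feedback = result.feedback  # returned to the Generator on FAIL
    return False, max_build_attempts


# Stub agents: the app only passes on the third build.
builds_seen = []

def fake_build(feedback):
    builds_seen.append(feedback)
    return "http://localhost:5173"

def fake_evaluate(url):
    return Evaluation(passed=len(builds_seen) >= 3, feedback="saved cards are lost on reload")

print(build_until_pass(fake_build, fake_evaluate))  # (True, 3)
```

If the limit is reached without a PASS, the function reports failure instead of looping forever, which is the role `max_build_attempts` plays in the configuration.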

The GAN Principle

The Evaluator never reads source code. It can only interact with the running application through a browser — clicking buttons, filling forms, taking screenshots. This mirrors how a GAN's discriminator only sees the output, never the generator's internals. The result: the Generator must produce genuinely working software, not just code that looks correct.

Agents

| Agent | Role | Key Behavior |
|---|---|---|
| Planner | Prompt → Product Spec | Reads frontend design skill, defines visual design language, explores AI integration, produces a high-level technical design (no implementation details) |
| Generator | Spec → Full-Stack Implementation | React+Vite+TypeScript+FastAPI+SQLite, writes tests (pytest+vitest), self-evaluates before handoff |
| Evaluator | Browser-Tests the Running App | Never reads source code (GAN principle), collects screenshot evidence, detects stubs/fakes, grades on 4 full-stack criteria |

Evaluation Criteria

| Criterion | Weight | Description |
|---|---|---|
| Product Depth | High | Are features complete and real, or surface-level stubs? |
| Functionality | High | Do core interactions work end-to-end with database persistence? |
| Visual Design | Normal | Does the app match the spec's visual design language? |
| Code Quality | Normal | Stability, error handling, edge case behavior |
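
One way to read the High/Normal weights is as a weighted mean over per-criterion scores. The sketch below is only an illustration: the numeric weights (2.0 and 1.0), the score scale, and the function name `weighted_score` are all assumptions, since the criteria table gives only the labels "High" and "Normal".

```python
# Assumed numeric values for the two weight labels in the criteria table.
WEIGHTS = {"High": 2.0, "Normal": 1.0}

CRITERIA = {
    "product_depth": "High",
    "functionality": "High",
    "visual_design": "Normal",
    "code_quality": "Normal",
}


def weighted_score(scores):
    """Weighted mean of per-criterion scores; output uses the input's scale."""
    total_weight = sum(WEIGHTS[level] for level in CRITERIA.values())
    weighted_sum = sum(
        scores[name] * WEIGHTS[level] for name, level in CRITERIA.items()
    )
    return weighted_sum / total_weight


example = {
    "product_depth": 6,
    "functionality": 8,
    "visual_design": 9,
    "code_quality": 7,
}
print(round(weighted_score(example), 2))  # 7.33
```

With these assumed weights, a polished but shallow app scores lower than its visual grade alone would suggest, because the two High criteria carry two thirds of the total weight.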

Configuration

Editable in config.py:

| Setting | Default | Description |
|---|---|---|
| mode | full | full (sprints + contracts + iteration) / simple (single build + eval) |
| max_build_attempts | 3 | Max build→evaluate retry rounds |
| max_negotiation_rounds | 3 | Max contract negotiation rounds |
| generator_max_turns | 200 | Max turns for Generator agent |
| dev_server_url | http://localhost:5173 | Frontend server URL |
| mcp_tool | playwright | Evaluator browser testing tool |
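
For orientation, the settings above could be modeled as a small validated dataclass. This is a sketch, not the contents of `config.py`: the class name `HarnessConfig` and the validation rules are assumptions, and only the setting names and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical container mirroring the documented settings and defaults."""

    mode: str = "full"
    max_build_attempts: int = 3
    max_negotiation_rounds: int = 3
    generator_max_turns: int = 200
    dev_server_url: str = "http://localhost:5173"
    mcp_tool: str = "playwright"

    def __post_init__(self):
        # Validation is an assumption; the real config.py may not check these.
        if self.mode not in ("full", "simple"):
            raise ValueError(f"mode must be 'full' or 'simple', got {self.mode!r}")
        if self.max_build_attempts < 1:
            raise ValueError("max_build_attempts must be at least 1")


# Faster iteration: a single build + eval pass with fewer retries.
quick = HarnessConfig(mode="simple", max_build_attempts=2)
print(quick.mode, quick.max_build_attempts)  # simple 2
```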

Project Structure

idle-harness/
├── orchestrator.py      # Main orchestration loop + preflight + setup
├── cli.py               # Claude Agent SDK wrapper
├── config.py            # Settings (mode, servers, limits)
├── state.py             # State management (status.json)
├── server.py            # Dev server start/stop
├── sprint.py            # Sprint parsing
├── agents/
│   ├── planner.md                # Planner system prompt
│   ├── generator.md              # Generator system prompt
│   ├── evaluator.md              # Evaluator system prompt
│   └── frontend-design-skill.md  # Design skill (Planner reads at runtime)
├── tests/               # pytest tests
├── comms/               # Runtime artifacts (spec, contracts, evaluations)
└── output/              # Generated applications

FAQ

What can I build with Idle Harness?

Any full-stack web application that can be described in a few sentences. Examples: a tarot reading app with AI interpretations, an AI-powered bookmark manager, a recipe finder with dietary filters, a personal finance tracker, a kanban board with drag-and-drop.

How is this different from other AI code generators?

Most AI code generators produce code in a single pass. Idle Harness uses a multi-agent adversarial loop: one agent builds, another independently evaluates the running application (not the code), and feedback drives iterative improvement. This is closer to how a development team works — with separate roles for implementation and quality assurance.

How long does it take?

Typical: 30 minutes to 2 hours depending on complexity. A 4-sprint app with retries can take 3+ hours but produces significantly better results than a single-pass generation.

What does it cost?

Depends on your auth method:

  • OAuth (subscription): Uses your Claude Pro/Max quota. A typical 2-sprint app uses roughly 30-60 minutes of agent time.
  • API key: Pay per token. A typical run costs $50-200 depending on complexity and retries.

Does the simple mode skip sprints?

Yes. simple mode builds the entire app in one pass and runs a single evaluation. It still retries up to 3 times on failure, but skips sprint decomposition and contract negotiation. Good for simpler apps or faster iteration.

License

MIT