Idle Harness

GAN-inspired multi-agent system that autonomously builds full-stack web apps from a single prompt using Claude AI agents

What is Idle Harness?

Idle Harness is an autonomous multi-agent coding system inspired by GAN (Generative Adversarial Network) architecture. It takes a short natural-language prompt and automatically generates a complete full-stack web application — frontend, backend, database, and styling — without human intervention.

The system orchestrates three specialized AI agents (Planner, Generator, and Evaluator) that collaborate through a structured build-evaluate-iterate loop. Like a GAN's generator-discriminator dynamic, the Generator builds the application while the Evaluator tests it as a real user would — without ever reading the source code. This adversarial relationship drives quality: the Generator can't cut corners because the Evaluator will catch it.

Idle Harness is built on Anthropic's harness design for long-running apps and is powered by the Claude Agent SDK.

Quick Start

git clone https://github.com/jhlee0409/idle-harness.git
cd idle-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Interactive setup — configures auth automatically
python orchestrator.py --setup

# Build an app
python orchestrator.py "A tarot reading web app with card-draw animations and AI interpretations"

# After build — start the app
python orchestrator.py serve

That's it. If anything is missing, the harness detects it and offers to fix it automatically.

Setup

Idle Harness includes a built-in interactive setup that detects and configures all dependencies.

Option A: Auto-detect on first run

Just run the harness. If dependencies are missing, it offers to fix them:

$ python orchestrator.py "my app idea"

Preflight checks
  ✓ claude_agent_sdk
  ✓ node (v20.11.0)
  ✓ npm (10.2.4)
  ✓ git (2.43.0)
  ✗ auth — No auth configured
  ✓ MCP: playwright (SDK-managed via npx)

  Fix 1 issue(s) automatically? [Y/n]: y

  Choose auth method:
    [o] OAuth login (uses subscription quota)
    [a] API key (pay per use, no quota limit)
  Choose: o
  → Running: claude login
  ✓ OAuth authenticated

All issues fixed

Option B: Explicit setup

python orchestrator.py --setup

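Runs all checks and applies every available auto-fix without asking for confirmation. Checks that are not auto-fixable (`node`, `npm`, `git`, Claude CLI; see "What gets checked") still require a manual install.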

Option C: CI / Non-interactive

CI=1 python orchestrator.py "my app idea"

Fails hard with exit code 1 if any dependency is missing. No interactive prompts.
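
The three options differ mainly in what happens once a check fails. A minimal sketch of that decision, assuming a hypothetical `decide_action` helper and assuming that `CI` takes precedence over `--setup`; the actual logic in `orchestrator.py` may be organized differently:

```python
def decide_action(issues, env, setup=False, ask=input):
    """Decide what to do with failed preflight checks.

    Hypothetical helper for illustration only. Returns one of
    "ok", "fail", "fix", or "abort".
    """
    if not issues:
        return "ok"
    if env.get("CI"):
        # Option C: non-interactive, so never prompt; the caller exits with code 1.
        return "fail"
    if setup:
        # Option B: explicit setup applies fixes without asking.
        return "fix"
    # Option A: ask first; an empty answer accepts the default (Y).
    answer = ask(f"Fix {len(issues)} issue(s) automatically? [Y/n]: ")
    return "fix" if answer.strip().lower() in ("", "y", "yes") else "abort"
```

Passing `env` and `ask` as parameters (rather than reading `os.environ` and calling `input` directly) keeps the decision testable without a terminal.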

Other commands

# Start the last-built app (opens browser automatically)
python orchestrator.py serve

# Clean runtime artifacts for a fresh run
python orchestrator.py clean

# Clean everything including generated apps
python orchestrator.py clean --all

What gets checked

| Check | Auto-fixable | How |
|---|---|---|
| `claude_agent_sdk` | Yes | `pip install claude-agent-sdk` |
| `node`, `npm`, `git` | No | Prints install link |
| Claude CLI | No | Prints install link |
| Auth (OAuth or API key) | Yes | `claude login` or API key input |
| Playwright MCP | Automatic | SDK launches via npx; no user config needed |
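
For a sense of how the package and executable checks in this table could be detected, here is a rough sketch using only the standard library. The function name `run_checks` and its injectable parameters are assumptions for illustration; the harness's own preflight code is not shown in this README and may work differently. Auth and Playwright MCP checks are omitted because the README does not say how they are detected.

```python
import importlib.util
import shutil


def run_checks(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return {check_name: passed} for the package and executable checks.

    Hypothetical sketch: `which` and `find_spec` are injectable so the
    function can be exercised without touching the real environment.
    """
    results = {
        # Python package check: can the SDK module be located for import?
        "claude_agent_sdk": find_spec("claude_agent_sdk") is not None,
    }
    # Executable checks: is each tool on PATH?
    for tool in ("node", "npm", "git", "claude"):
        results[tool] = which(tool) is not None
    return results


if __name__ == "__main__":
    for name, passed in run_checks().items():
        print(("✓" if passed else "✗"), name)
```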

Authentication options

| Method | How to configure | Cost model |
|---|---|---|
| OAuth | `claude login` (interactive setup handles this) | Uses subscription quota (Pro/Max plan) |
| API key | Set `ANTHROPIC_API_KEY` env var | Pay per token, no quota limit |

How It Works

User Prompt (1-4 sentences)
    ↓
┌─────────┐     ┌───────────┐     ┌───────────┐
│ Planner │ ──→ │ Generator │ ←─→ │ Evaluator │
│         │     │           │     │           │
│ Spec    │     │ React+TS  │     │ Browser   │
│ Design  │     │ Vite      │     │ Testing   │
│ Language│     │ FastAPI   │     │ Screenshot│
│         │     │ SQLite    │     │ Grading   │
└─────────┘     └───────────┘     └───────────┘
                      ↕
              Build → Evaluate → Feedback Loop (max 3 rounds)

1. Plan: The Planner reads the frontend design skill, then expands the prompt into a full product spec with a visual design language.
2. Negotiate: The Generator and Evaluator negotiate sprint contracts with testable criteria.
3. Build: The Generator implements the full-stack app (React + Vite + TypeScript frontend, FastAPI + SQLite backend), then writes and runs tests. A continuous session preserves context across retries.
4. Evaluate: The Evaluator tests the running app via Playwright, grading on product depth, functionality, visual design, and code quality.
5. Iterate: On FAIL, feedback is returned to the Generator for another attempt (up to 3 rounds).
6. Integration: After all sprints, a final cross-sprint evaluation verifies that the complete application works together.
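
The Build, Evaluate, and Iterate steps form a bounded retry loop. The sketch below shows one way to express that loop; `build_with_feedback`, `build`, and `evaluate` are hypothetical names standing in for the Generator and Evaluator agents, not the harness's actual API.

```python
from typing import Callable, Optional, Tuple


def build_with_feedback(
    build: Callable[[Optional[str]], None],
    evaluate: Callable[[], Tuple[bool, str]],
    max_build_attempts: int = 3,
) -> Tuple[bool, int]:
    """Run build -> evaluate until PASS or the attempt budget is spent.

    `build` receives the previous round's feedback (None on the first
    round); `evaluate` returns (passed, feedback). Returns (passed,
    number_of_attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_build_attempts + 1):
        build(feedback)                  # Generator acts on the last feedback
        passed, feedback = evaluate()    # Evaluator grades the running app
        if passed:
            return True, attempt
    return False, max_build_attempts
```

The key property is that the only channel from Evaluator to Generator is the feedback string, which matches the description of the Evaluator never touching the source.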

The GAN Principle

The Evaluator never reads source code. It can only interact with the running application through a browser — clicking buttons, filling forms, taking screenshots. This mirrors how a GAN's discriminator only sees the output, never the generator's internals. The result: the Generator must produce genuinely working software, not just code that looks correct.

Agents

| Agent | Role | Key Behavior |
|---|---|---|
| Planner | Prompt → Product Spec | Reads frontend design skill, defines visual design language, explores AI integration, high-level technical design (no implementation details) |
| Generator | Spec → Full-Stack Implementation | React+Vite+TypeScript+FastAPI+SQLite, writes tests (pytest+vitest), self-evaluates before handoff |
| Evaluator | Browser-Tests the Running App | Never reads source code (GAN principle), screenshot evidence, detects stubs/fakes, grades on 4 full-stack criteria |

Evaluation Criteria

| Criterion | Weight | Description |
|---|---|---|
| Product Depth | High | Are features complete and real, or surface-level stubs? |
| Functionality | High | Do core interactions work end-to-end with database persistence? |
| Visual Design | Normal | Does the app match the spec's visual design language? |
| Code Quality | Normal | Stability, error handling, edge case behavior |
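
To illustrate how "High" and "Normal" weights might combine into a single grade, here is a small weighted-average sketch. The numeric weights (2.0 and 1.0), the score scale, and the function name are all assumptions; the README does not state how the Evaluator actually aggregates its grades or where the pass threshold lies.

```python
# Assumed numeric weights: "High" -> 2.0, "Normal" -> 1.0 (not from the source).
WEIGHTS = {
    "product_depth": 2.0,
    "functionality": 2.0,
    "visual_design": 1.0,
    "code_quality": 1.0,
}


def weighted_score(scores):
    """Return the weighted average of per-criterion scores.

    `scores` maps each criterion name in WEIGHTS to a number on any
    consistent scale (for example 0-10).
    """
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight
```

Under these assumed weights, weak functionality drags the overall score down twice as much as an equally weak visual design.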

Configuration

Editable in `config.py`:

| Setting | Default | Description |
|---|---|---|
| `mode` | `full` | `full` (sprints + contracts + iteration) / `simple` (single build + eval) |
| `max_build_attempts` | `3` | Max build→evaluate retry rounds |
| `max_negotiation_rounds` | `3` | Max contract negotiation rounds |
| `generator_max_turns` | `200` | Max turns for Generator agent |
| `dev_server_url` | `http://localhost:5173` | Frontend server URL |
| `mcp_tool` | `playwright` | Evaluator browser testing tool |
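
As a reading aid, the settings and defaults from the table can be pictured as a single dataclass. This is only a sketch: the class name `HarnessConfig` and the validation step are assumptions, and the real `config.py` may store these values in a different form.

```python
from dataclasses import dataclass


@dataclass
class HarnessConfig:
    """Hypothetical container mirroring the settings table above."""

    mode: str = "full"                       # "full" or "simple"
    max_build_attempts: int = 3              # build -> evaluate retry rounds
    max_negotiation_rounds: int = 3          # contract negotiation rounds
    generator_max_turns: int = 200           # turn budget for the Generator
    dev_server_url: str = "http://localhost:5173"
    mcp_tool: str = "playwright"             # Evaluator browser testing tool

    def __post_init__(self):
        # Reject modes the table does not list.
        if self.mode not in ("full", "simple"):
            raise ValueError(f"unknown mode: {self.mode!r}")
```

For example, `HarnessConfig(mode="simple", max_build_attempts=2)` would describe a faster run with a smaller retry budget.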

Project Structure

idle-harness/
├── orchestrator.py      # Main orchestration loop + preflight + setup
├── cli.py               # Claude Agent SDK wrapper
├── config.py            # Settings (mode, servers, limits)
├── state.py             # State management (status.json)
├── server.py            # Dev server start/stop
├── sprint.py            # Sprint parsing
├── agents/
│   ├── planner.md                # Planner system prompt
│   ├── generator.md              # Generator system prompt
│   ├── evaluator.md              # Evaluator system prompt
│   └── frontend-design-skill.md  # Design skill (Planner reads at runtime)
├── tests/               # pytest tests
├── comms/               # Runtime artifacts (spec, contracts, evaluations)
└── output/              # Generated applications

FAQ

What can I build with Idle Harness?

Any full-stack web application that can be described in a few sentences. Examples: a tarot reading app with AI interpretations, an AI-powered bookmark manager, a recipe finder with dietary filters, a personal finance tracker, a kanban board with drag-and-drop.

How is this different from other AI code generators?

Most AI code generators produce code in a single pass. Idle Harness uses a multi-agent adversarial loop: one agent builds, another independently evaluates the running application (not the code), and feedback drives iterative improvement. This is closer to how a development team works — with separate roles for implementation and quality assurance.

How long does it take?

Typical: 30 minutes to 2 hours depending on complexity. A 4-sprint app with retries can take 3+ hours but produces significantly better results than a single-pass generation.

What does it cost?

Depends on your auth method:

- OAuth (subscription): Uses your Claude Pro/Max quota. A typical 2-sprint app uses roughly 30-60 minutes of agent time.
- API key: Pay per token. A typical run costs $50-200 depending on complexity and retries.

Does the `simple` mode skip sprints?

Yes. `simple` mode builds the entire app in one pass and runs a single evaluation. It still retries up to 3 times on failure, but skips sprint decomposition and contract negotiation. Good for simpler apps or faster iteration.
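
The difference between the two modes can be summarized as the list of phases that run after planning. The sketch below is illustrative only; `phases_for` is a hypothetical helper and the phase labels are invented for readability, not taken from the harness.

```python
def phases_for(mode, sprint_count=1):
    """List the post-planning phases for a given mode (illustrative only)."""
    if mode == "simple":
        # One build/evaluate loop for the whole app; no sprints, no contracts.
        return ["build and evaluate whole app"]
    if mode != "full":
        raise ValueError(f"unknown mode: {mode!r}")
    phases = []
    for i in range(1, sprint_count + 1):
        phases.append(f"sprint {i}: negotiate contract")
        phases.append(f"sprint {i}: build and evaluate")
    # After all sprints, verify the complete application works together.
    phases.append("integration evaluation")
    return phases
```

In both modes, each "build and evaluate" phase is itself subject to the retry budget (`max_build_attempts`).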

License

MIT
