Docling Toolkit for Claude Code

Expert guidance and tooling for document extraction using IBM's Docling library.

Overview

The Docling Toolkit plugin provides comprehensive support for using Docling to extract structured data from documents. It helps you convert PDFs, HTML, and other document formats into clean, citation-rich JSONL files ready for downstream AI processing.

What is Docling?

Docling is an open-source document processing library developed at IBM Research and donated to the LF AI & Data Foundation. It transforms complex documents into structured, machine-readable data with:

Structure-aware chunking: Preserves document hierarchy (sections, paragraphs, tables, figures)
Rich metadata extraction: Automatically captures page numbers, section titles, layout information
Granite model support: Enhanced processing for scanned documents and complex layouts
Enterprise-grade quality: Battle-tested at IBM, 42K+ GitHub stars, 1.5M monthly downloads

Features

Skills (AI-Invoked Autonomously)

Claude will automatically help with Docling when you:

docling-fundamentals: Ask about Docling installation, basic usage, or when to use Docling
docling-chunking: Discuss chunking strategies, HybridChunker vs HierarchicalChunker, metadata extraction
docling-advanced: Work with Granite model, scanned PDFs, complex documents, or performance optimization

Commands (User-Invoked)

/docling-scaffold-processor - Generate production-ready document processing script
/docling-init-project - Initialize Docling extraction project structure
/docling-validate-extracts - Validate extract quality and metadata completeness

Agents (Specialized Assistance)

docling-advisor: Recommends Docling configuration and workflow design for your use case
script-advisor: Helps debug and customize generated Docling scripts

Installation

Prerequisites

Claude Code installed
Python 3.11+ with uv package manager

Docling library:

uv add docling
# or
pip install docling

Install Plugin

# From the Claude-Plugins directory
claude plugin install ./docling-toolkit --scope user

# Or use absolute path
claude plugin install /Users/orlandobruno/Documents/Dev/Claude-Plugins/docling-toolkit --scope user

Verify Installation

claude plugin list
# Should show "docling-toolkit" in the list

Quick Start

1. Initialize a Project

# In your project directory
/docling-init-project my-document-extraction
cd my-document-extraction

This creates:

my-document-extraction/
├── README.md
├── config/
│   └── docling-config.yaml
├── data/
│   ├── raw/              # Place your PDFs/HTML here
│   └── processed/
├── extracts/             # Docling output (JSONL)
├── scripts/
│   ├── process_documents.py
│   └── validate_extracts.py
├── logs/
└── .env.example

2. Generate Processing Script

/docling-scaffold-processor process_documents --input-types pdf,html

This generates a production-ready Python script with:

Docling HybridChunker integration
CLI arguments (--input-dir, --output-file, --granite)
Progress tracking and error handling
Metadata extraction
JSONL output format

3. Process Documents

# Place PDFs in data/raw/ then run:
uv run python scripts/process_documents.py \
  --input-dir data/raw \
  --output-file extracts/output.jsonl

4. Validate Extracts

/docling-validate-extracts extracts/output.jsonl

Gets a quality report with:

Metadata completeness check
Structure preservation validation
Statistics (avg chunk size, source distribution)
Quality metrics and recommendations

Usage Examples

Example 1: Extract from Research Papers

# Initialize project
/docling-init-project research-extraction

# Generate processor
/docling-scaffold-processor extract_papers

# Process PDFs
uv run python scripts/extract_papers.py \
  --input-dir data/papers/ \
  --output-file extracts/papers.jsonl

Example 2: Process Scanned Documents (with Granite)

# Generate processor with Granite support
/docling-scaffold-processor process_scans --granite

# Process with Granite model for better OCR
uv run python scripts/process_scans.py \
  --input-dir data/scanned/ \
  --output-file extracts/scanned.jsonl \
  --granite

Example 3: Extract from HTML Documents

# Generate processor for HTML
/docling-scaffold-processor extract_html --input-types html

# Process HTML files
uv run python scripts/extract_html.py \
  --input-dir data/web_content/ \
  --output-file extracts/web.jsonl

Integration with Other Tools

Docling extracts (JSONL format) work seamlessly with:

BAML Toolkit

# After extracting with Docling, process with BAML
/baml-toolkit:batch-gemini GenerateProfile \
  extracts/output.jsonl \
  --output profiles.json

LangChain

from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="extracts/output.jsonl",
    jq_schema=".text",
    metadata_func=lambda record, metadata: {
        **metadata,
        "source": record.get("origin"),
        "page": record.get("page_number")
    }
)
docs = loader.load()

Custom Processing

import json

with open("extracts/output.jsonl") as f:
    for line in f:
        extract = json.loads(line)
        print(f"Source: {extract['origin']}, Page: {extract['page_number']}")
        print(f"Content: {extract['text']}")

Configuration

Create .claude/docling-toolkit.local.md in your project for custom settings:

---
# Docling Configuration
chunker_type: hybrid  # or: hierarchical
export_mode: doc_chunks  # or: markdown
use_granite_model: false

# Metadata Configuration
metadata_fields:
  - page_number
  - section_title
  - doc_items
  - origin
  - extraction_date

# Output Configuration
output_format: jsonl
output_directory: ./extracts

# Script Generation Preferences
include_progress_tracking: true
include_validation: true
include_granite_support: true
python_style: production
---

Docling Chunking Strategies

HybridChunker (Default)

Combines hierarchical chunking with token-aware sizing
Best for: RAG applications, embedding models with token limits
Output: Semantically meaningful chunks optimally sized for embeddings

HierarchicalChunker

Pure structure-based chunking (sections, paragraphs, tables)
Best for: Preserving document structure, citation-heavy workflows
Output: Chunks following document hierarchy

When to Use Each

Use Case	Recommended Chunker
RAG with vector search	HybridChunker
Document analysis	HierarchicalChunker
Citation tracking	HierarchicalChunker
Embedding models	HybridChunker
Profile synthesis	HybridChunker

Metadata Captured

Docling automatically extracts:

origin: Source document information
page_number: Page location
section_title: Section/heading context
doc_items: Structural elements (paragraphs, tables, figures)
extraction_date: Processing timestamp
layout: Visual layout information

Troubleshooting

"Docling not found"

uv add docling
# or
pip install docling

"Granite model not available"

# Granite requires additional dependencies
uv add docling[granite]

"Out of memory when processing large PDFs"

Process files individually instead of batch
Use --batch-size flag in generated scripts
Consider splitting very large PDFs

Generated script errors

Ask Claude for help:

"Can you help debug my Docling processing script? I'm getting [error]"

The script-advisor agent will assist with debugging and customization.

Best Practices

Start with clean PDFs: Born-digital PDFs work best; use Granite for scanned docs
Validate early: Run validation after processing first batch to catch issues
Preserve metadata: Always include source attribution for downstream use
Use JSONL format: Streamable, append-friendly, works with all tools
Process incrementally: Don't wait to process all docs at once
Test chunking strategies: Compare HybridChunker vs HierarchicalChunker on sample docs

Workflow Integration

Typical Document Processing Pipeline

1. Raw Documents (PDF, HTML)
   ↓
2. Docling Extraction (this plugin)
   - Structure-aware chunking
   - Metadata extraction
   - JSONL output
   ↓
3. Downstream Processing (other tools)
   - BAML: Structured extraction
   - LangChain: Vector store ingestion
   - Custom: Direct processing
   ↓
4. Final Output (profiles, search, analysis)

Contributing

This plugin is maintained by Orlando Bruno. Suggestions and improvements welcome!

Resources

Version History

0.1.0 (Current)

Initial release
3 skills: fundamentals, chunking, advanced
3 commands: scaffold-processor, init-project, validate-extracts
2 agents: docling-advisor, script-advisor
Production-ready script generation
JSONL output format
Granite model support

License

MIT License - Free for personal and commercial use.

Last Updated: 2025-12-27