Docling Toolkit for Claude Code
Expert tools for document extraction using IBM's Docling library.
Overview
The Docling Toolkit plugin provides comprehensive support for using Docling to extract structured data from documents. It helps you convert PDFs, HTML, and other document formats into clean, citation-rich JSONL files ready for downstream AI processing.
What is Docling?
Docling is an open-source document processing library developed at IBM Research and donated to the LF AI & Data Foundation. It transforms complex documents into structured, machine-readable data with:
- Structure-aware chunking: Preserves document hierarchy (sections, paragraphs, tables, figures)
- Rich metadata extraction: Automatically captures page numbers, section titles, layout information
- Granite model support: Enhanced processing for scanned documents and complex layouts
- Enterprise-grade quality: Battle-tested at IBM, 42K+ GitHub stars, 1.5M monthly downloads
Features
Skills (AI-Invoked Autonomously)
Claude will automatically help with Docling when you:
- docling-fundamentals: Ask about Docling installation, basic usage, or when to use Docling
- docling-chunking: Discuss chunking strategies, HybridChunker vs HierarchicalChunker, metadata extraction
- docling-advanced: Work with the Granite model, scanned PDFs, complex documents, or performance optimization
Commands (User-Invoked)
- /docling-scaffold-processor: Generate a production-ready document processing script
- /docling-init-project: Initialize a Docling extraction project structure
- /docling-validate-extracts: Validate extract quality and metadata completeness
Agents (Specialized Assistance)
- docling-advisor: Recommends Docling configuration and workflow design for your use case
- script-advisor: Helps debug and customize generated Docling scripts
Installation
Prerequisites
- Claude Code installed
- Python 3.11+ with the uv package manager
- Docling library:
uv add docling # or pip install docling
Install Plugin
# From the Claude-Plugins directory
claude plugin install ./docling-toolkit --scope user
# Or use absolute path
claude plugin install /Users/orlandobruno/Documents/Dev/Claude-Plugins/docling-toolkit --scope user
Verify Installation
claude plugin list
# Should show "docling-toolkit" in the list
Quick Start
1. Initialize a Project
# In your project directory
/docling-init-project my-document-extraction
cd my-document-extraction
This creates:
my-document-extraction/
├── README.md
├── config/
│ └── docling-config.yaml
├── data/
│ ├── raw/ # Place your PDFs/HTML here
│ └── processed/
├── extracts/ # Docling output (JSONL)
├── scripts/
│ ├── process_documents.py
│ └── validate_extracts.py
├── logs/
└── .env.example
2. Generate Processing Script
/docling-scaffold-processor process_documents --input-types pdf,html
This generates a production-ready Python script with:
- Docling HybridChunker integration
- CLI arguments (--input-dir, --output-file, --granite)
- Progress tracking and error handling
- Metadata extraction
- JSONL output format
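To make the feature list above concrete, here is a minimal sketch of what the skeleton of such a script could look like. It is not the code the command actually generates. The helper names (`build_parser`, `find_inputs`, `run`) and the `extract` callable are hypothetical, and the Docling call itself is left to the caller so the skeleton stays independent of any particular Docling version.

```python
import argparse
import json
import sys
from pathlib import Path
from typing import Callable, Iterable


def build_parser() -> argparse.ArgumentParser:
    """CLI with the three flags named in the README."""
    parser = argparse.ArgumentParser(description="Extract documents to JSONL")
    parser.add_argument("--input-dir", type=Path, required=True)
    parser.add_argument("--output-file", type=Path, required=True)
    # How this flag is wired into Docling is an assumption left to the caller.
    parser.add_argument("--granite", action="store_true")
    return parser


def find_inputs(input_dir: Path, suffixes: Iterable[str] = (".pdf", ".html")) -> list[Path]:
    """Collect input files recursively, in a stable order."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in input_dir.rglob("*") if p.is_file() and p.suffix.lower() in wanted
    )


def run(
    paths: list[Path],
    extract: Callable[[Path], Iterable[dict]],
    output_file: Path,
) -> tuple[int, int]:
    """Write one JSON object per chunk; return (files_ok, files_failed)."""
    ok = failed = 0
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as out:
        for i, path in enumerate(paths, start=1):
            # Progress goes to stderr so stdout stays clean.
            print(f"[{i}/{len(paths)}] {path.name}", file=sys.stderr)
            try:
                records = list(extract(path))
            except Exception as exc:  # one bad file should not stop the run
                print(f"  failed: {exc}", file=sys.stderr)
                failed += 1
                continue
            for record in records:
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
            ok += 1
    return ok, failed
```

In a real script, `extract` would wrap the Docling conversion and chunking step and yield one dictionary per chunk. Collecting a file's records before writing them means a failure halfway through a document never leaves partial output in the JSONL file.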
3. Process Documents
# Place PDFs in data/raw/ then run:
uv run python scripts/process_documents.py \
--input-dir data/raw \
--output-file extracts/output.jsonl
4. Validate Extracts
/docling-validate-extracts extracts/output.jsonl
This produces a quality report with:
- Metadata completeness check
- Structure preservation validation
- Statistics (avg chunk size, source distribution)
- Quality metrics and recommendations
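The checks listed above can be approximated in a few lines of standard-library Python. The sketch below is not the plugin's validator. The function name `validate_extracts` and the choice of required fields are assumptions, based on the field names used elsewhere in this README.

```python
import json
from collections import Counter
from statistics import mean
from typing import Iterable

# Assumed required fields, taken from the metadata names in this README.
REQUIRED_FIELDS = ("text", "origin", "page_number", "section_title")


def validate_extracts(lines: Iterable[str], required=REQUIRED_FIELDS) -> dict:
    """Summarise metadata completeness and basic statistics for JSONL lines."""
    total = 0
    invalid_json = 0
    missing = Counter()
    sources = Counter()
    lengths = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        total += 1
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            invalid_json += 1
            continue
        if not isinstance(record, dict):
            invalid_json += 1
            continue
        for field in required:
            # Treat absent, null, and empty values as missing.
            if record.get(field) in (None, "", []):
                missing[field] += 1
        sources[str(record.get("origin"))] += 1
        lengths.append(len(record.get("text") or ""))
    valid = total - invalid_json
    return {
        "records": total,
        "invalid_json": invalid_json,
        "completeness": {
            f: (1 - missing[f] / valid) if valid else 0.0 for f in required
        },
        "avg_chunk_chars": mean(lengths) if lengths else 0.0,
        "source_distribution": dict(sources),
    }
```

Passing an open file handle works directly, for example `validate_extracts(open("extracts/output.jsonl", encoding="utf-8"))`, since the function only needs an iterable of lines.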
Usage Examples
Example 1: Extract from Research Papers
# Initialize project
/docling-init-project research-extraction
# Generate processor
/docling-scaffold-processor extract_papers
# Process PDFs
uv run python scripts/extract_papers.py \
--input-dir data/papers/ \
--output-file extracts/papers.jsonl
Example 2: Process Scanned Documents (with Granite)
# Generate processor with Granite support
/docling-scaffold-processor process_scans --granite
# Process with Granite model for better OCR
uv run python scripts/process_scans.py \
--input-dir data/scanned/ \
--output-file extracts/scanned.jsonl \
--granite
Example 3: Extract from HTML Documents
# Generate processor for HTML
/docling-scaffold-processor extract_html --input-types html
# Process HTML files
uv run python scripts/extract_html.py \
--input-dir data/web_content/ \
--output-file extracts/web.jsonl
Integration with Other Tools
Docling extracts (JSONL format) work seamlessly with:
BAML Toolkit
# After extracting with Docling, process with BAML
/baml-toolkit:batch-gemini GenerateProfile \
extracts/output.jsonl \
--output profiles.json
LangChain
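from langchain_community.document_loaders import JSONLoader  # requires the jq package
loader = JSONLoader(
    file_path="extracts/output.jsonl",
    jq_schema=".",  # select the whole record so metadata_func can read its fields
    content_key="text",  # field used as the document text
    json_lines=True,  # the file holds one JSON object per line
    metadata_func=lambda record, metadata: {
        **metadata,
        "source": record.get("origin"),
        "page": record.get("page_number"),
    },
)
docs = loader.load()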
Custom Processing
import json
with open("extracts/output.jsonl", encoding="utf-8") as f:
    for line in f:
        extract = json.loads(line)
        print(f"Source: {extract['origin']}, Page: {extract['page_number']}")
        print(f"Content: {extract['text']}")
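A common next step is to regroup chunks by source document before further processing. The sketch below assumes that `origin` is a plain string and that `page_number` is an integer. Neither is guaranteed by this README, and `group_by_source` is a hypothetical helper name.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_by_source(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group JSONL records by their origin, ordered by page within each source."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            extract = json.loads(line)
            grouped[extract["origin"]].append(extract)
    for chunks in grouped.values():
        # Records without a page number sort first; the sort is stable.
        chunks.sort(key=lambda c: c.get("page_number") or 0)
    return dict(grouped)
```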
Configuration
Create .claude/docling-toolkit.local.md in your project for custom settings:
---
# Docling Configuration
chunker_type: hybrid # or: hierarchical
export_mode: doc_chunks # or: markdown
use_granite_model: false
# Metadata Configuration
metadata_fields:
- page_number
- section_title
- doc_items
- origin
- extraction_date
# Output Configuration
output_format: jsonl
output_directory: ./extracts
# Script Generation Preferences
include_progress_tracking: true
include_validation: true
include_granite_support: true
python_style: production
---
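If a generated script needs to read these settings itself, the front matter can be parsed with a YAML library. For illustration only, the sketch below handles just the flat subset used above (scalars, booleans, one-level lists, `#` comments) with the standard library. `parse_frontmatter` is a hypothetical helper, not part of the plugin, and a real YAML parser is the safer choice for anything beyond this subset.

```python
def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` pairs and simple `- item` lists between `---` markers."""
    lines = [line.strip() for line in text.splitlines()]
    try:
        start = lines.index("---")
        end = lines.index("---", start + 1)
    except ValueError:
        return {}  # no complete front matter block
    config = {}
    current_list = None
    for raw in lines[start + 1:end]:
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#"):
            continue
        if line.startswith("- ") and current_list is not None:
            current_list.append(line[2:].strip())
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value == "":
            # A key with no value opens a list.
            current_list = config.setdefault(key, [])
        else:
            current_list = None
            config[key] = {"true": True, "false": False}.get(value.lower(), value)
    return config
```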
Docling Chunking Strategies
HybridChunker (Default)
- Combines hierarchical chunking with token-aware sizing
- Best for: RAG applications, embedding models with token limits
- Output: Semantically meaningful chunks optimally sized for embeddings
HierarchicalChunker
- Pure structure-based chunking (sections, paragraphs, tables)
- Best for: Preserving document structure, citation-heavy workflows
- Output: Chunks following document hierarchy
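The difference between the two strategies is easier to see on a toy model. The sketch below is not Docling's implementation. It counts whitespace-separated words instead of real tokenizer tokens, and the names `Chunk`, `hierarchical_chunks`, and `hybrid_chunks` are made up for illustration. It only mirrors the general idea of starting from structural chunks, splitting the oversized ones, then merging undersized neighbours that share a heading.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str
    text: str


def n_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def hierarchical_chunks(elements: list[tuple[str, str]]) -> list[Chunk]:
    """One chunk per structural element, regardless of size."""
    return [Chunk(heading, text) for heading, text in elements]


def hybrid_chunks(elements: list[tuple[str, str]], max_tokens: int) -> list[Chunk]:
    """Structural chunks, then split oversized and merge undersized ones."""
    # Pass 1: split any chunk that exceeds the token budget.
    pieces = []
    for chunk in hierarchical_chunks(elements):
        words = chunk.text.split()
        for i in range(0, len(words), max_tokens):
            pieces.append(Chunk(chunk.heading, " ".join(words[i:i + max_tokens])))
    # Pass 2: merge consecutive chunks under the same heading while they fit.
    merged = []
    for piece in pieces:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.heading == piece.heading
            and n_tokens(last.text) + n_tokens(piece.text) <= max_tokens
        ):
            last.text = f"{last.text} {piece.text}"
        else:
            merged.append(piece)
    return merged
```

With a budget of four tokens, two short "Intro" paragraphs collapse into one chunk while a seven-word "Methods" paragraph is split, which is the behaviour that makes the hybrid approach a better fit for embedding models with fixed token limits.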
When to Use Each
| Use Case | Recommended Chunker |
|---|---|
| RAG with vector search | HybridChunker |
| Document analysis | HierarchicalChunker |
| Citation tracking | HierarchicalChunker |
| Embedding models | HybridChunker |
| Profile synthesis | HybridChunker |
Metadata Captured
Docling automatically extracts:
- origin: Source document information
- page_number: Page location
- section_title: Section/heading context
- doc_items: Structural elements (paragraphs, tables, figures)
- extraction_date: Processing timestamp
- layout: Visual layout information
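As a rough illustration of where these fields could come from, the sketch below flattens one Docling chunk into a JSONL-ready dictionary. The attribute names it reads (`chunk.text`, `meta.headings`, `meta.origin.filename`, `meta.doc_items[*].prov[*].page_no`, `item.label`) reflect the chunk metadata in recent Docling releases as best understood here and should be checked against the installed version. The mapping onto the field names above, and the `chunk_to_record` helper itself, are assumptions rather than the plugin's actual code.

```python
from datetime import datetime, timezone


def chunk_to_record(chunk) -> dict:
    """Flatten a chunk object (duck-typed) into a JSON-serialisable record."""
    meta = chunk.meta
    # Each structural item may carry provenance entries with a page number.
    pages = sorted({prov.page_no for item in meta.doc_items for prov in item.prov})
    headings = meta.headings or []
    return {
        "text": chunk.text,
        "origin": meta.origin.filename if meta.origin else None,
        # Assumption: report the first page the chunk touches.
        "page_number": pages[0] if pages else None,
        # Assumption: the innermost heading is the most useful section title.
        "section_title": headings[-1] if headings else None,
        # Labels may be enum members or plain strings; keep the plain value.
        "doc_items": [getattr(item.label, "value", item.label) for item in meta.doc_items],
        "extraction_date": datetime.now(timezone.utc).isoformat(),
    }
```

Because the function only relies on attribute access, it can be exercised with simple stand-in objects in tests, without Docling installed.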
Troubleshooting
"Docling not found"
uv add docling
# or
pip install docling
"Granite model not available"
# Granite requires additional dependencies
uv add "docling[granite]" # quoted so the shell does not expand the brackets
"Out of memory when processing large PDFs"
- Process files individually instead of in batches
- Use the --batch-size flag in generated scripts
- Consider splitting very large PDFs
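One way a batch-size option can keep memory bounded is to hold only one batch of results at a time and append each batch to the output file before moving on. The sketch below shows that pattern. It is not the generated script's actual logic, and `batched`, `process_in_batches`, and the `extract` callable are hypothetical names.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def process_in_batches(
    paths: Iterable,
    extract: Callable[[object], Iterable[dict]],
    output_file: str,
    batch_size: int = 4,
) -> int:
    """Extract and append records batch by batch; return the number written."""
    written = 0
    for batch in batched(paths, batch_size):
        # Only this batch's records are held in memory.
        records = [record for path in batch for record in extract(path)]
        with open(output_file, "a", encoding="utf-8") as out:
            for record in records:
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += len(records)
    return written
```

Opening the file in append mode for each batch means an interrupted run keeps everything written so far, which fits the append-friendly nature of JSONL.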
Generated script errors
Ask Claude for help:
"Can you help debug my Docling processing script? I'm getting [error]"
The script-advisor agent will assist with debugging and customization.
Best Practices
- Start with clean PDFs: Born-digital PDFs work best; use Granite for scanned docs
- Validate early: Run validation after processing the first batch to catch issues
- Preserve metadata: Always include source attribution for downstream use
- Use JSONL format: Streamable, append-friendly, works with all tools
- Process incrementally: Don't wait to process all docs at once
- Test chunking strategies: Compare HybridChunker vs HierarchicalChunker on sample docs
Workflow Integration
Typical Document Processing Pipeline
1. Raw Documents (PDF, HTML)
↓
2. Docling Extraction (this plugin)
- Structure-aware chunking
- Metadata extraction
- JSONL output
↓
3. Downstream Processing (other tools)
- BAML: Structured extraction
- LangChain: Vector store ingestion
- Custom: Direct processing
↓
4. Final Output (profiles, search, analysis)
Contributing
This plugin is maintained by Orlando Bruno. Suggestions and improvements welcome!
Version History
0.1.0 (Current)
- Initial release
- 3 skills: fundamentals, chunking, advanced
- 3 commands: scaffold-processor, init-project, validate-extracts
- 2 agents: docling-advisor, script-advisor
- Production-ready script generation
- JSONL output format
- Granite model support
License
MIT License - Free for personal and commercial use.
Last Updated: 2025-12-27