Tran: PDF Extraction - Docling vs. LlamaPars vs. Marker vs. others

Key Takeaways:

Tran ran a nice experiment on PDF extraction, especially for tables, using different models/libraries
In short:
- LlamaParse wins on speed and accuracy. It’s the fastest overall and produces the cleanest output, but it requires sending PDFs to LlamaCloud.
- Marker is the best local option. It’s faster than Docling and handles simple tables well, but it merges columns on dense layouts.
- Docling is the slowest of the three and prone to hallucinating values on dense tables.
- In short, neither of the new VLM-OCR tools beats LlamaParse on this PDF
  - olmOCR-2, a 7B open-weight OCR model from Allen AI
  - PaddleOCR-VL 1.6, a 1B document parser from Baidu with a layout-detection pipeline
When to use each:
- Use LlamaParse if your documents aren’t sensitive and you want the best accuracy.
- Use Marker if you must stay local.
- Use Docling for its broader document conversion features beyond just table extraction like chunking and RAG

Feature	Docling	Marker	LlamaParse
Table detection	Vision-language model (local)	5-stage specialized pipeline (local)	LLM agent (cloud)
Multi-level headers	Returns integer column names; mishandles parent groups	Keeps as separate rows with <br> tags	Flattens with <br/> tags, preserves grouping
Dense numeric tables	Hallucinates values, repetition loops	Merges columns, packs values into single cells	Extracts all values correctly
Speed (6-page PDF)	~1 min 50s	~47s	~8.54s
Dependencies	docling[vlm] + mlx-vlm (Apple) or transformers	marker-pdf	API key
Pricing	Free (MIT)	Free (GPL-3.0)	Free tier (10k credits/month)

Feature	Docling	Marker	LlamaParse	olmOCR-2	PaddleOCR-VL 1.6
Approach	Vision-language model (local)	Pipeline (local)	LLM agent (cloud)	Vision-language model (local)	Pipeline (local)
Tables detected (3 in PDF)	2	3	3	3	2
Accuracy overall	Poor: hallucinates values on dense tables	Mixed: column collapse on borderless tables	High: values correct, structure flattened	Mixed: silent character errors (digit drift, hyphen→decimal)	High: values correct, header grouping mis-aligned
Speed (M5 Pro, 9-page PDF)	~1 min 50s	~47s	~8.54s	~5 min 34s	~7 min 56s
Pricing	Free (MIT)	Free (GPL-3.0)	Free tier (10k credits/month)	Free (Apache 2.0)	Free (Apache 2.0)

Sources: