Tran: PDF Extraction - Docling vs. LlamaParse vs. Marker vs. others
- June 2
- 2 min read
Key Takeaways:

Tran ran a nice experiment on PDF extraction, with a focus on tables, comparing several models and libraries.
In short:
- LlamaParse wins on speed and accuracy. It’s the fastest overall and produces the cleanest output, but it requires sending PDFs to LlamaCloud.
- Marker is the best local option. It’s faster than Docling and handles simple tables well, but it merges columns on dense layouts.
- Docling is the slowest of the three and is prone to hallucinating values on dense tables.
Neither of the two new VLM-OCR tools tested beats LlamaParse on this PDF:
- olmOCR-2, a 7B open-weight OCR model from Allen AI
- PaddleOCR-VL 1.6, a 1B document parser from Baidu with a layout-detection pipeline
When to use each:
- Use LlamaParse if your documents aren’t sensitive and you want the best accuracy.
- Use Marker if you must stay local.
- Use Docling if you need its broader document-conversion features beyond table extraction, such as chunking and RAG.

| Feature | Docling | Marker | LlamaParse |
| --- | --- | --- | --- |
| Table detection | Vision-language model (local) | 5-stage specialized pipeline (local) | LLM agent (cloud) |
| Multi-level headers | Returns integer column names; mishandles parent groups | Keeps as separate rows with `<br>` tags | Flattens with `<br/>` tags, preserves grouping |
| Dense numeric tables | Hallucinates values, repetition loops | Merges columns, packs values into single cells | Extracts all values correctly |
| Speed (6-page PDF) | ~1 min 50s | ~47s | ~8.54s |
| Dependencies | docling[vlm] + mlx-vlm (Apple) or transformers | marker-pdf | API key |
| Pricing | Free (MIT) | Free (GPL-3.0) | Free tier (10k credits/month) |

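
The "Dense numeric tables" row is the kind of finding that is easy to miss by eye, so a cell-by-cell comparison against a hand-verified reference can help. The sketch below is only an illustration of that idea, not how Tran scored the tools: the function names, the `CellMismatch` type, and the sample table values are all made up for this example, and it assumes each tool's output has already been exported as a pipe-delimited Markdown table.

```python
from dataclasses import dataclass


@dataclass
class CellMismatch:
    row: int
    col: int
    expected: str
    actual: str


def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Split a pipe-delimited Markdown table into rows of stripped cell strings."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row such as | --- | :---: |.
        # Caveat: a data row made only of "-" cells would also be skipped.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def compare_tables(expected: list[list[str]], actual: list[list[str]]) -> list[CellMismatch]:
    """Report every cell that differs, treating missing cells as empty strings."""
    mismatches = []
    for r in range(max(len(expected), len(actual))):
        exp_row = expected[r] if r < len(expected) else []
        act_row = actual[r] if r < len(actual) else []
        for c in range(max(len(exp_row), len(act_row))):
            exp = exp_row[c] if c < len(exp_row) else ""
            act = act_row[c] if c < len(act_row) else ""
            if exp != act:
                mismatches.append(CellMismatch(r, c, exp, act))
    return mismatches


# Hypothetical data: a hand-checked reference and a parser output with
# a hyphen turned into a decimal point and one drifted digit.
REFERENCE_MD = """
| Item | Range | Value |
| --- | --- | --- |
| A | 10-12 | 1048 |
| B | 20-25 | 300 |
"""

EXTRACTED_MD = """
| Item | Range | Value |
| --- | --- | --- |
| A | 10.12 | 1043 |
| B | 20-25 | 300 |
"""

mismatches = compare_tables(
    parse_markdown_table(REFERENCE_MD),
    parse_markdown_table(EXTRACTED_MD),
)
for m in mismatches:
    print(f"row {m.row}, col {m.col}: expected {m.expected!r}, got {m.actual!r}")
```

An exact string comparison is deliberately strict here: merged columns show up as a run of mismatches in one row, while hallucinated or drifted values show up as isolated cells.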

| Feature | Docling | Marker | LlamaParse | olmOCR-2 | PaddleOCR-VL 1.6 |
| --- | --- | --- | --- | --- | --- |
| Approach | Vision-language model (local) | Pipeline (local) | LLM agent (cloud) | Vision-language model (local) | Pipeline (local) |
| Tables detected (3 in PDF) | 2 | 3 | 3 | 3 | 2 |
| Accuracy overall | Poor: hallucinates values on dense tables | Mixed: column collapse on borderless tables | High: values correct, structure flattened | Mixed: silent character errors (digit drift, hyphen→decimal) | High: values correct, header grouping mis-aligned |
| Speed (M5 Pro, 9-page PDF) | ~1 min 50s | ~47s | ~8.54s | ~5 min 34s | ~7 min 56s |
| Pricing | Free (MIT) | Free (GPL-3.0) | Free tier (10k credits/month) | Free (Apache 2.0) | Free (Apache 2.0) |
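
To get comparable speed numbers on your own PDFs, a small timing harness is enough. This is a minimal sketch and not the setup behind the figures above: `time_extractor` and `fake_extractor` are hypothetical names, and the stand-in extractor only sleeps. In practice you would pass a function that wraps the actual Docling, Marker, or LlamaParse call.

```python
import statistics
import time
from typing import Callable


def time_extractor(extract: Callable[[str], str], pdf_path: str, runs: int = 3) -> dict:
    """Call extract(pdf_path) several times and summarize wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        extract(pdf_path)
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "best_s": min(durations),
        "median_s": statistics.median(durations),
    }


def fake_extractor(pdf_path: str) -> str:
    """Stand-in for a real parser call (assumption): sleeps instead of parsing."""
    time.sleep(0.01)
    return f"# parsed {pdf_path}"


result = time_extractor(fake_extractor, "example.pdf", runs=3)
print(f"best {result['best_s']:.3f}s, median {result['median_s']:.3f}s over {result['runs']} runs")
```

For local tools the first run often includes model loading, so reporting both the best and the median run makes it clearer whether you are measuring cold-start or steady-state speed. For a cloud service, the measured time also includes upload and network latency.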