

What is "We benchmarked 18 LLMs on OCR (7K+ calls)"?
Arbitr is a benchmarking platform that evaluates large language models (LLMs) on optical character recognition (OCR) tasks. The team tested 18 LLMs across more than 7,000 calls to measure OCR performance and cost-effectiveness. Their key finding is that cheaper models often outperform expensive alternatives for this use case. The published leaderboards offer transparent, data-driven comparisons to help teams choose a model for their OCR needs, which is particularly useful for organisations building AI agents or systems that need reliable document processing without excessive spending.
Key Features
- OCR benchmarking across 18 LLMs with 7,000+ test calls
- Performance leaderboards showing accuracy and cost metrics
- Cost-effectiveness analysis comparing expensive versus budget models
- Data-driven model comparison for informed selection
- Freemium access to benchmark results and leaderboards
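To make the cost-effectiveness idea concrete, here is a minimal Python sketch of one way to rank models by accuracy per unit of spend. It is an illustration only: the `ModelResult` type, the `rank_by_value` helper, the model names and every figure are hypothetical placeholders, not Arbitr's method or data.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str           # hypothetical model label
    accuracy: float     # fraction of OCR outputs judged correct, 0..1
    cost_per_1k: float  # assumed cost in USD per 1,000 calls


def rank_by_value(results, min_accuracy=0.0):
    """Sort models by accuracy per dollar, best value first.

    Models below min_accuracy, or with a non-positive cost, are dropped.
    """
    eligible = [
        r for r in results
        if r.accuracy >= min_accuracy and r.cost_per_1k > 0
    ]
    return sorted(eligible, key=lambda r: r.accuracy / r.cost_per_1k, reverse=True)


# Placeholder figures for illustration only; not taken from any leaderboard.
results = [
    ModelResult("premium-model", accuracy=0.96, cost_per_1k=10.00),
    ModelResult("budget-model", accuracy=0.94, cost_per_1k=0.50),
    ModelResult("mid-model", accuracy=0.95, cost_per_1k=2.00),
]

ranked = rank_by_value(results, min_accuracy=0.90)
for r in ranked:
    print(f"{r.name}: accuracy {r.accuracy:.2f}, USD {r.cost_per_1k:.2f} per 1k calls")
```

The `min_accuracy` floor matters in practice: a very cheap model that misreads too many documents should be excluded before value is compared, so set the floor to whatever error level the workflow can tolerate.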
Pros & Cons
Advantages
- Provides independent, transparent benchmark data rather than relying on vendor claims
- Highlights that expensive models are not always better, potentially saving significant costs
- Large test dataset (7,000+ calls) increases confidence in results
- Free access to leaderboards makes research available to all teams
Limitations
- Focuses specifically on OCR performance; results may not generalise to other LLM tasks
- Limited information on whether benchmarks cover different document types or languages
- Platform appears to be a reference resource rather than a deployment tool for running OCR directly
Use Cases
- Choosing an LLM for document processing pipelines in AI agents
- Evaluating cost-benefit of different models for expense report automation
- Selecting models for invoice or receipt scanning systems
- Cost optimisation for document-heavy workflows
- Comparing model performance before committing to a specific LLM provider
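Teams that want to check a shortlisted model against their own documents could score its output with a simple metric. The sketch below computes character error rate (CER), a common OCR measure, using a standard Levenshtein edit distance. The listing does not say which accuracy metric the benchmark uses, so CER is an assumption here, and the function names and sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; 0.0 means a perfect match."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up sample: ground truth versus a model output that misread "0" as "O".
reference = "Invoice 1042"
model_output = "Invoice 1O42"
print(f"CER: {character_error_rate(reference, model_output):.3f}")
```

Averaging CER over a small set of representative invoices or receipts gives a rough, workflow-specific check to set beside published leaderboard figures.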