We benchmarked 18 LLMs on OCR (7K+ calls)

What is Arbitr?

Arbitr is a benchmarking platform that evaluates large language models (LLMs) on optical character recognition (OCR) tasks. The team tested 18 LLMs across more than 7,000 calls to measure OCR performance and cost-effectiveness. Their key finding is that cheaper models often outperform more expensive alternatives on OCR. The published leaderboards provide transparent, data-driven comparisons to help teams choose a model for their OCR needs. This is particularly useful for organisations building AI agents or other systems that need reliable document processing without excessive spending.

Key Features

OCR benchmarking across 18 LLMs with 7,000+ test calls

Performance leaderboards showing accuracy and cost metrics

Cost-effectiveness analysis comparing expensive versus budget models

Data-driven model comparison for informed selection

Freemium access to benchmark results and leaderboards

Pros & Cons

Advantages

  • Provides independent, transparent benchmark data rather than relying on vendor claims
  • Highlights that expensive models are not always better, potentially saving significant costs
  • Large test dataset (7,000+ calls) increases confidence in results
  • Free access to leaderboards makes research available to all teams

Limitations

  • Focuses specifically on OCR performance; results may not generalise to other LLM tasks
  • Limited information on whether benchmarks cover different document types or languages
  • Platform appears to be a reference resource rather than a deployment tool for running OCR directly

Use Cases

Choosing an LLM for document processing pipelines in AI agents

Evaluating cost-benefit of different models for expense report automation

Selecting models for invoice or receipt scanning systems

Cost optimisation for document-heavy workflows

Comparing model performance before committing to a specific LLM provider
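
For teams that export accuracy and cost figures from a leaderboard, the cost-versus-accuracy comparison described above can be sketched roughly as follows. This is a minimal illustration, not Arbitr's method: the `ModelResult` fields, the model names, and every number are placeholder assumptions, and the Pareto-front filter is just one simple way to find models that are not beaten on both accuracy and cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str                 # hypothetical model label, not a real leaderboard entry
    accuracy: float           # assumed fraction of OCR outputs judged correct, 0.0 to 1.0
    cost_per_1k_calls: float  # assumed cost in USD for 1,000 calls


def pareto_front(results):
    """Return the models that no other model beats on both accuracy and cost."""
    front = []
    for r in results:
        # r is dominated if another model is at least as accurate and at least
        # as cheap, and strictly better on one of the two.
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost_per_1k_calls <= r.cost_per_1k_calls
            and (o.accuracy > r.accuracy or o.cost_per_1k_calls < r.cost_per_1k_calls)
            for o in results
        )
        if not dominated:
            front.append(r)
    # Cheapest first, so the budget options are easy to spot.
    return sorted(front, key=lambda r: r.cost_per_1k_calls)


# Placeholder numbers for illustration only; they are not Arbitr's results.
results = [
    ModelResult("budget-a", accuracy=0.95, cost_per_1k_calls=1.0),
    ModelResult("budget-b", accuracy=0.90, cost_per_1k_calls=2.0),
    ModelResult("premium-c", accuracy=0.94, cost_per_1k_calls=20.0),
    ModelResult("premium-d", accuracy=0.97, cost_per_1k_calls=30.0),
]

for r in pareto_front(results):
    print(f"{r.name}: accuracy={r.accuracy:.2f}, cost per 1k calls=${r.cost_per_1k_calls:.2f}")
```

With these made-up figures, only the cheapest accurate model and the most accurate model survive the filter; the other two are beaten on both axes. Real figures from a leaderboard would replace the placeholder list.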