Tesseract OCR — The World's Best Open Source OCR Engine

60k+

GitHub Stars

100+

Languages Supported

40yrs

Proven Heritage

99%+

Recognition Accuracy

$0

Cost — Forever Free

tesseract_demo.sh

$ # Extract text from a scanned document

$ tesseract document.tiff outputbase -l eng --psm 3

Tesseract Open Source OCR Engine v5.3.0 with Leptonica

Page 1

Estimating resolution as 300

$ cat outputbase.txt

The quick brown fox jumps over the lazy dog. This text was successfully extracted with 99.8% confidence.

Three Steps to Structured Data.

Integrate Tesseract into your pipeline to turn arbitrary image data into structured text and searchable PDFs instantly.

01

Provide Image

Pass TIFF, JPEG, PNG, or WebP inputs. Tesseract leverages Leptonica for robust image processing, binarization, and scaling.

02

Layout Analysis

The engine segments the page into blocks of text, lines, and words. Choose from 14 Page Segmentation Modes (PSM) to match the layout of your input.

03

Execute LSTM

The LSTM neural network recognizes each text line as a continuous sequence. Tesseract supports various output formats: plain text, hOCR (HTML), standard PDF, invisible-text-only PDF, TSV, ALTO, and PAGE.
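The three steps above collapse into a single CLI call. The following is a minimal sketch, assuming the `tesseract` binary is on your PATH; `build_ocr_command` and `run_ocr` are hypothetical helper names, and the file names are placeholders.

```python
import subprocess


def build_ocr_command(image, output_base, lang="eng", psm=3, formats=("txt",)):
    """Assemble a tesseract CLI call: image, output base, options, output formats."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang, "--psm", str(psm)]
    # Trailing names such as txt, pdf, hocr, or tsv select the output formats.
    cmd.extend(formats)
    return cmd


def run_ocr(image, output_base, **options):
    """Run tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(image, output_base, **options), check=True)
```

For example, `run_ocr("document.tiff", "outputbase", formats=("txt", "pdf"))` would write both `outputbase.txt` and `outputbase.pdf`.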

📥 How to install Tesseract OCR

Zero to OCR in minutes. Select your operating system below for the Tesseract OCR download and install steps.

macOS Ubuntu / Debian Windows

Step 1

Install Library

Tesseract is natively compiled. Install the core CLI engine via Homebrew.

os_install.sh

$ brew install tesseract

Step 2

Download Languages

Fetch .traineddata files for non-English languages.

lang_install.sh

$ brew install tesseract-lang

Step 3

Test Installation

Verify exactly which version is currently active.

check_version.sh

$ tesseract --version

Step 1

Install Library

Tesseract is available directly in the apt repository.

apt_install.sh

$ sudo apt install tesseract-ocr

Step 2

Download Languages

Fetch .traineddata files for non-English languages.

apt_lang.sh

$ sudo apt install tesseract-ocr-all

Step 3

Test Installation

Verify exactly which version is currently active.

verify_tess.sh

$ tesseract --version

Step 1

Tesseract OCR Windows Installer

Building Tesseract from source on Windows is cumbersome. The de facto standard is to download the pre-compiled installer maintained by UB Mannheim (the Mannheim University Library).

Download UB Mannheim

Step 2

Configure System PATH (Crucial Step)

A missing PATH entry is the most common reason Tesseract fails on Windows. You must add the installation directory to your system environment variables.

Open the Windows Start Menu and search for Environment Variables.
Click Edit the system environment variables.
Click the Environment Variables... button at the bottom right.
Under the System variables (bottom half), find and double-click the Path variable.
Click New and paste: C:\Program Files\Tesseract-OCR
Click OK on all open dialogs, then completely restart your command prompt or VS Code instance.
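To confirm the PATH change took effect, you can check from Python whether the executable resolves. This is a rough sketch: `locate_tesseract` is a hypothetical helper, and the fallback directory is the default path quoted in the steps above, which may differ on your machine.

```python
import os
import shutil

# Default install directory from the steps above; adjust if you installed elsewhere.
DEFAULT_WINDOWS_DIR = r"C:\Program Files\Tesseract-OCR"


def locate_tesseract(which=shutil.which, exists=os.path.exists):
    """Return the tesseract executable path, or None if it cannot be found."""
    found = which("tesseract")
    if found:
        return found  # PATH is configured correctly
    # PATH lookup failed: try the default install location directly.
    fallback = DEFAULT_WINDOWS_DIR + "\\tesseract.exe"
    return fallback if exists(fallback) else None
```

If the function returns the fallback path, Tesseract is installed but PATH is not set. pytesseract users can work around that by assigning the path to `pytesseract.pytesseract.tesseract_cmd`.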

⚡ Quick Command Cheatsheet

The most frequently used Tesseract CLI commands for daily OCR tasks.

Basic Text Extraction

Extract text from an image to a plain .txt file.

tesseract image.png output

Specify Language

Use the -l flag for a specific language (e.g., German).

tesseract image.png output -l deu

Multiple Languages

Combine languages with a + sign for mixed-language documents.

tesseract image.png output -l eng+fra

Output as PDF

Generate a searchable PDF with invisible selectable text.

tesseract image.png output pdf

Extract Bounding Boxes (hOCR)

Get exact coordinates of every recognized word in HTML structure.

tesseract image.png output hocr

Custom Layout Analysis

Assume a single uniform block of text using --psm 6.

tesseract image.png output --psm 6

List Available Languages

Show all downloaded language models available on your system.

tesseract --list-langs

Extract to TSV (Spreadsheet)

Export detailed word-by-word data and confidence scores.

tesseract image.png output tsv
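The TSV output is easy to post-process. Below is a small sketch that filters words by confidence, assuming the standard TSV column names (`left`, `top`, `width`, `height`, `conf`, `text`). `parse_tsv` is a hypothetical helper and `SAMPLE_TSV` is hand-written illustration data, not real Tesseract output.

```python
import csv
import io

# Hand-written sample in the TSV layout; values are illustrative only.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t36\t92\t500\t24\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tThe\n"
    "5\t1\t1\t1\t1\t2\t110\t92\t80\t24\t41.0\tqu1ck\n"
)


def parse_tsv(tsv_text, min_conf=0.0):
    """Return (text, confidence, (left, top, width, height)) for each word."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block, paragraph, and line rows carry conf -1 and no text.
        if not text or conf < min_conf:
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append((text, conf, box))
    return words
```

Calling `parse_tsv(SAMPLE_TSV, min_conf=50)` keeps only the high-confidence word, which is a simple way to flag regions for manual review.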

A complete documentation engine.

Don't just extract characters. Extract context, layout, formatting, and high-fidelity representations of physical documents.

Core Engine Architecture

Deep Learning with LSTM

Modern Tesseract (v4 and later) uses Long Short-Term Memory (LSTM) recurrent neural networks to achieve high OCR accuracy. Unlike traditional per-character pattern matching, the LSTM processes each text line as a sequence, so surrounding characters inform every prediction, which reduces recognition errors in complex documents.

100+ OCR Languages

Drop-in models available for Latin, Cyrillic, Arabic, CJK, and Indic scripts. Switch between languages with a single flag, or load several at once for mixed-language documents.

Searchable PDF Generation

Feed a massive batch of unsearchable TIFs or JPEGs directly into Tesseract and receive a fully bundled, searchable PDF document with the invisible text perfectly overlaid on top of the original images. One command. Unlimited scaling.
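For multi-page output, the CLI accepts a plain text file listing one image path per line in place of a single image. A minimal sketch of preparing such a list follows; the helper names and the `archive` output base are placeholders chosen for illustration.

```python
def image_list_text(image_paths):
    """Render one image path per line, the list format tesseract accepts as input."""
    return "".join(f"{path}\n" for path in image_paths)


def multipage_pdf_command(list_file, output_base="archive"):
    """Build the CLI call that turns every listed image into one searchable PDF."""
    return ["tesseract", str(list_file), str(output_base), "pdf"]
```

Write the result of `image_list_text(...)` to a file such as `pages.txt`, then run the command returned by `multipage_pdf_command("pages.txt")` to produce `archive.pdf`.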

Bring Your Own Data

Fine-tune the neural net on custom fonts or historical manuscripts. The `tesstrain` repo makes it simple to compile your own `.traineddata` models.

Engine Modes (OEM)

Choose between the legacy engine, the LSTM neural network, or both combined by passing the --oem flag. The engine mode is fixed when Tesseract initializes, so use separate invocations if different document types need different engines.
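The `--oem` flag takes a value from 0 to 3. The sketch below names those values; `EngineMode` and `oem_args` are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """Values accepted by tesseract's --oem flag."""
    LEGACY_ONLY = 0      # original pattern-based engine
    LSTM_ONLY = 1        # neural network engine
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # based on what the installed models support


def oem_args(mode):
    """Return the CLI arguments that select an engine mode."""
    return ["--oem", str(int(EngineMode(mode)))]
```

Note that modes 0 and 2 only work with `.traineddata` files that include the legacy model. The `tessdata_fast` and `tessdata_best` model sets are LSTM-only.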

Layout Introspection

Export in hOCR or ALTO formats. Get exact bounding boxes, font size estimations, baseline geometry, and confidence scores for every single word.
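As an illustration of reading that data back, here is a sketch that pulls word boxes and confidences out of hOCR using only the standard library. It assumes words are `span` elements with class `ocrx_word` whose `title` attribute carries `bbox` and `x_wconf` properties; `HocrWords` is a hypothetical class name.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF = re.compile(r"x_wconf (\d+(?:\.\d+)?)")


class HocrWords(HTMLParser):
    """Collect (text, bbox, confidence) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # [text, bbox, conf] while a word span is open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            title = attrs.get("title") or ""
            box = BBOX.search(title)
            conf = WCONF.search(title)
            self._current = [
                "",
                tuple(map(int, box.groups())) if box else None,
                float(conf.group(1)) if conf else None,
            ]

    def handle_data(self, data):
        if self._current is not None:
            self._current[0] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(tuple(self._current))
            self._current = None
```

Feed the contents of an `.hocr` file to `HocrWords().feed(...)` and read the `words` attribute afterwards. Each bbox is `(x0, y0, x1, y1)` in pixels.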

Performance & Accuracy Benchmarks.

Empirical data demonstrating Tesseract v5's capabilities across varying image qualities and resolutions.

Input DPI	Character Accuracy	Processing Speed	Recommended Use
150 DPI	~88.5%	Fast (~0.8s)	Draft/Preview only
300 DPI	99.2%	Optimal (~1.2s)	Standard Documents
600 DPI	99.7%	Slow (~3.5s)	High-fidelity Archiving

Note on Accuracy: These benchmarks were recorded using the -l eng (English) LSTM model on standard printed black-on-white text. Results vary by language model and image noise levels.
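If a scan falls below the 300 DPI row in the table, the resize factor needed to reach it is simple arithmetic. A short sketch, with hypothetical helper names:

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Factor by which to resize an image so its resolution reaches target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Never downscale: images already at or above the target keep factor 1.0.
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width_px, height_px, current_dpi, target_dpi=300):
    """Pixel dimensions after upscaling to the target resolution."""
    factor = upscale_factor(current_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)
```

A 1240 x 1754 pixel page scanned at 150 DPI would be resized to 2480 x 3508 pixels. Upscaling cannot restore detail the scanner never captured, so rescanning at a higher resolution is preferable where possible.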

Built on Decades of Research.

Tesseract is not just a tool; it's a milestone in computational linguistics. Its development is documented in decades of peer-reviewed research that forms the foundation of modern AI-driven text recognition.

Ray Smith (2007): "An Overview of the Tesseract OCR Engine" — The defining paper on the engine's layout and word analysis logic.

Ray Smith (1994): "A Comparison of Tesseract and Other OCR Systems" — Historic benchmark study of early-stage Tesseract performance.

Real-world automation.

Tesseract doesn't just read words, it enables completely autonomous data pipelines across massive global industries.

$42.50

{ "total": 42.50 }

FinTech Expense Automation

Integrate Tesseract into FinTech apps to instantly extract totals, dates, and line items from smartphone photos of user receipts.

XYZ-1234

Smart City Vehicle Recognition

Power open-source ANPR (Automatic Number Plate Recognition) for intelligent parking systems or smart city traffic monitoring using the Tesseract engine.

KYC & Onboarding

Streamline user verification pipelines by programmatically extracting names and IDs from passports and licenses.

PDF

Mass Digitization

Convert millions of unsearchable library archives or corporate TIFFs into perfectly searchable, bundled digital PDFs.

Tesseract OCR Languages Support

Tesseract supports over 100 languages and scripts out of the box. Search for your language below and get the direct .traineddata model download link.

Usage: After downloading, place the .traineddata file in your TESSDATA_PREFIX directory, then run:

tesseract image.png output -l ara

For multiple languages simultaneously:

tesseract image.png output -l eng+hin+fra
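Before running a multi-language job, it can help to confirm every requested model is actually present. A minimal sketch, assuming models are stored as `<code>.traineddata` directly inside the directory you pass in; `missing_models` is a hypothetical helper.

```python
import os


def missing_models(lang_spec, tessdata_dir, is_file=os.path.isfile):
    """Return language codes from a spec like 'eng+hin+fra' with no .traineddata file."""
    return [
        code
        for code in lang_spec.split("+")
        if not is_file(os.path.join(tessdata_dir, f"{code}.traineddata"))
    ]
```

For example, `missing_models("eng+hin+fra", os.environ["TESSDATA_PREFIX"])` returns an empty list when all three models are installed.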

Tesseract GUI Frontends

Tesseract itself ships only as a command-line program and a C/C++ library, with no graphical interface. However, the open-source community has built excellent Tesseract GUI applications powered by the engine for users who prefer graphical interfaces.

Application	Platform	Price	Best Used For
gImageReader	Windows, Linux	Free	Comprehensive document processing and manual layout management
Capture2Text	Windows	Free	Lightning-fast screen capture OCR with keyboard shortcuts
NAPS2	Windows, Mac, Linux	Free	All-in-one document scanning, organizing, and OCR PDF creation
FreeOCR	Windows	Free	Beginners needing a classic, straightforward desktop utility
Tesseract-UI / Normcap	Mac, Windows, Linux	Free	Simple drag-and-drop file processing and cross-platform screen grabbing

Note: OCR accuracy depends in part on the version of the engine installed beneath these wrappers, as well as on image quality. Wrappers that support Tesseract v5 are recommended.

Page Segmentation Modes (PSM)

Tesseract doesn't force a one-size-fits-all approach. Documents are messy. That's why Tesseract exposes 14 different Page Segmentation Modes.

PSM 3: Fully automatic page segmentation, but no OSD. (Default)
PSM 6: Assume a single uniform block of text. Great for books.
PSM 11: Sparse text. Find as much text as possible in no particular order.
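For reference, all 14 modes can be kept in a lookup table. The descriptions below paraphrase the CLI help text, and `psm_args` is a hypothetical helper that validates a mode before building the flag.

```python
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
    11: "Sparse text, in no particular order",
    12: "Sparse text with OSD",
    13: "Raw line, bypassing Tesseract-specific hacks",
}


def psm_args(mode):
    """Return the CLI arguments for a page segmentation mode, rejecting unknown values."""
    if mode not in PSM_MODES:
        raise ValueError(f"unknown PSM {mode!r}; expected 0-13")
    return ["--psm", str(mode)]
```

A receipt photo with scattered text might use `psm_args(11)`, while a cropped single line would use `psm_args(7)`.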


pipeline.py

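import pytesseract
from PIL import Image

def extract_document(path):
    """Return the text Tesseract recognizes in the image at `path`."""
    # No lang or config is passed, so Tesseract's defaults apply.
    img = Image.open(path)
    return pytesseract.image_to_string(img)

target_text = extract_document('receipt.jpg')
print(target_text)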

Tesseract OCR Python Integration

Because Tesseract is a compiled C/C++ library available through all major OS package managers, the wrapper ecosystem is massive.

Whether you're building an Express backend with `tesseract.js`, a Python data pipeline with `pytesseract`, or a Go microservice with `gosseract`, a basic integration takes only a few lines of code. Tesseract acts as the silent, invisible workhorse underpinning your application.
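In Python, CLI options such as the page segmentation and engine modes are passed to `pytesseract` through its `config` string. A minimal sketch, assuming `pytesseract` and Pillow are installed; `tesseract_config` and `extract_text` are hypothetical helper names.

```python
def tesseract_config(psm=3, oem=3):
    """Build the config string that pytesseract forwards to the tesseract CLI."""
    return f"--oem {oem} --psm {psm}"


def extract_text(path, lang="eng", psm=3, oem=3):
    """OCR one image file with explicit language, PSM, and OEM settings."""
    # Imported lazily so the config helper works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(
            img, lang=lang, config=tesseract_config(psm, oem)
        )
```

For instance, `extract_text("receipt.jpg", lang="eng+fra", psm=6)` would treat the image as one uniform block of mixed English and French text.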

Why Tesseract is special.

Tesseract handles the hard orchestration of pixel-to-text reliably.

100% Free and Open.

No tokens, no API keys, no vendor lock-in. Process 1 page or 10 million pages at zero cost.

Offline by Default.

Absolute privacy. Your images never leave your server. Perfect for medical, legal, and financial PII.

Leptonica Powered.

Built-in integration with the Leptonica image processing library allows for implicit thresholding, scaling, and noise removal.

Continuous Evolution.

Tesseract was originally developed at HP starting in 1985, open-sourced in 2005, and developed further at Google from 2006 until 2018. It is currently maintained by Stefan Weil and Zdenko Podobny, honoring the legacy of Ray Smith.

Hardware Agnostic.

Runs on a Raspberry Pi, a standard web server, or high-end multicore clusters, with optional OpenMP parallelization.

Direction/Script Aware.

Detects page orientation and script type in OSD mode (for example --psm 0 or --psm 1, with the osd.traineddata model installed). The default --psm 3 skips this step.

Frequently asked questions about Tesseract OCR.

What does Tesseract OCR do in real life?

In the software world, Tesseract takes raw images containing text (like scanned invoices, receipts, and old books) and translates the pixels into machine-readable, searchable, and editable text. It is the engine powering thousands of data automation tools, digital archives, and mobile scanning apps worldwide.

Is Tesseract OCR free to use commercially?

Yes! Tesseract OCR is 100% free and open-source. It is released under the permissive Apache 2.0 License, allowing you to use it for personal, academic, and commercial production projects without paying any licensing or API fees whatsoever.

Is Tesseract OCR still good in 2026?

Absolutely. Since the introduction of the deep learning LSTM neural network engine in version 4 (further optimized in version 5), Tesseract remains highly competitive with expensive paid OCR APIs, especially for printed material, while offering the massive benefit of running entirely offline for absolute data privacy. As of 2026, it remains one of the most widely used open-source solutions for document digitization.

Is Tesseract OCR considered AI?

Yes. While older versions (v3) relied on traditional pattern-matching algorithms, modern Tesseract (v4 and v5) uses Long Short-Term Memory (LSTM) recurrent neural networks by default, a form of deep learning architecture designed to handle sequential data like text lines. The legacy engine remains available as an optional mode.

What is the difference between Tesseract v4 and v5?

Tesseract 5 offers substantially better performance (speed) than version 4 while retaining the highly accurate LSTM (Long Short-Term Memory) neural network models. Many memory leaks were fixed, and training tools were modernized. Tesseract 5 is the current recommended version for production deployments.

Does Tesseract include a graphical user interface (GUI)?

No. Tesseract is primarily a C/C++ library and a command-line interface (CLI) program. If you need a graphical user interface, you must use a third-party application built around the Tesseract engine. Popular GUI options include gImageReader and FreeOCR.

Does Tesseract work in the browser?

Yes. Projects like tesseract.js compile the C++ codebase to WebAssembly (WASM), allowing you to run Tesseract's neural net directly inside the user's browser without requiring a backend server — making it ideal for client-side OCR applications.

How do I train Tesseract OCR on a custom font?

You can use the tesstrain repository. It is a Makefile-based wrapper that automates the complex process of generating ground-truth training data, extracting features, and fine-tuning an existing .traineddata model to learn your new glyphs seamlessly.
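A common stumbling block is ground truth where an image has no transcription. The sketch below checks for that, assuming the layout described in the tesstrain README: each line image (`.tif`, `.png`, `.bin.png`, or `.nrm.png`) sits next to a transcription with the same base name and a `.gt.txt` extension. `unpaired_line_images` is a hypothetical helper.

```python
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")  # longest suffixes first


def unpaired_line_images(file_names):
    """Return line images that have no matching .gt.txt transcription."""
    names = set(file_names)
    unpaired = []
    for name in sorted(names):
        suffix = next((s for s in IMAGE_SUFFIXES if name.endswith(s)), None)
        if suffix is None:
            continue  # not a line image (for example, a .gt.txt file)
        if name[: -len(suffix)] + ".gt.txt" not in names:
            unpaired.append(name)
    return unpaired
```

Pass it the result of `os.listdir(...)` on your ground-truth folder. An empty list means every image has a transcription.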

Why is Tesseract OCR output gibberish sometimes?

Tesseract requires reasonably clean, high-contrast images. If your image is blurry, has low contrast, or contains complex graphical backgrounds, preprocess it (binarization, deskewing) with OpenCV or ImageMagick before passing it to Tesseract.

What image formats does Tesseract OCR support?

Tesseract natively supports PNG, JPEG, TIFF, BMP, PNM, GIF, and WebP image formats via the Leptonica image processing library. For best OCR accuracy, use lossless formats such as TIFF or PNG at 300 DPI or higher. JPEG compression can introduce artifacts that reduce recognition accuracy on fine characters.

How accurate is Tesseract OCR?

On clean, high-resolution scanned documents, Tesseract v5 achieves over 95% character accuracy and frequently exceeds 99% on standard printed text at 300 DPI. Accuracy degrades on low-quality scans, handwriting, stylized fonts, or noisy backgrounds. Image preprocessing (binarization, deskewing, noise removal) can dramatically improve results on difficult inputs.
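To measure character accuracy on your own documents, compare OCR output against a hand-checked reference. A common definition is 1 minus the character error rate (edit distance divided by reference length). The sketch below uses that definition, which may differ from how any published figure was computed; the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Return 1 - CER, floored at 0, where CER = edit distance / reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, ocr_output) / len(reference))
```

One wrong character in a five-character reference gives an accuracy of about 0.8, so scores above 0.99 require fewer than one error per hundred characters.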

Does Tesseract OCR work offline?

Yes. Tesseract is a 100% offline, local OCR engine. It processes images entirely on your own machine without sending any data to external servers. This makes it ideal for privacy-sensitive use cases involving medical records, legal documents, or financial data where cloud OCR APIs are not permitted.

Tesseract OCR vs Google Cloud Vision API — which should I use?

Tesseract OCR is free, open-source, and runs entirely offline — ideal for privacy-critical workflows, high-volume processing without per-call costs, and air-gapped environments. Google Cloud Vision API offers higher out-of-the-box accuracy on complex or low-quality images and handwriting, but charges per API call and requires sending images to Google's servers. Choose Tesseract when cost, privacy, or offline operation matters; choose Vision API when maximum accuracy on diverse inputs is the priority.

How do I improve Tesseract OCR accuracy?

The single biggest factor in Tesseract accuracy is image quality. Key preprocessing steps:

Upscale to 300 DPI or higher — Tesseract performs poorly on small or low-resolution images.
Binarize — Convert to grayscale and apply Otsu's thresholding for a clean black-and-white image.
Deskew — Correct any rotation or perspective distortion before processing.
Remove noise — Apply a median blur or morphological operations to clean up speckles.
Choose the correct PSM — Use --psm 6 for uniform blocks, --psm 7 for single lines, etc.
Use the right language model — e.g., -l eng+fra for mixed English/French documents.

OpenCV and ImageMagick are the most commonly used tools for building image preprocessing pipelines before Tesseract.
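To show what the binarization step does, here is a dependency-free sketch of Otsu's method on a flat list of 0-255 grayscale values. In practice you would normally call OpenCV's `cv2.threshold` with the `THRESH_OTSU` flag instead; the function names below are illustrative.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    below_count = 0  # pixels with value <= level
    below_sum = 0    # sum of those pixel values
    for level, count in enumerate(histogram):
        below_count += count
        below_sum += level * count
        above_count = total - below_count
        if below_count == 0 or above_count == 0:
            continue  # one class is empty, so variance is undefined
        mean_below = below_sum / below_count
        mean_above = (total_sum - below_sum) / above_count
        variance = below_count * above_count * (mean_below - mean_above) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map values above the threshold to white (255) and the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

On dark text over a light background, the text pixels end up black and the background white, which is the input Tesseract handles best.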

Ready to extract data?

From zero to autonomous OCR pipeline in one command.

$ tesseract scan.png stdout -l eng
