Tesseract OCR — Open Source Text Recognition Engine
Welcome to the documentation for Tesseract OCR. Learn how to install, configure, and scale one of the most widely used open-source Optical Character Recognition (OCR) engines.
What is Tesseract?
Tesseract is an OCR engine that converts the pixels of an image into searchable text, optionally with layout information such as word bounding boxes. Development began at Hewlett-Packard in 1985; the code was released as open source in 2005 and is now maintained by the open source community. It supports more than 100 languages, and since version 4 its default recognizer is based on Long Short-Term Memory (LSTM) neural networks.
Installation
Tesseract is a C++ library with a command-line front end, so the easiest way to install it is through your system's package manager.
macOS
Homebrew is a common installation method on macOS (Apple Silicon and Intel). The first command installs the engine; the second adds the additional language data.
brew install tesseract
brew install tesseract-lang
Ubuntu / Debian
The standard `apt` repositories carry stable versions of Tesseract.
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-all
Windows
Pre-compiled Windows installers are provided by UB Mannheim. Package managers such as `scoop` and `winget` also provide Tesseract packages.
scoop install tesseract
scoop install tesseract-languages
Quickstart & Basic CLI
Tesseract is fundamentally a command-line tool. Recognizing an image and writing the text to a file takes a single command.
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfile...]
Example: Extract English Text
To extract text from `invoice.png` and save it to `invoice_result.txt`:
tesseract invoice.png invoice_result -l eng
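For scripted use, the same invocation can be assembled as an argument list and run with Python's `subprocess` module. This is a minimal sketch, not part of Tesseract: `build_command` and `run_ocr` are hypothetical helper names, and it assumes the `tesseract` binary is on `PATH`.

```python
import shutil
import subprocess


def build_command(image, outputbase, lang=None, psm=None, configs=()):
    """Assemble a tesseract argv list (hypothetical helper, not a Tesseract API)."""
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    cmd += list(configs)  # output configs such as "pdf", "hocr" or "tsv"
    return cmd


def run_ocr(image, outputbase, **options):
    """Run tesseract; raises if the binary is missing or exits non-zero."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract binary not found on PATH")
    subprocess.run(build_command(image, outputbase, **options), check=True)
```

Calling `run_ocr("invoice.png", "invoice_result", lang="eng")` would run the command shown above. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.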
Page Segmentation Modes (PSM)
By default, Tesseract expects a page of text. If your image contains a single word, a vertical block of Japanese, or a sparse diagram, set a Page Segmentation Mode with `--psm`. The most commonly used modes are listed here; `tesseract --help-psm` prints the full list.
- 0: Orientation and script detection (OSD) only.
- 1: Automatic page segmentation with OSD.
- 3: Fully automatic page segmentation, but no OSD. *(Default)*
- 4: Assume a single column of text of variable sizes.
- 6: Assume a single uniform block of text.
- 7: Treat the image as a single text line.
- 11: Sparse text. Find as much text as possible in no particular order.
- 13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
To treat the image as a single line of text and print the result to standard output:
tesseract barcode.png stdout --psm 7
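When images arrive from different sources, it can help to keep the mode numbers in one place rather than scattering them through scripts. The sketch below covers only the modes listed above; the layout labels (`"page"`, `"line"`, and so on) are hypothetical names chosen for this example, not Tesseract terminology.

```python
from enum import IntEnum


class PSM(IntEnum):
    """Subset of Tesseract page segmentation modes listed above."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SPARSE_TEXT = 11
    RAW_LINE = 13


# Hypothetical layout labels for this example; adjust to your own data.
LAYOUT_TO_PSM = {
    "page": PSM.AUTO,
    "column": PSM.SINGLE_COLUMN,
    "block": PSM.SINGLE_BLOCK,
    "line": PSM.SINGLE_LINE,
    "sparse": PSM.SPARSE_TEXT,
}


def psm_args(layout):
    """Return CLI arguments for a layout label, falling back to the default (3)."""
    mode = LAYOUT_TO_PSM.get(layout, PSM.AUTO)
    return ["--psm", str(int(mode))]
```

For instance, `psm_args("line")` yields the `--psm 7` arguments used in the command above.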
OCR Engine Modes (OEM)
Tesseract ships two very different recognition engines. You can switch between them with the `--oem` option.
- 0: Legacy engine only. (Character-pattern recognition, the engine used before version 4.)
- 1: Neural nets LSTM engine only. (Line-based sequence recognition, generally the more accurate of the two.)
- 2: Legacy + LSTM engines combined. (Requires `.traineddata` files that contain both models.)
- 3: Default, based on what is available in your `.traineddata` models.
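The engine mode and the page segmentation mode are usually passed together. As an illustration, the hypothetical helper below validates both values and builds the option string; the ranges it checks (OEM 0 to 3, PSM 0 to 13) are those of Tesseract 4 and 5.

```python
VALID_OEM = {0, 1, 2, 3}


def tesseract_config(oem=3, psm=3):
    """Build an option string such as '--oem 1 --psm 6' (hypothetical helper)."""
    if oem not in VALID_OEM:
        raise ValueError(f"unsupported OEM value: {oem}")
    if not 0 <= psm <= 13:
        raise ValueError(f"unsupported PSM value: {psm}")
    return f"--oem {oem} --psm {psm}"
```

The resulting string can be appended to a CLI call, or passed to pytesseract through its `config` parameter, for example `pytesseract.image_to_string(img, config=tesseract_config(oem=1, psm=6))`.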
Output Formats
Plain text to `stdout` or a `.txt` file is the most common output, but Tesseract can also emit layout data (hOCR, TSV, ALTO) and searchable PDFs.
Searchable PDFs
To convert an image to a bundled, searchable PDF where the recognized text is laid invisibly over the raw image:
tesseract document.tif output_name pdf
Invisible Text Only PDF
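Useful if a later pipeline step overlays the recognized text on an existing PDF. `textonly_pdf` is a configuration variable rather than an output config, so set it with `-c` alongside `pdf`:
tesseract scan.png output -c textonly_pdf=1 pdf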
hOCR / TSV / ALTO
If you need the pixel bounding box and confidence score of every recognized word, use one of the layout output formats.
tesseract input.png out hocr
tesseract input.png out tsv
tesseract input.png out alto
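The TSV output is the simplest of these to post-process. The sketch below reads it with Python's `csv` module and keeps word-level rows (level 5) above a confidence threshold. The sample rows are made up for illustration and `confident_words` is a hypothetical helper; real input would be the contents of the `out.tsv` file written by the command above.

```python
import csv
import io

# Made-up sample in the TSV column layout, for illustration only.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t104\t92\t48\t24\t41.0\t#l23\n"
)


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, (left, top, width, height), conf) for words at or above min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        conf = float(row["conf"])
        text = (row["text"] or "").strip()
        # Level 5 rows are words; page, block, paragraph and line rows have conf -1.
        if row["level"] != "5" or not text or conf < min_conf:
            continue
        box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
        words.append((text, box, conf))
    return words
```

`csv.QUOTE_NONE` is used because the file is plain tab-separated text, so a quotation mark inside a recognized word should not be treated as a field delimiter.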
Programming Wrappers
Do you want to integrate Tesseract inside a web application or microservice? The open source community has built wrappers for nearly every language.
Python (pytesseract)
pytesseract calls the `tesseract` executable, so the CLI must be installed and on `PATH`. Install the Python packages with `pip install pytesseract pillow`.
import pytesseract
from PIL import Image

# Load the image with Pillow and run OCR with the default language (eng)
img = Image.open('image.png')
text = pytesseract.image_to_string(img)
print(text)
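Beyond plain strings, `pytesseract.image_to_data` returns the same word-level fields as the TSV output. The sketch below regroups those words into lines; `lines_from_data` and `ocr_lines` are hypothetical helper names, and the dictionary layout assumed here is the one produced with `output_type=pytesseract.Output.DICT`.

```python
from collections import defaultdict


def lines_from_data(data, min_conf=0):
    """Rebuild text lines from a dict shaped like pytesseract's Output.DICT result."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Skip empty entries and structural rows, which carry a confidence of -1.
        if not str(word).strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(str(word))
    return [" ".join(words) for _, words in sorted(lines.items())]


def ocr_lines(image_path, lang="eng"):
    """Run OCR and return text lines; needs pytesseract, Pillow and the tesseract binary."""
    import pytesseract  # imported lazily so lines_from_data works without it
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )
    return lines_from_data(data)
```

Raising `min_conf` drops low-confidence words before the lines are joined, which is often a cheap way to reduce noise from stains or graphics.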
Node.js JavaScript (tesseract.js)
tesseract.js is a WebAssembly build of Tesseract with a JavaScript API. It runs in the browser and in Node.js without a native Tesseract installation.
const Tesseract = require('tesseract.js');

Tesseract.recognize(
  'https://tesseract.project/image.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
});
Training Custom OCR Models
Tesseract 5 LSTM models are trained and fine-tuned with the `tesstrain` project. Training requires `make` and carefully prepared ground-truth data, which takes considerable effort to produce.
The tesstrain Repository
Unlike version 3, which relied heavily on manual box-file editing, version 5 training is driven by a Makefile that chains the individual training steps.
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
make tesseract-langdata
For deep knowledge on curating Ground Truth (GT) and fine-tuning epochs, refer directly to the `tesstrain` GitHub repository.
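Before starting a training run, it is worth checking that every line image has a transcription. The sketch below assumes the layout described in the `tesstrain` README: line images (`.tif`, `.png`, `.bin.png` or `.nrm.png`) sit next to single-line transcriptions with the same base name and a `.gt.txt` extension. `missing_transcriptions` is a hypothetical helper, and the conventions should be confirmed against your checkout of the repository.

```python
from pathlib import Path

# Image suffixes assumed from the tesstrain README; longer suffixes are checked first.
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def missing_transcriptions(ground_truth_dir):
    """Return the names of line images that lack a matching .gt.txt file."""
    missing = []
    for path in sorted(Path(ground_truth_dir).iterdir()):
        name = path.name
        for suffix in IMAGE_SUFFIXES:
            if name.endswith(suffix):
                gt_file = path.with_name(name[: -len(suffix)] + ".gt.txt")
                if not gt_file.is_file():
                    missing.append(name)
                break  # stop at the first (longest) matching suffix
    return missing
```

An empty result means every image in the directory is paired; anything returned should be transcribed or removed before training.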