
Running Tesseract from the command line
Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:
    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
Note that Tesseract 4.0 and later spell the page segmentation option --psm rather than -psm.
So basic usage to do OCR on an image called ‘myscan.png’ and save the result to ‘out.txt’ would be:
    tesseract myscan.png out
Or to do the same with German:
    tesseract myscan.png out -l deu
It can also use the traineddata files of several languages at once, e.g. English and German:
    tesseract myscan.png out -l eng+deu
Tesseract also includes an hOCR mode, which produces a special HTML file containing the coordinates of each word. This file can be used to create a searchable PDF with a tool such as Hocr2PDF. To enable it, pass the ‘hocr’ config option, like this:
    tesseract myscan.png out hocr
You can also create a searchable PDF directly from Tesseract (versions >= 3.03):
    tesseract myscan.png out pdf
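The command forms above can also be assembled from a script. The following is a minimal sketch, assuming the tesseract executable is on the PATH; the helper names build_tesseract_command and run_tesseract are hypothetical and exist only for illustration.

```python
import subprocess


def build_tesseract_command(image, outputbase, langs=(), configs=()):
    """Build the argument list for: tesseract image outputbase [-l lang] [configfile...]."""
    cmd = ["tesseract", image, outputbase]
    if langs:
        # Several languages are joined with '+', e.g. eng+deu.
        cmd += ["-l", "+".join(langs)]
    # Config names such as 'hocr' or 'pdf' go last.
    cmd += list(configs)
    return cmd


def run_tesseract(image, outputbase, langs=(), configs=()):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    # Assumption: 'tesseract' is installed and reachable through PATH.
    subprocess.run(build_tesseract_command(image, outputbase, langs, configs), check=True)
```

For example, run_tesseract("myscan.png", "out", langs=["eng", "deu"], configs=["pdf"]) corresponds to combining the multi-language and PDF commands shown above.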
More information about the various options is available in the Tesseract manpage.
Other Languages
Tesseract has been trained for many languages; check for your language in the Tessdata repository.
For example, to add Chinese support, put chi_sim.traineddata into the tessdata directory, e.g. /usr/local/Cellar/tesseract/3.05.01/share/tessdata/ for a Homebrew install of version 3.05.01.
It can also be trained to support other languages and scripts; for more details see TrainingTesseract.
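Before running OCR with a given language, it can help to confirm that the matching traineddata file is actually in place. The sketch below assumes the <code>.traineddata naming shown above; missing_languages is a hypothetical helper, and the tessdata path depends on your installation.

```python
from pathlib import Path


def missing_languages(tessdata_dir, langs):
    """Return the codes in a '+'-joined language string that lack a .traineddata file."""
    tessdata = Path(tessdata_dir)
    return [
        code
        for code in langs.split("+")
        if not (tessdata / (code + ".traineddata")).is_file()
    ]
```

Calling missing_languages with your tessdata directory and "eng+chi_sim" returns an empty list when both files are installed.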
Running Tesseract with Python
Python-tesseract (pytesseract) is an optical character recognition (OCR) tool for Python. That is, it recognizes and “reads” the text embedded in images.
Usage
Quick start
    try:
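As a minimal sketch of the basic workflow, assuming the Pillow and pytesseract packages are installed and the tesseract executable is on the PATH, a file can be read and passed to image_to_string. The wrapper name ocr_file is hypothetical.

```python
def ocr_file(image_path, lang=None):
    """Return the text Tesseract recognizes in the image at image_path."""
    # Imported lazily so this module loads even where the OCR stack is absent.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # lang=None leaves the language choice to pytesseract's default.
        return pytesseract.image_to_string(img, lang=lang)
```

For instance, ocr_file("myscan.png", lang="deu") mirrors the German command-line example from earlier in this post.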
Support for OpenCV image/NumPy array objects
    import cv2
If you get a tessdata error such as “Error opening data file…”, add the following config:
    tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
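The double quotes around the directory matter when the path contains spaces. A small sketch of building that config string and handing it to pytesseract follows; tessdata_dir_option is a hypothetical helper, and the directory in the comment is only a placeholder.

```python
def tessdata_dir_option(tessdata_dir):
    """Build the --tessdata-dir option with the path wrapped in double quotes."""
    return '--tessdata-dir "{}"'.format(tessdata_dir)


# Usage sketch (assumes pytesseract is installed and `image` is already loaded):
#   config = tessdata_dir_option("/path/to/tessdata")
#   text = pytesseract.image_to_string(image, lang="chi_sim", config=config)
```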
Functions
- image_to_string Returns the result of a Tesseract OCR run on the image to string
- image_to_boxes Returns result containing recognized characters and their box boundaries
- image_to_data Returns result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the Tesseract TSV documentation
Parameters
image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING)
- image Object, PIL Image/NumPy array of the image to be processed by Tesseract
- lang String, Tesseract language code string
- config String, any additional configuration as a string, e.g. config='--psm 6'
- nice Integer, modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of Unix-like processes.
- output_type Class attribute, specifies the type of the output; defaults to string. For the full list of all supported types, please check the definition of the pytesseract.Output class.
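With the default string output, image_to_data returns Tesseract's TSV text. The sketch below shows one way to turn that text into word records and drop low-confidence words. It assumes the standard TSV columns (left, top, width, height, conf, text); words_from_tsv is a hypothetical helper, and SAMPLE_TSV is made-up data for illustration, not real OCR output.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=0.0):
    """Parse Tesseract TSV text and keep words with confidence >= min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block, and line rows carry conf -1 and no text; skip them.
        if text and conf >= min_conf:
            words.append({
                "text": text,
                "conf": conf,
                "left": int(row["left"]),
                "top": int(row["top"]),
                "width": int(row["width"]),
                "height": int(row["height"]),
            })
    return words


# Made-up sample in the TSV layout, for illustration only.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t20\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t110\t92\t70\t20\t41\twor1d\n"
)
```

In practice the input would come from pytesseract.image_to_data(image), and a threshold such as min_conf=60 is a tuning choice rather than a recommended value.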



