Using Tesseract

Running Tesseract from the command line

Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

So the basic usage to run OCR on an image called ‘myscan.png’ and save the result to ‘out.txt’ (Tesseract appends the ‘.txt’ extension to the output base) would be:

tesseract myscan.png out

Or to do the same with German:

tesseract myscan.png out -l deu

It can even use the traineddata of several languages at once, e.g. English and German:

tesseract myscan.png out -l eng+deu
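
The invocations above can also be scripted. The following is a minimal sketch, assuming Python's standard `subprocess` module and a `tesseract` executable on the PATH; the helper names `build_tesseract_command` and `run_tesseract` are hypothetical and not part of Tesseract.

```python
import subprocess


def build_tesseract_command(image, outputbase, langs=None, configs=()):
    """Build the argument list for:
    tesseract imagename outputbase [-l lang] [configfile...]
    """
    cmd = ["tesseract", image, outputbase]
    if langs:
        # Several languages are joined with '+', e.g. eng+deu
        cmd += ["-l", "+".join(langs)]
    cmd += list(configs)
    return cmd


def run_tesseract(image, outputbase, langs=None, configs=()):
    """Run Tesseract; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_tesseract_command(image, outputbase, langs, configs),
                   check=True)


# Only builds the command; nothing is executed here.
print(build_tesseract_command("myscan.png", "out", langs=["eng", "deu"]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.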

Tesseract also includes a hOCR mode, which produces a special HTML file with the coordinates of each word. This can be used to create a searchable pdf, using a tool such as Hocr2PDF. To use it, use the ‘hocr’ config option, like this:

tesseract myscan.png out hocr
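
To illustrate what "the coordinates of each word" looks like in practice, here is a rough sketch that pulls word boxes out of hOCR markup using only Python's standard library. The sample markup is a hand-written fragment modeled on the hOCR word format (`ocrx_word` spans with a `bbox` in the `title` attribute); real Tesseract output contains more elements and attributes. The names `HocrWordParser` and `parse_hocr_words` are hypothetical.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample fragment, not actual Tesseract output.
sample_hocr = (
    "<span class='ocr_line' title='bbox 36 92 300 116'>"
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 90'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 109 92 173 116; x_wconf 88'>world</span>"
    "</span>"
)
print(parse_hocr_words(sample_hocr))
```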

You can also create a searchable PDF directly from Tesseract (versions >= 3.03):

tesseract myscan.png out pdf

More information about the various options is available in the Tesseract manpage.

Other Languages

Tesseract has been trained for many languages; check for your language in the Tessdata repository.

For example, to make Tesseract support Simplified Chinese, put chi_sim.traineddata into the tessdata directory of your installation. For a Homebrew install of version 3.05.01 that is /usr/local/Cellar/tesseract/3.05.01/share/tessdata/
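
Before running OCR it can help to check that the language files are actually in place. This is a minimal sketch under the assumption that each language code maps to a file named `<code>.traineddata` directly inside the tessdata directory; `missing_languages` is a hypothetical helper, and the demo uses a temporary directory rather than a real installation.

```python
import tempfile
from pathlib import Path


def missing_languages(tessdata_dir, lang):
    """Return the codes in a string such as 'eng+chi_sim' that have no
    matching .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [code for code in lang.split("+")
            if not (directory / f"{code}.traineddata").is_file()]


# Demo against a throwaway directory that contains only chi_sim.traineddata.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "chi_sim.traineddata").touch()
    demo_missing = missing_languages(tmp, "eng+chi_sim")

print(demo_missing)
```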

It can also be trained to support other languages and scripts; for more details see TrainingTesseract.

Running Tesseract with Python

Python-tesseract (pytesseract) is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.

Usage

Quick start

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'

# Example tesseract_cmd: 'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))

# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))
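
With the default string output, `image_to_data` returns tab-separated text with a header row. The sketch below shows one way to filter that text by confidence using the standard `csv` module. The sample string is hand-written for illustration (it is not real OCR output), `confident_words` is a hypothetical helper, and the threshold of 60 is an arbitrary assumption.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, left, top, width, height) for words at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Rows for pages, blocks and lines carry no text and a conf of -1.
        if text and float(row["conf"]) >= min_conf:
            words.append((text, int(row["left"]), int(row["top"]),
                          int(row["width"]), int(row["height"])))
    return words


# Hand-written sample in the image_to_data column layout.
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t91\tHello\n"
    "5\t1\t1\t1\t1\t2\t109\t92\t64\t24\t42\tworld\n"
)
print(confident_words(sample_tsv))
```

In real use, `sample_tsv` would be replaced by the string returned from `pytesseract.image_to_data(...)`.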

Support for OpenCV image/NumPy array objects

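import cv2

img = cv2.imread('/**path_to_image**/digits.png')
# OpenCV loads images in BGR channel order, so convert to RGB before OCR
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img))
# Or convert explicitly to a PIL image beforehand
print(pytesseract.image_to_string(Image.fromarray(img)))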

If you get a tessdata error such as “Error opening data file…”, add the following config:

tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
# Example config: '--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.

pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
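
When several options have to be combined, the config string can be assembled programmatically. This is only a sketch: `build_config` is a hypothetical helper, and it covers just the two options that appear in this post (`--tessdata-dir` and `--psm`).

```python
def build_config(tessdata_dir=None, psm=None):
    """Assemble a config string for pytesseract's `config` argument."""
    parts = []
    if tessdata_dir:
        # Double quotes keep a directory path containing spaces in one piece.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    if psm is not None:
        parts.append(f"--psm {psm}")
    return " ".join(parts)


print(build_config(r"C:\Program Files (x86)\Tesseract-OCR\tessdata", psm=6))
```

The result can then be passed as `config=build_config(...)` to `pytesseract.image_to_string`.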

Functions

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING)

  • image Object, PIL Image/NumPy array of the image to be processed by Tesseract
  • lang String, Tesseract language code string
  • config String, Any additional configurations as a string, ex: config='--psm 6'
  • nice Integer, modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.
  • output_type Class attribute, specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of pytesseract.Output class.
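
As an illustration of a non-default `output_type`, the sketch below regroups words into text lines from a dictionary shaped like the result of `output_type=Output.DICT`, that is, one list per column. The sample dictionary is hand-written and limited to the columns the code reads, and `lines_from_data` is a hypothetical helper, so treat this as an assumption-laden example rather than a reference.

```python
from collections import defaultdict


def lines_from_data(data):
    """Join the words of an Output.DICT-style result into text lines."""
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if str(text).strip():
            # Words sharing page, block, paragraph and line belong together.
            key = (data["page_num"][i], data["block_num"][i],
                   data["par_num"][i], data["line_num"][i])
            lines[key].append(str(text))
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hand-written sample, not real OCR output.
sample_data = {
    "page_num": [1, 1, 1, 1],
    "block_num": [0, 1, 1, 1],
    "par_num": [0, 1, 1, 1],
    "line_num": [0, 1, 1, 2],
    "text": ["", "Hello", "world", "again"],
}
print(lines_from_data(sample_data))
```

In real use the dictionary would come from a call such as `pytesseract.image_to_data(Image.open('test.png'), output_type=pytesseract.Output.DICT)`.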