Building an Optical Character Recognition (OCR) Engine in Python

Learn how to use the Tesseract OCR library and pytesseract wrapper for optical character recognition (OCR) to convert text in images into digital text in Python.

Humans can easily understand the text content of an image just by looking at it. However, this is not the case for computers. They need some kind of structured approach or algorithm to understand it. This is where Optical Character Recognition (OCR) comes in.

Optical character recognition is the process of detecting the text content of an image and converting it into machine-encoded text, which we can access and manipulate as a string in Python (or any other programming language). In this tutorial, we’ll use the Tesseract library to do just that.

The Tesseract library contains an OCR engine and a command-line program. It is independent of Python, so follow the official guide to install it; it is a must-have tool for this tutorial.

We’ll be using Python’s pytesseract module, which is a wrapper around the Tesseract-OCR engine, so we can access it via Python.

Since version 4, Tesseract has used a recurrent neural network (LSTM)-based OCR engine that focuses on line recognition.

Let’s get started. You need to install:

Tesseract-OCR engine (follow the installation guide for your operating system).
pytesseract wrapper module: pip3 install pytesseract
Other useful modules for this tutorial: pip3 install numpy matplotlib opencv-python pillow
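
Before moving on, it can help to confirm that the packages are importable. The sketch below is only a convenience check, and the helper name missing_modules is made up for this example. Note that the import names differ from the pip names: opencv-python is imported as cv2 and pillow as PIL.

```python
import importlib.util

# Import names (not pip package names) used in this tutorial
REQUIRED = ["pytesseract", "cv2", "matplotlib", "PIL", "numpy"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

This only checks the Python packages. The Tesseract-OCR engine itself is a separate program and has to be installed on its own.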

Once everything is installed on your machine, open a new Python file and follow these steps:

import pytesseract
import cv2
import matplotlib.pyplot as plt
from PIL import Image

For demonstration purposes, I’m going to use this image for recognition:

I’ll name it “test.png” and put it in the current directory. Let’s load this image:

# read the image using OpenCV
image = cv2.imread("test.png")
# or you can use Pillow
# image = Image.open("test.png")

You may notice that you can load images using either OpenCV or Pillow. I prefer OpenCV because it also lets us read frames from a live camera.

Let’s recognize this text:

# get the string
string = pytesseract.image_to_string(image)
# print it
print(string)

Note: If the above code throws an error, consider adding the Tesseract-OCR binary to the PATH variable. Read their official installation guide more closely.
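
If you are not sure whether the binary is on PATH, a quick check like the one below may help. This is a minimal sketch: the helper names find_tesseract and configure_pytesseract are invented for this example, and the Windows path in the comment is only a common default that may not match your machine. Setting pytesseract.pytesseract.tesseract_cmd is the wrapper's way of pointing at a binary outside PATH.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract binary if it is on PATH, else None."""
    return shutil.which("tesseract")


def configure_pytesseract(binary_path):
    """Tell pytesseract where the Tesseract binary lives (for when it is not on PATH)."""
    import pytesseract  # imported here so the PATH check works without it

    pytesseract.pytesseract.tesseract_cmd = binary_path


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
    # Example only: adjust to wherever Tesseract is installed on your machine
    # configure_pytesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
else:
    print("tesseract found at:", path)
```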

The image_to_string() function does exactly what you would expect: it converts the text contained in the image into characters. Let’s see the result:

This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

Great! There’s also an image_to_data() function that outputs more information, including the width, height, and x, y coordinates of each word, which lets us do a lot of useful things. For example, let’s search for a word in the document and draw a bounding box around each occurrence of it. The following code sets that up:

# make a copy of this image to draw in
image_copy = image.copy()
# the target word to search for
target_word = "dog"
# get all data from the image
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

We’re going to search for the word “dog” in the text document. We want the output to be structured data rather than a raw string, which is why I’m passing output_type as a dictionary, so that we can easily get the data for each word (you can print the data dictionary to see how the output is organized).
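
To give a rough idea of that organization: each key maps to a list, and index i in one list describes the same detected element as index i in the others. The sample below is hand-written with invented numbers so that it runs without Tesseract, and iter_words is a hypothetical helper. Converting conf with float() is a defensive assumption, since the type of that field may differ between versions.

```python
# Hand-written sample mimicking the layout of
# image_to_data(..., output_type=Output.DICT). The numbers are invented;
# real output has more keys (level, page_num, block_num, par_num, line_num, word_num).
sample_data = {
    "text":   ["", "The", "quick", "brown", "dog"],
    "conf":   ["-1", "96", "95", "91", "93"],
    "left":   [0, 36, 92, 170, 250],
    "top":    [0, 126, 126, 126, 126],
    "width":  [0, 44, 66, 70, 40],
    "height": [0, 18, 22, 18, 22],
}


def iter_words(data, min_conf=0.0):
    """Yield (text, left, top, width, height) for each non-empty word entry."""
    for i, text in enumerate(data["text"]):
        # A confidence of -1 marks entries that are not words (blocks, lines, ...)
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            yield text, data["left"][i], data["top"][i], data["width"][i], data["height"][i]


for word in iter_words(sample_data, min_conf=92):
    print(word)
```

Filtering on confidence like this is optional, but it can help discard low-quality detections before searching for a word.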

Let’s get all the occurrences of the word:

# get all occurrences of that word
word_occurences = [ i for i, word in enumerate(data["text"]) if word.lower() == target_word ]
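
One caveat with an exact comparison is that Tesseract returns tokens with their punctuation attached, so a word like “fox.” at the end of a sentence would not match “fox”. A possible variation, with find_word as a made-up helper name, strips surrounding punctuation before comparing:

```python
import string


def find_word(words, target):
    """Return indices of entries equal to target, ignoring case and surrounding punctuation."""
    target = target.lower()
    return [
        i for i, word in enumerate(words)
        if word.strip(string.punctuation).lower() == target
    ]


# Words as they might appear in data["text"] for the test image
words = ["over", "the", "lazy", "fox.", "The", "quick", "brown", "dog"]
print(find_word(words, "fox"))  # [3]
print(find_word(words, "the"))  # [1, 4]
```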

Now let’s draw a bounding box on each word:

for occ in word_occurences:
    # extract the width, height, top and left position for that detected word
    w = data["width"][occ]
    h = data["height"][occ]
    l = data["left"][occ]
    t = data["top"][occ]
    # define all the surrounding box points
    p1 = (l, t)
    p2 = (l + w, t)
    p3 = (l + w, t + h)
    p4 = (l, t + h)
    # draw the 4 lines (rectangle)
    image_copy = cv2.line(image_copy, p1, p2, color=(255, 0, 0), thickness=2)
    image_copy = cv2.line(image_copy, p2, p3, color=(255, 0, 0), thickness=2)
    image_copy = cv2.line(image_copy, p3, p4, color=(255, 0, 0), thickness=2)
    image_copy = cv2.line(image_copy, p4, p1, color=(255, 0, 0), thickness=2)
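
As an alternative to four separate lines, OpenCV’s cv2.rectangle() draws the same box from its top-left and bottom-right corners. The sketch below assumes the same data dictionary layout as above; word_box and draw_word_boxes are hypothetical helper names, not part of pytesseract or OpenCV.

```python
def word_box(data, index):
    """Return ((left, top), (right, bottom)) for the word at the given index."""
    left, top = data["left"][index], data["top"][index]
    right = left + data["width"][index]
    bottom = top + data["height"][index]
    return (left, top), (right, bottom)


def draw_word_boxes(image, data, indices, color=(255, 0, 0), thickness=2):
    """Draw one rectangle per index on a copy of the image and return the copy."""
    import cv2  # imported lazily so word_box can be used without OpenCV

    output = image.copy()
    for index in indices:
        top_left, bottom_right = word_box(data, index)
        cv2.rectangle(output, top_left, bottom_right, color, thickness)
    return output
```

With the variables from this tutorial, the call would look like image_copy = draw_word_boxes(image, data, word_occurences).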

To save and display the result image:

plt.imsave("all_dog_words.png", image_copy)
plt.imshow(image_copy)
plt.show()
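
One thing to keep in mind: OpenCV loads images in BGR channel order, while matplotlib expects RGB, so on a color image the saved and displayed result would have red and blue swapped. A minimal sketch of a fix is to reverse the channel axis before handing the array to matplotlib. The helper name bgr_to_rgb is invented here; cv2.cvtColor(image_copy, cv2.COLOR_BGR2RGB) does the same conversion.

```python
def bgr_to_rgb(image):
    """Reverse the channel axis of an H x W x 3 array (BGR -> RGB)."""
    return image[:, :, ::-1]


# With the variables from this tutorial, the calls would become:
# plt.imsave("all_dog_words.png", bgr_to_rgb(image_copy))
# plt.imshow(bgr_to_rgb(image_copy))
# plt.show()
```

With the conversion in place, the (255, 0, 0) boxes drawn earlier show up as blue, which is what that color means in BGR. Use (0, 0, 255) when drawing if you want red.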

Take a look at the result:

Amazing, isn’t it? And that’s not all! You can pass the lang parameter to image_to_string() or image_to_data() to recognize text in different languages. You can also use the image_to_boxes() function to get individual characters and their bounding boxes. See the official documentation and the list of available languages for more information.
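
As a rough illustration of working with image_to_boxes(): it returns plain text with one character per line, in the form “char left bottom right top page”, with y measured from the bottom of the image. The sketch below parses such text and flips the y values to the top-left origin that OpenCV uses. The function name parse_boxes and the sample string are invented for this example.

```python
def parse_boxes(boxes_text, image_height):
    """Parse image_to_boxes() output into (char, left, top, right, bottom) tuples.

    Tesseract's box format measures y from the bottom of the image, so the
    values are flipped here to a top-left origin.
    """
    boxes = []
    for line in boxes_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(value) for value in parts[1:5])
        boxes.append((char, left, image_height - top, right, image_height - bottom))
    return boxes


# Invented sample in the same layout: char left bottom right top page
sample = "T 36 92 53 110 0\nh 55 92 64 110 0"
print(parse_boxes(sample, image_height=200))
# [('T', 36, 90, 53, 108), ('h', 55, 90, 64, 108)]
```

With a real image loaded through OpenCV, the inputs would be pytesseract.image_to_boxes(image) and image.shape[0]. For another language, a call such as pytesseract.image_to_string(image, lang="fra") works only if the matching language data is installed for Tesseract.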

Note that this method works best for recognizing text in scanned documents and papers. Other uses of OCR include reading passports and automatically extracting information from them, data entry processes, license plate detection and recognition, and more!

Also, this approach is not well suited to handwritten text, complex real-world images, or images that are unclear or contain a lot of text.

Okay, that’s it for this tutorial. Let’s see what you can build with this utility!

Thao Nguyen