OCR using PyTesseract (Python)

Introduction
Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it recognizes and “reads” the text embedded in images.
PyTesseract uses Google’s Tesseract OCR Engine. In this article, we develop a program to extract text from an image.
Environment Setup
Let’s use Google Colab for testing. We need to install both the Tesseract OCR engine (tesseract-ocr) and the pytesseract Python package:
!sudo apt install tesseract-ocr
!pip install pytesseract
Let's mount Google Drive to Colab, so we can use files from GDrive.
from google.colab import drive
drive.mount('/content/drive')
Reading Image
The image is read using OpenCV.

Let's run OCR on this image to see the initial results.
Their style of cooking goes with the look of their kitchen: straight, open and sin- cere, They use many fresh products and the results are both appetising and ap- pealing. “We really respect the top qual-
The complete text could not be extracted, so let's do some preprocessing before passing the image to Tesseract.
Preprocessing
Skew Correction
We need to make sure the text is straight, and rotate the image when it is not.
Let's use the same image used before but shot at an angle and run OCR on it.

pytesseract is unable to extract any text, so no result is printed.
After the skew is corrected, OCR returns the following:
R, a customised stainless steel ventilation sys Zu- tem in the ceiling, Ieensures that the air isof is clean and che climate is right, ‘Its like ions, working in a show kitchen, says Ivo pain Berger, the kitchen chef of PUR and ords holder of 15 Gault Millau points. The even restaurant currently employs eleven Ei chefs from three different nations. But ca that’s nor all, Three of them are winners ge ight of the Culinary World Cup. ed Tt @, ‘Their style of cooking goes with the look te
Grayscale
OpenCV reads images in BGR order by default. Here the image is converted to grayscale.

‘Their style of cooking goes with the look of their kitchen: straight, open and sin- cere, They use many fresh products and the results are both appetising and ap- pealing. “We really respect the top qual-
Binarization
Binarization converts the image to black and white, or at least it tries to. Each pixel is set to either 0 or 255, depending on whether it exceeds a threshold we supply.
Here we use OpenCV's simple thresholding (cv2.threshold).

R, a customised stainless steel ventilation sys- Za tem in the ceiling, Ieensures that the air tsof isclean and che climate is right, ‘It’s like ions, working in a show kitchen, says Ivo pain Berger, the kitchen chef of PUR and ords holder of 15 Gault Millau points. The even restaurant currently employs eleven Ei chefs from three different nations. Bur ca thar's not all, Three of them are winners ge ight of the Culinary World Cup. ed Tt
@, ‘Their style of cooking goes with the look re in of their kitchen: straight, open and sin- pr
cere. They use many fresh products and Rt the results are both appetising and ap- in pealing. ‘We really respect , Vi aa
Summary
We can add more preprocessing functions such as noise removal, dilation, erosion, and Canny edge detection. In this use case, however, they did not improve the result.
Here is the complete code.