Friday

Extract Numbers from Images by OCR

 

Photo by Claudio Schwarz on Unsplash

Almost every business organization needs to process unstructured data, especially images. Invoices, for example, often have to be read manually so that their data can be entered into a structured form such as a database or spreadsheet.

You can automate this with Tesseract OCR. Other OCR libraries and APIs can also read text from images. Here are a few alternatives:

1. **Google Cloud Vision API**: Google Cloud Vision API provides powerful OCR capabilities. You can send image data to the API and receive text extraction results. It supports various languages and provides options for advanced OCR features such as document layout analysis. You’ll need to sign up for the Google Cloud platform and set up the Vision API to use this service.

2. **Microsoft Azure Computer Vision API**: Microsoft Azure offers the Computer Vision API, which includes OCR functionality. It allows you to extract text from images, supports multiple languages, and provides options for advanced OCR features. You’ll need to sign up for Microsoft Azure and create an API key to access the Computer Vision API.

3. **PyOCR**: PyOCR is a Python library that provides a simple, consistent interface to several OCR engines, including Tesseract (through its command-line tool or through libtesseract) and Cuneiform. It lets you extract text from images with whichever supported engine is installed. You can install PyOCR using pip (`pip install pyocr`) and choose the OCR engine you want to use.

4. **Amazon Textract**: Amazon Textract is a fully managed OCR service provided by Amazon Web Services (AWS). It enables you to extract text and data from images and PDF documents. You can integrate Textract into your applications using the AWS SDKs or command-line interface (CLI). You’ll need to sign up for AWS and set up the Textract service to use this OCR solution.

These are just a few alternatives to Tesseract OCR. Each option has its own features, advantages, and usage requirements. You can explore these options and choose the one that best suits your needs and constraints.

However, we are going to keep our application free of cloud tool costs by using open-source libraries and Python.

The steps for this demo project are as follows:

Download any invoice image from the internet for the demo. I used this one:

Next, we need the coordinates of each piece of data we want to extract and read. For this demo we will read only three numbers from the invoice image.

To find the position of those three numbers in the invoice image, go to https://pixspy.com/ (or use any similar tool available on your laptop).

We need the x1, y1, x2, y2 position values for each of those numbers: (x1, y1) is the top-left corner of the region around a number and (x2, y2) is its bottom-right corner. Hover your pointer just outside each corner, leaving a small gap around the number, and write down the coordinates shown.
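The small gap around each number can also be added in code rather than by eye. The sketch below is only an illustration: the `Roi` type, the `pad_and_clamp` helper, the 3-pixel margin, and the image height of 1061 are assumptions made for the example, not part of the demo script.

```python
from typing import NamedTuple


class Roi(NamedTuple):
    """A rectangle: (x1, y1) is the top-left corner, (x2, y2) the bottom-right."""
    x1: int
    y1: int
    x2: int
    y2: int


def pad_and_clamp(roi: Roi, image_width: int, image_height: int, margin: int = 3) -> Roi:
    """Grow the box by `margin` pixels on every side, then keep it inside the image."""
    if roi.x2 <= roi.x1 or roi.y2 <= roi.y1:
        raise ValueError(f"Bottom-right must be below and to the right of top-left: {roi}")
    return Roi(
        x1=max(0, roi.x1 - margin),
        y1=max(0, roi.y1 - margin),
        x2=min(image_width, roi.x2 + margin),
        y2=min(image_height, roi.y2 + margin),
    )


# First set of coordinates from this demo; the image size here is assumed.
first_number = Roi(640, 385, 690, 400)
print(pad_and_clamp(first_number, image_width=750, image_height=1061))
```

With a helper like this you can write down tight coordinates and tune the margin later if the OCR clips a digit.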

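Now install the Python libraries the application needs, including pytesseract and opencv, if they are not already installed on your system. For this example, you need pytesseract, OpenCV (the `opencv-python` package), and matplotlib, plus the Tesseract OCR engine itself, which pytesseract calls.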
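A quick way to confirm the setup before running the script is sketched below. It uses only the standard library; the `missing_requirements` helper and the package table are illustrative names, and the pip package names are the usual ones for the modules the script imports.

```python
import importlib.util
import shutil

# Import names used by the script, mapped to the pip package that provides each.
REQUIRED_PACKAGES = {
    "cv2": "opencv-python",
    "pytesseract": "pytesseract",
    "matplotlib": "matplotlib",
}


def missing_requirements() -> list:
    """Return a hint for every requirement that still needs installing."""
    hints = []
    for module_name, pip_name in REQUIRED_PACKAGES.items():
        if importlib.util.find_spec(module_name) is None:
            hints.append(f"pip install {pip_name}")
    # pytesseract is only a wrapper; it runs the separate `tesseract` executable.
    if shutil.which("tesseract") is None:
        hints.append("install the Tesseract OCR engine and add it to your PATH")
    return hints


problems = missing_requirements()
print("Everything is installed." if not problems else "\n".join(problems))
```

If the Tesseract executable lives outside your PATH, pytesseract also lets you point to it by setting `pytesseract.pytesseract.tesseract_cmd`.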

We will use image processing techniques along with Optical Character Recognition (OCR) to recognize and extract text or other relevant information.

Here is a general outline of the steps involved in extracting data from specific spaces in an image:

Load and preprocess the image: Use a suitable library, such as OpenCV or PIL, to load the image. Preprocess the image as necessary, which may involve resizing, cropping, or enhancing the image to improve text recognition.

Identify and localize the specific spaces: Use image processing techniques to locate and isolate the regions of interest (ROIs) that contain the data you want to extract. This may involve techniques such as edge detection, contour detection, or template matching, depending on the specific characteristics of the spaces you want to extract data from.

Perform OCR on the ROIs: Apply OCR algorithms to the localized ROIs to recognize and extract the text or relevant information. Tesseract is a popular open-source OCR engine that you can use in combination with Python libraries like pytesseract to extract text from images.

Post-process the extracted data: Once you have extracted the text from the ROIs, you can perform additional post-processing steps to clean, validate, or format the extracted data as per your requirements.
You can then save the extracted data to a database (that step is not included in this demo script).
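As a rough idea of what the validation and database step could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, `save_amounts` helper, and the sample values are all made up for illustration; the demo script itself does not store anything.

```python
import sqlite3
from decimal import Decimal, InvalidOperation


def save_amounts(connection, invoice_name, extracted_values):
    """Validate each extracted string as a number and insert the valid ones.

    Returns the number of rows written.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS invoice_amounts ("
        "invoice TEXT NOT NULL, roi_index INTEGER NOT NULL, amount TEXT NOT NULL)"
    )
    saved = 0
    for roi_index, text in enumerate(extracted_values):
        try:
            amount = Decimal(text.strip())
        except InvalidOperation:
            continue  # OCR returned nothing usable for this ROI, so skip it
        if not amount.is_finite():
            continue  # reject strings such as "NaN" that Decimal would accept
        # Store the amount as text so the exact decimal digits are preserved.
        connection.execute(
            "INSERT INTO invoice_amounts VALUES (?, ?, ?)",
            (invoice_name, roi_index, str(amount)),
        )
        saved += 1
    connection.commit()
    return saved


# In-memory database and invented sample values, just to show the flow.
connection = sqlite3.connect(":memory:")
print(save_amounts(connection, "invoice-template-us-neat-750px.png", ["145.00", "", "1595.00"]))
print(connection.execute("SELECT roi_index, amount FROM invoice_amounts").fetchall())
```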

Keep in mind that the specific implementation may vary depending on the nature of the images and the spaces you are working with. It’s important to experiment with different image processing techniques, OCR settings, and post-processing steps to achieve accurate and reliable extraction of data from the specific spaces in your images.
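One simple way to experiment with OCR settings is to generate a few Tesseract configuration strings and compare the results on the same ROI. The helper below is a hypothetical convenience function, not part of pytesseract. It only builds the string; `--psm` and the `tessedit_char_whitelist` variable are standard Tesseract options, though how well the whitelist works can depend on your Tesseract version and engine mode.

```python
def build_tesseract_config(psm: int = 7, whitelist: str = "") -> str:
    """Build a config string for pytesseract's `config` argument.

    psm is Tesseract's page segmentation mode (0 to 13); whitelist, if given,
    limits recognition to the listed characters.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Modes worth trying for a small crop: 6 = uniform block, 7 = single line,
# 8 = single word, 13 = raw line.
for psm in (6, 7, 8, 13):
    print(build_tesseract_config(psm, whitelist="0123456789.,"))
```

Each string can be passed as `config=...` to `pytesseract.image_to_string`, so you can see which mode reads your numbers most reliably.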

Once the libraries are installed and the coordinates are collected, we can write the code. You can use any IDE or a Jupyter notebook for this.

import re

import cv2
import matplotlib.pyplot as plt
import pytesseract

# Load the image; change the name and path to match your invoice image
image = cv2.imread('invoice-template-us-neat-750px.png')
if image is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError('Could not read the invoice image; check the name and path')


def read_character(roi_coordinates):
    """
    Crop each ROI from the image, run OCR on it, and print the numbers found.
    """
    # Iterate over the ROIs
    for x1, y1, x2, y2 in roi_coordinates:
        # Ensure the coordinates are within the image dimensions
        x1 = max(0, x1)
        y1 = max(0, y1)
        x2 = min(image.shape[1], x2)
        y2 = min(image.shape[0], y2)

        # Crop the ROI from the image (rows are y, columns are x)
        roi = image[y1:y2, x1:x2]

        # Convert the ROI to grayscale
        gray_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)

        # Display the grayscale ROI
        plt.figure()
        plt.imshow(cv2.cvtColor(gray_roi, cv2.COLOR_GRAY2RGB))
        plt.axis('off')
        plt.show()

        # Apply image preprocessing if required
        # Example 1: Thresholding (Otsu picks the threshold automatically)
        _, thresholded_roi = cv2.threshold(gray_roi, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

        # Example 2: Denoising (using a bilateral filter)
        denoised_roi = cv2.bilateralFilter(thresholded_roi, 9, 75, 75)

        # Perform OCR on the preprocessed ROI using pytesseract.
        # Page segmentation mode 7 treats the image as a single line of text.
        extracted_text = pytesseract.image_to_string(denoised_roi, config='--psm 7')

        # Extract the digit groups from the extracted text
        numbers = re.findall(r'\d+', extracted_text)

        # Display the extracted numbers; joining with '.' puts a decimal
        # such as 100.00 back together after the regex split it into groups
        if numbers:
            print(f"Numbers from ROI: {'.'.join(numbers)}")
        else:
            print("No numbers found in ROI")


# Change the coordinate values to match the location of your numbers
# in the image
image_parts = [
    {'x1': 640, 'y1': 385, 'x2': 690, 'y2': 400},
    {'x1': 640, 'y1': 410, 'x2': 690, 'y2': 435},
    {'x1': 640, 'y1': 448, 'x2': 690, 'y2': 470},
]

# Define the regions of interest (ROIs) where you want to extract data
# Format: (top-left x, top-left y, bottom-right x, bottom-right y)
roi_coordinates = [
    (part['x1'], part['y1'], part['x2'], part['y2'])
    for part in image_parts
    # Add more entries to image_parts as needed
]
read_character(roi_coordinates)
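One limitation of matching `\d+` and joining the groups with a dot is that an amount with a thousands separator, such as 1,595.00, comes out as 1.595.00. The following is a possible alternative, offered as a sketch rather than a replacement for the script above: the pattern and the `extract_amounts` name are my own, and it assumes US-style formatting (comma for thousands, dot for decimals).

```python
import re
from decimal import Decimal

# Either a comma-grouped number (1,595.00) or a plain one (145.00 or 42).
NUMBER_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def extract_amounts(ocr_text: str) -> list:
    """Return every number found in the OCR text as a Decimal."""
    return [
        Decimal(match.replace(",", ""))  # drop thousands separators before parsing
        for match in NUMBER_PATTERN.findall(ocr_text)
    ]


# Invented sample strings standing in for pytesseract output.
for sample in ("145.00\n", "Total: $1,595.00", "no digits here"):
    print(repr(sample), "->", extract_amounts(sample))
```

Using `Decimal` instead of `float` keeps the cents exact, which matters if the values are later summed or compared.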

The output should look like this:

You can also use other techniques to optimize this code and process.

If you like template code for AI/ML and other microservices, please visit my personal GitHub repo at https://github.com/dhirajpatra.

I hope this helps. Thank you.
