OCR PDF Python

Scanned PDF documents are hard to work with because they contain page images rather than searchable or editable text. Optical Character Recognition (OCR) makes it possible to extract that text and convert the scans into searchable or editable documents. In this blog post, you will learn how to recognize text in scanned PDF files and convert them into searchable PDFs in Python using the Aspose.OCR for Python via .NET library.

Recognize Text from Scanned PDF with OCR – Python API Installation

Optical Character Recognition (OCR) is a technology that converts images or scanned documents into machine-readable text. OCR algorithms analyze the shapes and patterns of characters in an image to identify the text so that it can be extracted and processed. Before getting started, install Aspose.OCR for Python via .NET, either by downloading it from the New Releases page or from PyPI by running the command below:

pip install aspose-ocr-python-net
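After installing, it can help to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch that relies only on the standard library; the helper name is_module_available is hypothetical, and the module name aspose.ocr is taken from the import statement used in the samples in this post.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if Python can locate `module_name` in this environment."""
    try:
        # For a dotted name such as "aspose.ocr", find_spec imports the
        # parent package first and then looks up the submodule.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError,
        # which is a subclass of ImportError.
        return False


if __name__ == "__main__":
    if is_module_available("aspose.ocr"):
        print("aspose.ocr is importable")
    else:
        print("aspose.ocr not found; run: pip install aspose-ocr-python-net")
```

If the check fails right after a successful install, the usual cause is that pip installed into a different Python environment than the one running the script.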

Recognize Text from PDF with OCR in Python

You can extract text from a scanned PDF document with OCR in Python by following the steps below:

  1. Instantiate an object of AsposeOcr class.
  2. Load the scanned PDF file.
  3. Recognize text with OCR and print the output to the console.

The sample code below shows how to recognize text from PDF with OCR in Python:

import aspose.ocr as ocr

# Initialize an object of AsposeOcr class
api = ocr.AsposeOcr()

# Load the scanned PDF file
# (named ocr_input to avoid shadowing Python's built-in input function)
ocr_input = ocr.OcrInput(ocr.InputType.PDF)
ocr_input.add("source.pdf")

# Recognize text with OCR
result = api.recognize(ocr_input)

# Print the text of the first result to the console
print(result[0].recognition_text)
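The sample above prints only result[0]. For a multi-page PDF you will usually want the text of every entry. The sketch below assumes that the result can be iterated like a Python list and that each entry exposes a recognition_text attribute, as result[0] does in the sample; whether entries map one-to-one to pages is also an assumption. The helper name join_page_text is hypothetical, and stand-in objects are used so that the snippet runs without the library installed.

```python
from types import SimpleNamespace


def join_page_text(pages, separator="\n\n"):
    """Concatenate the recognition_text of every result entry, in order."""
    return separator.join(page.recognition_text for page in pages)


# Stand-in objects that mimic entries with a recognition_text attribute.
# With the real library, pass the value returned by api.recognize(...) instead.
fake_result = [
    SimpleNamespace(recognition_text="Page one text"),
    SimpleNamespace(recognition_text="Page two text"),
]

combined = join_page_text(fake_result)
print(combined)
```

Keeping the joining logic in a small function makes it easy to change the separator, for example to a form feed character, if you need to tell pages apart later.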

Convert Scanned PDF to Searchable or Editable PDF with OCR in Python

Scanned PDF files contain page images, so their text cannot be searched. Converting such a file to a searchable PDF makes it machine-readable and ready for further processing. Follow the steps below to convert a scanned PDF to a searchable or editable PDF document in Python:

  1. Create an object of AsposeOcr class.
  2. Initialize a RecognitionSettings class instance and set the required properties.
  3. Load the PDF file and set the page range for recognition with OCR.
  4. Save the output searchable PDF file.

The following sample code shows how to convert a scanned PDF to a searchable PDF document with OCR in Python:

import aspose.ocr as ocr

api = ocr.AsposeOcr()

# Initialize RecognitionSettings
settings = ocr.RecognitionSettings()
settings.auto_denoising = True
settings.auto_contrast = True

# Specify the PDF document as input
# (named ocr_input to avoid shadowing Python's built-in input function)
ocr_input = ocr.OcrInput(ocr.InputType.PDF)

# Add the scanned PDF with the zero-based index of the first page
# to process (0) and the number of pages to process (1)
ocr_input.add("source.pdf", 0, 1)

# Process the PDF file for text recognition with OCR
result = api.recognize(ocr_input, settings)

# Save the searchable output PDF file
api.save_multipage_document("searchable.pdf", ocr.SaveFormat.PDF, result)

Note that you can OCR any range of pages in the PDF document. When adding the file, the second argument is the zero-based index of the first page to process, and the last argument is the number of pages to process. You can also adjust the recognition settings to preprocess the source file, for example by removing noise, adjusting contrast, or checking for skewed pages, to improve recognition accuracy.
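Page ranges are usually described with 1-based page numbers ("pages 3 to 5"), while the call in the sample takes a zero-based start index and a page count. The following sketch shows that conversion in plain Python. The helper name to_start_and_count is hypothetical and not part of the library.

```python
from typing import Tuple


def to_start_and_count(first_page: int, last_page: int) -> Tuple[int, int]:
    """Convert an inclusive 1-based page range to (zero-based start, count)."""
    if first_page < 1:
        raise ValueError("first_page must be 1 or greater")
    if last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    start_index = first_page - 1
    page_count = last_page - first_page + 1
    return start_index, page_count


# Pages 3 to 5 of a document start at index 2 and span 3 pages.
start, count = to_start_and_count(3, 5)
print(start, count)

# With the library installed, the values could then be passed along the
# lines of the sample above (an assumption based on that sample):
#   ocr_input.add("source.pdf", start, count)
```

Validating the range before calling the API surfaces off-by-one mistakes early, which matters because a wrong start index would silently process the wrong pages.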

Get a Free Evaluation License

You may request a free temporary license to evaluate the API without any evaluation limitations.

Summing Up

With OCR and Python, extracting text from scanned PDFs and converting them into searchable or editable formats is straightforward. In this post, we covered installing the library, recognizing text in a scanned PDF, and converting a scanned PDF to a searchable PDF using preprocessing settings and a page range. If you have any questions, please reach out to us via the free support forum.
