PDF to Word OCR Python

Converting scanned PDFs to Word documents makes their text editable, so changes and updates are easy to apply. It also makes the text searchable, which is invaluable for large documents or when conducting research. You can additionally run spell-checking on the recognized text to correct typos and misspelled words while performing OCR in Python. This article explains how to convert a scanned PDF to a Word document with OCR in Python using the Aspose.OCR for Python via .NET library.

PDF to Word with OCR – Python API Installation

Before we dive into text recognition, let’s make sure the environment needed to run OCR in Python is set up. You need Python 3 installed on your system, along with a code editor or integrated development environment (IDE) such as Visual Studio Code or IDLE. Then install Aspose.OCR for Python via .NET, either by downloading it from the New Releases section or from PyPI with the following installation command:

pip install aspose-ocr-python-net
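
After running the command, you may want to confirm that the package is visible to the interpreter you plan to use. The snippet below is a minimal sketch that relies only on the standard library (`importlib.metadata`, available in Python 3.8 and later); `installed_version` is a hypothetical helper name chosen for illustration, not part of Aspose.OCR.

```python
import sys
from importlib import metadata


def installed_version(dist_name="aspose-ocr-python-net"):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the interpreter version and whether the package was found
    print("Python", sys.version.split()[0])
    version = installed_version()
    if version is None:
        print("aspose-ocr-python-net is not installed for this interpreter")
    else:
        print("aspose-ocr-python-net", version)
```

If the helper returns None, the usual cause is that pip installed the package into a different Python environment than the one running the script.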

Convert Scanned PDF to Word with OCR in Python

You can convert a scanned PDF to Word with OCR by following the steps below:

  1. Initialize the API using the AsposeOcr class.
  2. Configure the recognition options with the RecognitionSettings class.
  3. Load the scanned PDF into an OcrInput object.
  4. Recognize the text with OCR and save the output DOCX Word file.

The following code snippet demonstrates how to convert scanned PDF to Word with OCR in Python:

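import aspose.ocr as ocr

# Folder that contains the source PDF; adjust to your environment
path = ""

# Initialize the OCR engine
api = ocr.AsposeOcr()

# Initialize RecognitionSettings with automatic denoising and contrast correction
settings = ocr.RecognitionSettings()
settings.auto_denoising = True
settings.auto_contrast = True

# Load the PDF: start at the first page (index 0) and process 1 page
ocr_input = ocr.OcrInput(ocr.InputType.PDF)
ocr_input.add(path + "source.pdf", 0, 1)

# Recognize the text
result = api.recognize(ocr_input, settings)

# Save the recognized pages as a DOCX Word document
api.save_multipage_document("searchable.docx", ocr.SaveFormat.DOCX, result)

# Print the recognized text of the first page
print(result[0].recognition_text)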
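
The sample above processes a single page. For longer documents, one option is to recognize the PDF in smaller page ranges. The sketch below assumes that the two integers passed to `add` are a zero-based start page and a page count; `page_batches` is a hypothetical helper, and the total page count has to come from your own knowledge of the document.

```python
def page_batches(total_pages, batch_size):
    """Return (start_page, page_count) tuples that cover total_pages pages."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, total_pages, batch_size):
        # The last batch may be shorter than batch_size
        batches.append((start, min(batch_size, total_pages - start)))
    return batches


if __name__ == "__main__":
    # A 5-page document split into batches of at most 2 pages
    print(page_batches(5, 2))  # [(0, 2), (2, 2), (4, 1)]
```

Each tuple can then be used as the last two arguments of the `add` call, with one recognition run per batch.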

PDF to Word with OCR and Spell Checking in Python

OCR engines may sometimes produce inaccuracies, especially when dealing with complex layouts, handwriting, or low-quality scans. In such cases, spell correction helps improve the accuracy of the converted text. This section covers PDF to Word conversion with OCR and the spell-checking feature in Python. Follow the steps below:

  1. Initialize an instance of the AsposeOcr class.
  2. Set the recognition properties using the RecognitionSettings class.
  3. Recognize the PDF with OCR and spell-check the extracted string.
  4. Export the output Word document in DOCX format.

The sample code below shows how to convert a PDF to a Word document with OCR and spell checking in Python:

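import aspose.ocr as ocr

# Folder that contains the source PDF; adjust to your environment
path = ""

# Initialize the OCR engine
api = ocr.AsposeOcr()

# Initialize RecognitionSettings with automatic denoising and contrast correction
settings = ocr.RecognitionSettings()
settings.auto_denoising = True
settings.auto_contrast = True

# Load the PDF: start at the first page (index 0) and process 1 page
ocr_input = ocr.OcrInput(ocr.InputType.PDF)
ocr_input.add(path + "Spell Check OCR PDF.pdf", 0, 1)

# Recognize the text
result = api.recognize(ocr_input, settings)

# Spell-check the recognized text of the first page using the English dictionary
corrected = api.correct_spelling(result[0].recognition_text, ocr.spellchecker.SpellCheckLanguage.ENG, None)
# Print the text after spell correction
print(corrected)

# Save the first recognized page as DOCX with spell correction applied
result[0].save("test.docx", ocr.SaveFormat.DOCX, True, ocr.spellchecker.SpellCheckLanguage.ENG, None)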
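
To see what the spell checker actually changed, it can help to compare the raw and the corrected text word by word. The following is a minimal sketch that uses only Python's standard `difflib` module; `changed_words` is a hypothetical helper, not part of Aspose.OCR, and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def changed_words(original, corrected):
    """Return (before, after) pairs for every run of words that differs."""
    before = original.split()
    after = corrected.split()
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing run back into a readable string
            changes.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return changes


if __name__ == "__main__":
    # Illustrative strings standing in for raw and corrected OCR output
    print(changed_words("Ths is a tset", "This is a test"))
    # [('Ths', 'This'), ('tset', 'test')]
```

In the sample above, you could pass the recognized text of the first page as `original` and the spell-checked string as `corrected` to review the corrections before relying on the saved DOCX file.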

Get a Free Evaluation License

You can get a free temporary license to avoid any evaluation limitations and watermarks.

Summing Up

In this blog post, we explored how to convert scanned PDFs to Word documents using OCR in Python. We covered the benefits of OCR, set up the environment, recognized the text of a PDF document with configurable settings and optional spell checking, and saved the result to a Word document. With this guide, you can automate the conversion of scanned PDFs to editable Word documents using Python and OCR. If you have any questions, please feel free to write to us at the free support forum.
