Convert Scanned PDF to Word Using OCR in Java

PDF files created with a camera or scanner contain scanned page images rather than text. You cannot select or edit the text in such images, so you may need to convert a scanned PDF to a Word document in DOCX or DOC format. This article covers how to convert a scanned PDF file to a Word file programmatically using Java.

Java API to Convert Scanned PDF to Word File

You can run OCR on scanned PDF documents with the Aspose.OCR for Java API and then generate a Word file with the Aspose.Words for Java API. Set up both APIs by downloading the JAR files from the Downloads section or by adding the following Maven configuration:

Repository:

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>http://repository.aspose.com/repo/</url>
</repository>

Dependencies:

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-ocr</artifactId>
    <version>21.11</version>
</dependency>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>21.12</version>
</dependency>

Convert Scanned PDF to Word Document Programmatically using Java

You can convert a scanned PDF file to a Word document using optical character recognition (OCR). This is a two-step process: the scanned PDF is first converted to text, and the text is then written to a Word document in DOC or DOCX format. Follow the steps below to convert a scanned PDF to a Word document:

  1. Instantiate an object of the AsposeOCRPdf class.
  2. Recognize the page images in the PDF file using a DocumentRecognitionSettings object.
  3. Store the recognized text in a String object.
  4. Initialize a new Word document with the Document class.
  5. Set the font and paragraph formatting.
  6. Finally, write the output Word document to disk as a DOCX or DOC file.

Together, these steps convert a scanned PDF file to a Word document, saved as a DOC or DOCX file, programmatically using Java.
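As a minimal, library-neutral sketch of this two-step flow, the Java code below wires a text-recognition step to a document-writing step and turns raw OCR text into paragraphs in between. The names ScannedPdfToWordSketch, PageRecognizer, WordWriter, toParagraphs and convert are hypothetical and introduced only for illustration; they are not Aspose types. The assumption is that an implementation backed by Aspose.OCR for Java would supply the recognizer and one backed by the Aspose.Words for Java Document class would supply the writer. The paragraph rule used here (blank lines separate paragraphs, wrapped lines are joined) is also an assumption, not something the APIs prescribe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Sketch of the glue between the OCR step and the Word step.
 * PageRecognizer and WordWriter are hypothetical interfaces, not Aspose types.
 */
final class ScannedPdfToWordSketch {

    /** Hypothetical step 1: returns the recognized text of each PDF page. */
    interface PageRecognizer {
        List<String> recognizePages(String pdfPath) throws Exception;
    }

    /** Hypothetical step 2: writes the paragraphs to a Word file at outputPath. */
    interface WordWriter {
        void write(List<String> paragraphs, String outputPath) throws Exception;
    }

    /**
     * Splits OCR text into paragraphs. Blank lines separate paragraphs;
     * consecutive non-blank lines are trimmed and joined with a single space.
     */
    static List<String> toParagraphs(String pageText) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // \R matches any line break, including \r\n as a single break.
        for (String line : pageText.split("\\R", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                flush(current, paragraphs);
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        flush(current, paragraphs);
        return paragraphs;
    }

    /** Moves the pending paragraph, if any, into the output list. */
    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }

    /**
     * Runs both steps: recognize every page, split the text into paragraphs,
     * and hand them to the writer. Returns the number of paragraphs written.
     */
    static int convert(String pdfPath, String outputPath,
                       PageRecognizer ocr, WordWriter writer) throws Exception {
        String lower = outputPath.toLowerCase(Locale.ROOT);
        if (!lower.endsWith(".docx") && !lower.endsWith(".doc")) {
            throw new IllegalArgumentException(
                "Output path must end in .docx or .doc: " + outputPath);
        }
        List<String> paragraphs = new ArrayList<>();
        for (String pageText : ocr.recognizePages(pdfPath)) {
            paragraphs.addAll(toParagraphs(pageText));
        }
        writer.write(paragraphs, outputPath);
        return paragraphs.size();
    }
}
```

A call such as ScannedPdfToWordSketch.convert("Sample.pdf", "Result.docx", recognizer, writer) would then run the whole flow, where recognizer and writer are your own implementations of the two interfaces. Keeping the library calls behind small interfaces is a design choice that lets the text clean-up be checked with stub data, without a license or sample scans.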

Get Free Temporary License

You can evaluate the APIs without any limitations by requesting a free temporary license.

Conclusion

In this article, you have explored how to convert a scanned PDF file to a Word document as a DOCX or DOC file programmatically using Java. You can also explore other OCR-related features in the documentation. If you have any questions, please feel free to contact us on the forum.

See Also

Info: You may be interested in another Java API, Aspose.Slides for Java, which lets you convert presentations to PDF, Word documents, and other formats, and import images or other documents into presentations.