Extract Text from PDF using Java

{{ figure align=center src=“images/Extract-Text-From-PDF.jpg” alt=“extract text from pdf java” }}

This article shows how to extract text from PDF files in Java. Text extraction supports use cases such as text analysis, information retrieval, and document parsing, and because PDF is one of the most widely used document formats, it is a common requirement. The sections below walk step by step through extracting text from a whole document, a single page, and a region of a page.

Java API to Extract Text from PDF - Free Download
Extract Text from PDF using Java
Extract Text from Specific Page in PDF
Extract Text from a Page Region in PDF

Java API to Extract Text from PDF - Free Download

Aspose.PDF for Java is a PDF manipulation API that includes classes for text extraction. You can download the API’s JAR or add it to a Maven project by placing the following repository and dependency entries inside the repositories and dependencies sections of pom.xml.

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>20.11</version>
</dependency>

Extract Text from PDF using Java

Follow these steps to extract text from an entire PDF document.

Load the PDF with the Document class.
Create a TextAbsorber instance.
Accept the absorber for all pages using Document.getPages().accept(TextAbsorber).
Retrieve the extracted text with TextAbsorber.getText().
(Optional) Save the text to a .txt file.

{{ gist aspose-pdf 474c352a71ac9477aa0d604fd32e1c6a “Examples-src-main-java-com-aspose-pdf-examples-NewDocumentObject-text-ExtractTextFromAllThePagesOfPDFDocument-.java” }}
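
The optional last step, saving the text to a .txt file, needs only the Java standard library. The following is a minimal sketch, assuming Java 11 or later and assuming the input string is whatever TextAbsorber.getText() returned. The class ExtractedTextWriter and its methods are hypothetical names introduced for illustration and are not part of Aspose.PDF.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Aspose.PDF) for the optional "save to .txt" step. */
final class ExtractedTextWriter {

    private ExtractedTextWriter() {
    }

    /**
     * Writes the extracted text to the target file as UTF-8,
     * creating the file or overwriting an existing one.
     * The text is assumed to come from TextAbsorber.getText().
     */
    static Path saveText(String extractedText, Path target) {
        try {
            return Files.writeString(target, extractedText, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a file back as UTF-8, for example to check what was saved. */
    static String readText(Path source) {
        try {
            return Files.readString(source, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Naming the charset explicitly keeps non-ASCII characters from the PDF intact regardless of the platform default. A call such as ExtractedTextWriter.saveText(text, Path.of("extracted.txt")) would complete the steps above.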

Extract Text from Specific Page in PDF

To extract text from a single page, use these steps.

Load the PDF with the Document class.
Create a TextDevice instance.
Set extraction options via TextExtractionOptions.
Apply options with TextDevice.setExtractionOptions(...).
Call TextDevice.process(Page, String) for the desired page; the String argument is the path of the output text file.

{{ gist aspose-pdf 474c352a71ac9477aa0d604fd32e1c6a “Examples-src-main-java-com-aspose-pdf-examples-NewDocumentObject-text-ExtractTextFromPDFUsingTextDevice-ExtractTextFromParticularPage.java” }}
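
Because the second argument of the TextDevice call is an output file name, extracting several pages one at a time needs a distinct file name per page. The sketch below shows one possible way to derive such names and to reject invalid page numbers before calling the API. It assumes that page numbers start at 1, as in Aspose's page collection. The class PageOutputPaths and the "-page-N.txt" naming scheme are hypothetical and are not part of Aspose.PDF.

```java
import java.nio.file.Path;

/** Hypothetical helper (not part of Aspose.PDF) that names one output file per page. */
final class PageOutputPaths {

    private PageOutputPaths() {
    }

    /**
     * Returns outputDir/baseName-page-N.txt for the given page number.
     * Assumption: pages are numbered from 1, so valid values are 1..pageCount.
     */
    static Path forPage(Path outputDir, String baseName, int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return outputDir.resolve(baseName + "-page-" + pageNumber + ".txt");
    }
}
```

The resulting path, converted with toString(), could then be passed as the String argument of the TextDevice call for that page.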

Extract Text from a Page Region in PDF

To extract text from a specific region, define a rectangle covering that area.

Load the PDF with the Document class.
Create a TextAbsorber instance.
Restrict the search area through TextAbsorber.getTextSearchOptions(): call setLimitToPageBounds(true) and setRectangle(new Rectangle(100, 200, 250, 350)), where the four values are the lower-left x, lower-left y, upper-right x, and upper-right y coordinates of the region.
Accept the absorber for the target page.
Retrieve the text with TextAbsorber.getText().

{{ gist aspose-pdf 474c352a71ac9477aa0d604fd32e1c6a “Examples-src-main-java-com-aspose-pdf-examples-NewDocumentObject-text-ExtractTextFromAnParticularPageRegion-.java” }}
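
Picking the rectangle values is often the hard part, because many viewers and image tools measure from the top-left corner, while PDF coordinates by default are measured in points (1/72 inch) from the bottom-left corner of the page. The sketch below converts a top-left-based box into lower-left and upper-right coordinates. It assumes an unrotated page whose coordinate origin is its bottom-left corner, and it assumes the Rectangle constructor takes its arguments in the order lower-left x, lower-left y, upper-right x, upper-right y. The class PdfRegion is a hypothetical helper and is not part of Aspose.PDF.

```java
/** Hypothetical helper (not part of Aspose.PDF): a region in PDF points, bottom-left origin. */
final class PdfRegion {
    final double llx; // lower-left x
    final double lly; // lower-left y
    final double urx; // upper-right x
    final double ury; // upper-right y

    private PdfRegion(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /**
     * Converts a box given by its top-left corner (measured from the top-left
     * of the page) plus width and height into bottom-left-based coordinates.
     * All values are in points. Only the vertical extent and the left edge are
     * checked against the page, because the page width is not a parameter.
     */
    static PdfRegion fromTopLeft(double pageHeight, double left, double top,
                                 double width, double height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("width and height must be positive");
        }
        double ury = pageHeight - top; // top edge, measured up from the page bottom
        double lly = ury - height;     // bottom edge
        if (left < 0 || lly < 0 || ury > pageHeight) {
            throw new IllegalArgumentException("region must lie inside the page");
        }
        return new PdfRegion(left, lly, left + width, ury);
    }
}
```

For a page 792 points high (US Letter), PdfRegion.fromTopLeft(792, 100, 442, 150, 150) yields 100, 200, 250, 350, which are the values used in the rectangle above.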

Conclusion

You now know how to extract text from PDF files using Java. The approaches covered are extracting text from the whole document, from a single page, and from a defined region of a page. Explore more features of the Java PDF API in the documentation.
