{{< figure align=center src="images/Extract-Text-From-PDF.jpg" alt="extract text from pdf java">}}
This article shows how to extract text from PDF files in Java. Text extraction underpins tasks such as text analysis, information retrieval, and document parsing, and because PDF is one of the most widely used document formats, it is a common requirement. The sections below walk through extracting text from a whole document, a single page, and a region of a page.
- Java Text Extraction API - Free Download
- Extract Text from PDF using Java
- Extract Text from Specific Page in PDF
- Text Extraction from a Page Region in PDF
Java API to Extract Text from PDF - Free Download
Aspose.PDF for Java is a PDF manipulation API that includes a text extractor. You can download the API's JAR or add it to a Maven project with the following repository and dependency configuration.
<repository>
<id>AsposeJavaAPI</id>
<name>Aspose Java API</name>
<url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-pdf</artifactId>
<version>20.11</version>
</dependency>
Extract Text from PDF using Java
Follow these steps to extract text from an entire PDF document.
- Load the PDF with the Document class.
- Create a TextAbsorber instance.
- Accept the absorber for all pages using Document.getPages().accept(TextAbsorber).
- Retrieve the extracted text with TextAbsorber.getText().
- (Optional) Save the text to a .txt file.
{{< gist aspose-pdf 474c352a71ac9477aa0d604fd32e1c6a "Examples-src-main-java-com-aspose-pdf-examples-NewDocumentObject-text-ExtractTextFromAllThePagesOfPDFDocument-.java" >}}
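The optional last step needs nothing beyond the standard library. The following is a minimal sketch, not part of Aspose.PDF: it assumes the String returned by TextAbsorber.getText() is already in hand, and the class name ExtractedTextWriter, the method save, and the file name extracted-text.txt are hypothetical placeholders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Aspose.PDF.
final class ExtractedTextWriter {

    private ExtractedTextWriter() {}

    /**
     * Writes the given text to target as UTF-8, replacing any existing file.
     * The parent directory of target must already exist.
     */
    static Path save(String text, Path target) {
        try {
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
            return target;
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to declare IOException.
            throw new UncheckedIOException("Could not write " + target, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by TextAbsorber.getText().
        String extracted = "First line\nSecond line";
        Path out = save(extracted, Paths.get("extracted-text.txt"));
        System.out.println("Saved to " + out.toAbsolutePath());
    }
}
```

Writing the bytes with an explicit UTF-8 charset keeps non-ASCII characters from the PDF intact regardless of the platform's default encoding.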
Extract Text from Specific Page in PDF
To extract text from a single page, use these steps.
- Load the PDF with the Document class.
- Create a TextDevice instance.
- Set extraction options via TextExtractionOptions.
- Apply the options with TextDevice.setExtractionOptions(...).
- Call TextDevice.process(Page, String) for the desired page, passing the output file path as the second argument.
{{< gist aspose-pdf 474c352a71ac9477aa0d604fd32e1c6a "Examples-src-main-java-com-aspose-pdf-examples-NewDocumentObject-text-ExtractTextFromPDFUsingTextDevice-ExtractTextFromParticularPage.java" >}}
Extract Text from a Page Region in PDF
To extract text from a specific region, define a rectangle covering that area.
- Load the PDF with the Document class.
- Create a TextAbsorber instance.
- Limit extraction to page bounds and set the rectangle using TextAbsorber.getTextSearchOptions().setLimitToPageBounds(true) and setRectangle(new Rectangle(100, 200, 250, 350)).
- Accept the absorber for the target page.
- Retrieve the text with TextAbsorber.getText().
{{< gist aspose-pdf 474c352a71ac9477aa0d604fd32e1c6a "Examples-src-main-java-com-aspose-pdf-examples-NewDocumentObject-text-ExtractTextFromAnParticularPageRegion-.java" >}}
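The four numbers passed to Rectangle are the lower-left and upper-right corners of the region (llx, lly, urx, ury) in points, and PDF coordinates are measured from the bottom-left corner of the page. Regions are often known instead as a top-left offset plus a width and height, as most image viewers report them. The sketch below shows one way to convert between the two. It is a hypothetical helper, not part of Aspose.PDF: the class PdfRegion, its method names, and the assumed page height of 842 points (A4) are illustrative only.

```java
// Hypothetical helper for illustration; not part of Aspose.PDF.
final class PdfRegion {
    final double llx; // lower-left x
    final double lly; // lower-left y
    final double urx; // upper-right x
    final double ury; // upper-right y

    PdfRegion(double llx, double lly, double urx, double ury) {
        if (llx > urx || lly > ury) {
            throw new IllegalArgumentException("lower-left corner must not exceed upper-right corner");
        }
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /**
     * Converts a region given as a top-left offset plus size (origin at the
     * top-left of the page, y growing downward) into PDF coordinates
     * (origin at the bottom-left of the page, y growing upward).
     */
    static PdfRegion fromTopLeft(double left, double top, double width, double height, double pageHeight) {
        double ury = pageHeight - top; // top edge, measured from the page bottom
        double lly = ury - height;     // bottom edge, measured from the page bottom
        return new PdfRegion(left, lly, left + width, ury);
    }

    double width() {
        return urx - llx;
    }

    double height() {
        return ury - lly;
    }

    public static void main(String[] args) {
        // Assumed A4 page height of 842 points; a 150 x 150 box placed 100 points
        // from the left edge and 492 points from the top edge.
        PdfRegion r = PdfRegion.fromTopLeft(100, 492, 150, 150, 842);
        // Yields llx=100, lly=200, urx=250, ury=350.
        System.out.println(r.llx + ", " + r.lly + ", " + r.urx + ", " + r.ury);
    }
}
```

With these inputs the helper yields the same corner values as the rectangle in the steps above, so they can be passed directly to new Rectangle(...) in the setRectangle call.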
Conclusion
You now know how to extract text from PDF files using Java. This article covered three approaches: extracting text from the whole document, from a single page, and from a defined region of a page. Explore more features of the Java PDF API in the documentation.