In this post, you will learn how to extract text from PDF files using Java. Text extraction is useful in scenarios such as text analysis, information retrieval, and document parsing. Because PDF is one of the most widely used document formats, the need to extract text from it comes up often. So let’s begin and see how to perform PDF text extraction from within Java applications.
- Java API to Extract Text from PDF - Free Download
- Extract Text from PDF using Java
- Extract Text from Specific Page in PDF
- Extract Text from a Page Region in PDF
Java API to Extract Text from PDF - Free Download
Aspose.PDF for Java is a PDF manipulation API that provides a wide range of features to create and process PDF files. It includes a text extractor that offers several ways of extracting text from PDF documents in a few lines of code. You can either download the API’s JAR or add it to a Maven-based application by placing the following repository and dependency entries in your pom.xml.
<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>20.11</version>
</dependency>
Extract Text from PDF using Java
The following are the steps to extract text from a PDF document using Aspose.PDF for Java.
- Use Document class to load the PDF file.
- Create an object of TextAbsorber class.
- Pass the TextAbsorber to all pages of the PDF using the Document.getPages().accept(TextAbsorber) method.
- Use TextAbsorber.getText() method to fetch all the text from the PDF.
- Save the text into a TXT file (optional).
The following code sample shows how to extract text from PDF using Java.
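A minimal sketch of these steps might look like the following. The class name, the extractText helper, and the file names input.pdf and extracted-text.txt are placeholders chosen for illustration, and the sketch assumes the Aspose.PDF for Java dependency shown earlier is on the classpath.

```java
import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ExtractAllText {

    // Hypothetical helper: returns the text of every page in the PDF at pdfPath.
    static String extractText(String pdfPath) {
        // Load the PDF file
        Document pdfDocument = new Document(pdfPath);
        try {
            // Create the absorber and let it visit all pages
            TextAbsorber textAbsorber = new TextAbsorber();
            pdfDocument.getPages().accept(textAbsorber);
            // Fetch the collected text
            return textAbsorber.getText();
        } finally {
            // Release resources held by the document
            pdfDocument.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // "input.pdf" and "extracted-text.txt" are placeholder paths
        String text = extractText("input.pdf");
        // Optional step: save the extracted text into a TXT file
        Files.write(Paths.get("extracted-text.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping the extraction in its own method makes the optional save step easy to drop when the text is only needed in memory.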
Extract Text from Specific Page in PDF
You can also extract text from a specific page of the PDF document using the following steps.
- Use Document class to load the PDF file.
- Create an instance of TextDevice class.
- Define additional options using TextExtractionOptions class.
- Set options using TextDevice.setExtractionOptions(TextExtractionOptions) method.
- Use TextDevice.process(Page, String) to extract the text from the specified page and write it to the given output file.
The following code sample shows how to extract text from a specific page in PDF using Java.
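As a rough sketch of these steps, the example below extracts the first page in raw formatting mode and writes it to a text file. The class name, the extractPage helper, and the file names are illustrative assumptions. Page numbers in the page collection start at 1.

```java
import com.aspose.pdf.Document;
import com.aspose.pdf.TextExtractionOptions;
import com.aspose.pdf.devices.TextDevice;

class ExtractPageText {

    // Hypothetical helper: writes the text of one page (1-based) to outputPath.
    static void extractPage(String pdfPath, int pageNumber, String outputPath) {
        // Load the PDF file
        Document pdfDocument = new Document(pdfPath);
        try {
            // Create the text device
            TextDevice textDevice = new TextDevice();
            // Choose the extraction mode (Raw here; Pure is the other mode)
            TextExtractionOptions options =
                    new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
            textDevice.setExtractionOptions(options);
            // Extract the requested page and save its text to the output file
            textDevice.process(pdfDocument.getPages().get_Item(pageNumber), outputPath);
        } finally {
            // Release resources held by the document
            pdfDocument.close();
        }
    }

    public static void main(String[] args) {
        // "input.pdf" and "page-1.txt" are placeholder paths
        extractPage("input.pdf", 1, "page-1.txt");
    }
}
```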
Extract Text from a Page Region in PDF
You can also extract text from a particular region of a PDF page. To do so, define a rectangle that covers the region you want to extract text from. The following are the steps to extract text from a page region.
- Use Document class to load the PDF file.
- Create an object of TextAbsorber class.
- Restrict the search to the page bounds using TextAbsorber.getTextSearchOptions().setLimitToPageBounds(true), then define the region using TextAbsorber.getTextSearchOptions().setRectangle(new Rectangle(100, 200, 250, 350)).
- Accept the absorber for the particular page.
- Use TextAbsorber.getText() method to extract text.
The following code sample shows how to extract text from a particular page region in Java.
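The steps above can be sketched as follows. The class name, the extractRegion helper, and the input path are assumptions made for illustration; the rectangle values are the ones used in the steps. The Rectangle constructor takes the lower-left x, lower-left y, upper-right x, and upper-right y coordinates, in that order.

```java
import com.aspose.pdf.Document;
import com.aspose.pdf.Rectangle;
import com.aspose.pdf.TextAbsorber;

class ExtractRegionText {

    // Hypothetical helper: returns the text inside region on the given page (1-based).
    static String extractRegion(String pdfPath, int pageNumber, Rectangle region) {
        // Load the PDF file
        Document pdfDocument = new Document(pdfPath);
        try {
            // Create the absorber and limit it to the rectangle on the page
            TextAbsorber absorber = new TextAbsorber();
            absorber.getTextSearchOptions().setLimitToPageBounds(true);
            absorber.getTextSearchOptions().setRectangle(region);
            // Accept the absorber for the particular page only
            pdfDocument.getPages().get_Item(pageNumber).accept(absorber);
            // Fetch the text found inside the region
            return absorber.getText();
        } finally {
            // Release resources held by the document
            pdfDocument.close();
        }
    }

    public static void main(String[] args) {
        // "input.pdf" is a placeholder path
        String text = extractRegion("input.pdf", 1, new Rectangle(100, 200, 250, 350));
        System.out.println(text);
    }
}
```

If the result comes back empty, the rectangle may not overlap any text on that page, so it is worth checking the coordinates against the page size first.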
Conclusion
In this article, you have learned how to extract text from PDF using Java. You have seen several ways of extracting text: from a whole PDF, from a specific page, and from a specific page region. You can learn more about the Java PDF API from its documentation.