Extract Text from PDF in Java

In this post, you will learn how to extract text from PDF files using Java. Text extraction is useful in scenarios such as text analysis, information retrieval, and document parsing. Because PDF is one of the most widely used document formats, these use cases come up often. So let’s begin and see how to perform PDF text extraction from within Java applications.

Java API to Extract Text from PDF - Free Download

Aspose.PDF for Java is a well-known PDF file manipulation API that provides a wide range of features to create and process PDF files. The API contains a text extractor that offers several ways of extracting text from PDF documents within a few lines of code. You can either download the API’s JAR or add it to a Maven-based application using the following configuration in your pom.xml, where the repository entry goes inside the <repositories> element and the dependency inside the <dependencies> element.

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>20.11</version>
</dependency>

Extract Text from PDF using Java

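To extract text from a PDF document using Aspose.PDF for Java, load the file with the Document class, create a TextAbsorber object, pass the absorber to all pages using Document.getPages().accept(), read the result with TextAbsorber.getText(), and write it to a text file.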

The following code sample shows how to extract text from PDF using Java.

// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
// Open document
Document pdfDocument = new Document("input.pdf");
// Create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
// Accept the absorber for all the pages
pdfDocument.getPages().accept(textAbsorber);
// Get the extracted text
String extractedText = textAbsorber.getText();
// Create a writer and open the output file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File("Extracted_text.txt"));
// Write the extracted text to the file
writer.write(extractedText);
// Close the stream
writer.close();
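
The sample above writes the file with FileWriter, which uses the platform's default character encoding and leaves the file open if write() throws. As a minimal sketch of an alternative save step, the following uses only the Java standard library to write the text as UTF-8 inside a try-with-resources block. The class name ExtractedTextSaver and the method save are hypothetical names chosen for illustration, and the sample string stands in for the value returned by textAbsorber.getText().

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Aspose.PDF): saves extracted text as UTF-8.
final class ExtractedTextSaver {

    // Writes the given text to the target path, replacing any existing file.
    static void save(String text, Path target) throws IOException {
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for textAbsorber.getText(); replace with the real extracted text
        String extractedText = "Sample text with an accented word: caf\u00e9";
        Path target = Paths.get("Extracted_text.txt");
        save(extractedText, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Fixing the encoding to UTF-8 keeps the output file identical across machines, which matters when the PDF contains non-ASCII characters.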

Extract Text from Specific Page in PDF

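You can also extract text from a specific page of the PDF document. To do so, load the document, create a TextDevice object, set its TextExtractionOptions (Raw or Pure formatting mode), and call TextDevice.process() with the desired page and the output file name.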

The following code sample shows how to extract text from a specific page in PDF using Java.

// open document
Document pdfDocument = new Document("input.pdf");
// create text device
TextDevice textDevice = new TextDevice();
// set text extraction options - set text extraction mode (Raw or Pure)
TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
textDevice.setExtractionOptions(textExtOptions);
// get the text from the first page of the PDF and save it to a text file
textDevice.process(pdfDocument.getPages().get_Item(1), "ExtractedText.txt");

Extract Text from a Page Region in PDF

You can also extract text from a particular region of a page in a PDF. To do so, define a rectangle that covers the region, set it in the TextSearchOptions of a TextAbsorber object, accept the absorber on the page that contains the region, and read the result with TextAbsorber.getText().

The following code sample shows how to extract text from a particular page region in Java.

// open document
Document doc = new Document("page_0001.pdf");
// create TextAbsorber object to extract text
TextAbsorber absorber = new TextAbsorber();
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new Rectangle(100, 200, 250, 350));
// accept the absorber for first page
doc.getPages().get_Item(1).accept(absorber);
// get the extracted text
String extractedText = absorber.getText();
// create a writer and open the output file
java.io.BufferedWriter writer = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File("ExtractedText.txt")));
// write extracted contents
writer.write(extractedText);
// Close writer
writer.close();
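
Choosing the four numbers for the rectangle is often the tricky part. The sketch below assumes that the Rectangle arguments are the lower-left x, lower-left y, upper-right x, and upper-right y of the region, measured in PDF points (1/72 inch) from the bottom-left corner of the page. If you measured the region from the top-left corner instead, as most image viewers do, the vertical axis has to be flipped. The class RegionCoordinates and the method fromTopLeft are hypothetical helpers written for illustration and are not part of Aspose.PDF.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Aspose.PDF): converts a box measured from
// the top-left corner of a page into lower-left/upper-right PDF coordinates.
final class RegionCoordinates {

    // Returns {llx, lly, urx, ury} in points for a box given by its left
    // offset, its offset from the top of the page, its width, and its height.
    static double[] fromTopLeft(double x, double yFromTop, double width, double height, double pageHeight) {
        double llx = x;
        double ury = pageHeight - yFromTop; // flip the vertical axis
        double lly = ury - height;
        double urx = x + width;
        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // A 150 x 150 pt box whose top edge lies 442 pt below the top of a
        // US Letter page (792 pt tall)
        double[] r = fromTopLeft(100, 442, 150, 150, 792);
        System.out.println(Arrays.toString(r)); // prints [100.0, 200.0, 250.0, 350.0]
        // Under the assumption above, these values would be passed as
        // new Rectangle(r[0], r[1], r[2], r[3])
    }
}
```

The example values reproduce the rectangle (100, 200, 250, 350) used in the sample above.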

Conclusion

In this article, you have learned how to extract text from PDF files using Java. You have seen three ways of extracting text: from a whole PDF, from a specific page, and from a specific page region. You can learn more about the Java PDF API in its documentation.
