Extract Text from PDF using Java

In this post, you will learn how to extract text from PDF files using Java. Text extraction is useful in scenarios such as text analysis, information retrieval, and document parsing. Because PDF is one of the most widely used document formats, these use cases come up often. Let's see how to perform PDF text extraction from within Java applications.

Java API to Extract Text from PDF - Free Download
Extract Text from PDF using Java
Extract Text from Specific Page in PDF
Extract Text from a Page Region in PDF

Java API to Extract Text from PDF - Free Download

Aspose.PDF for Java is a PDF manipulation API that provides a wide range of features to create and process PDF files. It includes a text extractor that offers several ways of extracting text from PDF documents in a few lines of code. You can either download the API's JAR or add it to a Maven-based application using the following configuration in pom.xml. The repository entry belongs inside the <repositories> element and the dependency entry inside <dependencies>.

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>20.11</version>
</dependency>

Extract Text from PDF using Java

The following are the steps to extract text from a PDF document using Aspose.PDF for Java.

Use the Document class to load the PDF file.
Create an object of the TextAbsorber class.
Accept the TextAbsorber for all pages of the PDF using the Document.getPages().accept(TextAbsorber) method.
Use the TextAbsorber.getText() method to fetch all the text from the PDF.
Save the text to a TXT file (optional).

The following code sample shows how to extract text from PDF using Java.

	// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
	// Open document
	Document pdfDocument = new Document("input.pdf");

	// Create TextAbsorber object to extract text
	TextAbsorber textAbsorber = new TextAbsorber();

	// Accept the absorber for all the pages
	pdfDocument.getPages().accept(textAbsorber);

	// Get the extracted text
	String extractedText = textAbsorber.getText();

	// Write the extracted text to a file; try-with-resources closes the writer
	try (java.io.FileWriter writer = new java.io.FileWriter(new java.io.File("Extracted_text.txt"))) {
		writer.write(extractedText);
	}
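
The optional save step can also be done with the standard library alone. The following is a minimal sketch, not part of the Aspose API: the class name ExtractedTextFile and its save and load helpers are hypothetical, and the literal string stands in for the value returned by TextAbsorber.getText(). It names UTF-8 explicitly because FileWriter uses the JVM's default charset, which on JDKs before 18 depends on the platform and can corrupt non-ASCII text.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative, not part of Aspose.PDF.
class ExtractedTextFile {

    // Writes the text as UTF-8; try-with-resources closes the writer even if write() fails.
    static void save(String text, Path target) {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back as UTF-8, mainly to check the round trip.
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder standing in for the string returned by TextAbsorber.getText().
        String extractedText = "R\u00e9sum\u00e9\nLine two";
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "Extracted_text.txt");
        save(extractedText, target);
        System.out.println(load(target).equals(extractedText)); // prints "true"
    }
}
```

In a real program, the call would be ExtractedTextFile.save(textAbsorber.getText(), Paths.get("Extracted_text.txt")).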


Extract Text from Specific Page in PDF

You can also extract text from a specific page of the PDF document using the following steps.

Use the Document class to load the PDF file.
Create an instance of the TextDevice class.
Define additional options using the TextExtractionOptions class.
Set the options using the TextDevice.setExtractionOptions(TextExtractionOptions) method.
Use TextDevice.process(Page, String) to extract the text from the specified page and save it to a file.

The following code sample shows how to extract text from a specific page in PDF using Java.

	// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
	// open document
	Document pdfDocument = new Document("input.pdf");
	// create text device
	TextDevice textDevice = new TextDevice();

	// set text extraction options - set text extraction mode (Raw or Pure)
	TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);

	textDevice.setExtractionOptions(textExtOptions);

	// get the text from the first page of the PDF (page numbers start at 1) and save it to a text file
	textDevice.process(pdfDocument.getPages().get_Item(1), "ExtractedText.txt");
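
Note that get_Item(1) in the sample returns the first page, so page numbers start at 1 rather than 0. If the rest of your code uses zero-based indexes, a small conversion with a bounds check helps avoid off-by-one errors. This is a minimal sketch; the PageNumbers class and toPageNumber method are hypothetical, and it assumes the page count is obtained from the document, for example via pdfDocument.getPages().size().

```java
// Hypothetical helper; not part of Aspose.PDF.
class PageNumbers {

    // Converts a zero-based index to a one-based page number, rejecting out-of-range values.
    static int toPageNumber(int zeroBasedIndex, int pageCount) {
        if (zeroBasedIndex < 0 || zeroBasedIndex >= pageCount) {
            throw new IndexOutOfBoundsException(
                    "index " + zeroBasedIndex + " is outside 0.." + (pageCount - 1));
        }
        return zeroBasedIndex + 1;
    }

    public static void main(String[] args) {
        int pageCount = 5; // assumed value; a real program would ask the document
        System.out.println(toPageNumber(0, pageCount)); // prints "1"
        System.out.println(toPageNumber(4, pageCount)); // prints "5"
    }
}
```

The result can then be passed to get_Item, as in pdfDocument.getPages().get_Item(PageNumbers.toPageNumber(index, pageCount)).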


Extract Text from a Page Region in PDF

You can also extract text from a particular region of a page. To do so, define a rectangle that covers the region you want to read. The following are the steps to extract text from a page region.

Use the Document class to load the PDF file.
Create an object of the TextAbsorber class.
Limit the search to the page bounds using TextAbsorber.getTextSearchOptions().setLimitToPageBounds(true), and define the region using TextAbsorber.getTextSearchOptions().setRectangle(new Rectangle(100, 200, 250, 350)).
Accept the absorber for the particular page.
Use the TextAbsorber.getText() method to get the extracted text.

The following code sample shows how to extract text from a particular page region in Java.

	// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
	// open document
	Document doc = new Document("page_0001.pdf");

	// create TextAbsorber object to extract text
	TextAbsorber absorber = new TextAbsorber();
	absorber.getTextSearchOptions().setLimitToPageBounds(true);
	absorber.getTextSearchOptions().setRectangle(new Rectangle(100, 200, 250, 350));
	// accept the absorber for first page
	doc.getPages().get_Item(1).accept(absorber);

	// get the extracted text
	String extractedText = absorber.getText();
	// write the extracted contents to a file; try-with-resources closes the writer
	try (java.io.BufferedWriter writer = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File("ExtractedText.txt")))) {
		writer.write(extractedText);
	}
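
Choosing the rectangle values is often the hard part. The sketch below assumes that the four Rectangle arguments are the lower-left x, lower-left y, upper-right x, and upper-right y of the region, measured in points from the bottom-left corner of the page, which is the default PDF coordinate system. Many tools report positions from the top-left corner instead, so the hypothetical helper fromTopLeft converts such a box into those four values. The class name, the method name, and the page height of 842 points (A4) are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Aspose.PDF.
class RegionCoordinates {

    // Returns {llx, lly, urx, ury} for a box whose position is given from the top-left corner.
    static double[] fromTopLeft(double x, double yFromTop, double width, double height,
                                double pageHeight) {
        double llx = x;
        double urx = x + width;
        double ury = pageHeight - yFromTop; // top edge, measured from the page bottom
        double lly = ury - height;          // bottom edge, measured from the page bottom
        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // A 150 x 150 point box, 100 points from the left and 492 points from the top
        // of a page that is assumed to be 842 points tall.
        double[] r = fromTopLeft(100, 492, 150, 150, 842);
        System.out.println(Arrays.toString(r)); // prints "[100.0, 200.0, 250.0, 350.0]"
    }
}
```

The resulting values match the rectangle used in the sample, new Rectangle(100, 200, 250, 350).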


Conclusion

In this article, you have learned how to extract text from PDF using Java. You have seen three ways of extracting text: from a whole PDF, from a specific page, and from a specific page region. You can learn more about the Java PDF API in its documentation.
