Java: Extract Tables from PDF

PDF is one of the most widely used document formats. It supports text, graphics, tables, annotations, and a range of other elements. In some cases, such as processing invoices, you may need to extract data from the tables in PDF documents. This article covers how to do that programmatically in Java.

Java API to Extract Data from Tables in PDF
Extract Data from Tables in PDF
Extract Table from a Specific Area of Page

Java Library to Extract PDF Table

To extract tables from PDF files, we will use Aspose.PDF for Java. It is a feature-rich library for creating and manipulating PDF documents. You can either download the library’s JAR or add it to your Java application using the following Maven configuration.

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>22.12</version>
</dependency>

Extract Tables from a PDF in Java

The following are the steps to extract tables in a PDF in Java.

First, load the PDF document using the Document class.
Loop through the pages of the PDF using the Document.getPages() collection.
In each iteration, create a TableAbsorber object and visit the current page using the TableAbsorber.visit(Page) method.
In a nested loop, iterate through the tables in the TableAbsorber.getTableList() collection.
For each AbsorbedTable, iterate through its rows from AbsorbedTable.getRowList().
For each AbsorbedRow, iterate through its cells from AbsorbedRow.getCellList().
Finally, loop through the TextFragment collection of each AbsorbedCell and print the text.

The following code sample shows how to extract a table in PDF using Java.

	// Load the source PDF document
	com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("sample.pdf");

	// Scan pages
	for (com.aspose.pdf.Page page : pdfDocument.getPages()) {

		// Create a table absorber for this page and visit the page
		com.aspose.pdf.TableAbsorber absorber = new com.aspose.pdf.TableAbsorber();
		absorber.visit(page);

		// Iterate through tables
		for (com.aspose.pdf.AbsorbedTable table : absorber.getTableList()) {
			System.out.println("Table");

			// Iterate through the list of rows
			for (com.aspose.pdf.AbsorbedRow row : table.getRowList()) {

				// Iterate through the list of cells
				for (com.aspose.pdf.AbsorbedCell cell : row.getCellList()) {

					// Iterate through text fragments and print each one followed by "|"
					for (com.aspose.pdf.TextFragment fragment : cell.getTextFragments()) {
						StringBuilder sb = new StringBuilder();
						for (com.aspose.pdf.TextSegment seg : fragment.getSegments()) {
							sb.append(seg.getText());
						}
						System.out.print(sb.toString() + "|");
					}
				}
				// End the output line after each table row
				System.out.println();
			}
		}
	}

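The sample above prints cells separated by a "|" character, which becomes ambiguous if a cell itself contains that character. As a minimal sketch of one alternative, the cell text of a row could be joined as a CSV line with quoting. The class TableRowCsv and its methods are hypothetical helpers, not part of Aspose.PDF, and the input list stands in for the text you would collect from one AbsorbedRow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableRowCsv {

    // Quote a cell when it contains a comma, a double quote, or a line break.
    // Embedded double quotes are doubled, as is conventional for CSV.
    public static String escapeCell(String cell) {
        String value = cell == null ? "" : cell.trim();
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (needsQuotes) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Join the cells of one table row into a single CSV line.
    public static String toCsvRow(List<String> cells) {
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            escaped.add(escapeCell(cell));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Hypothetical cell texts standing in for one extracted table row.
        List<String> row = Arrays.asList("Widget, large", "2", "19.99");
        System.out.println(toCsvRow(row)); // prints "Widget, large",2,19.99
    }
}
```

To use it with the extraction loop, you would collect each cell's text into a list inside the row loop and pass that list to toCsvRow once per row, instead of printing each fragment directly.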

Java: Extract PDF Table from a Specific Area

The following are the steps to extract a table from a specific area of a page in a PDF.

First, load the PDF document using the Document class.
Then, select the desired Page from the Document.getPages() collection.
Get the square annotation that marks the region of interest on the page.
Create a TableAbsorber object and visit the selected page using the TableAbsorber.visit(Page) method.
Iterate through the tables in the TableAbsorber.getTableList() collection.
For each AbsorbedTable, check whether it lies inside the annotated region.
If it does, iterate through its rows using AbsorbedTable.getRowList().
For each AbsorbedRow, iterate through its cells using AbsorbedRow.getCellList().
Finally, loop through the TextFragment collection of each AbsorbedCell and print the text.

The following code sample shows how to extract tables from a specific region of the PDF page.

	// Load the source PDF document
	com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("sample.pdf");

	// Get a reference to the desired page
	com.aspose.pdf.Page page = pdfDocument.getPages().get_Item(1);

	// Create an annotation selector that matches square annotations
	com.aspose.pdf.AnnotationSelector annotationSelector = new com.aspose.pdf.AnnotationSelector(
			new com.aspose.pdf.SquareAnnotation(page, com.aspose.pdf.Rectangle.getTrivial()));

	// Run the selector over the page so that getSelected() is populated
	page.accept(annotationSelector);

	// Get the list of selected annotations
	java.util.List<com.aspose.pdf.Annotation> list = annotationSelector.getSelected();
	if (list.isEmpty()) {
		System.out.println("Marked tables not found.");
		return;
	}

	// Use the first square annotation as the region of interest
	com.aspose.pdf.SquareAnnotation squareAnnotation = (com.aspose.pdf.SquareAnnotation) list.get(0);

	// Create a table absorber and visit the page
	com.aspose.pdf.TableAbsorber absorber = new com.aspose.pdf.TableAbsorber();
	absorber.visit(page);

	// Iterate through tables
	for (com.aspose.pdf.AbsorbedTable table : absorber.getTableList()) {

		// Check whether the table lies strictly inside the annotated region
		boolean isInRegion = (squareAnnotation.getRect().getLLX() < table.getRectangle().getLLX())
				&& (squareAnnotation.getRect().getLLY() < table.getRectangle().getLLY())
				&& (squareAnnotation.getRect().getURX() > table.getRectangle().getURX())
				&& (squareAnnotation.getRect().getURY() > table.getRectangle().getURY());

		if (isInRegion) {
			// Print each row on its own line, with "|" after every text fragment
			for (com.aspose.pdf.AbsorbedRow row : table.getRowList()) {
				for (com.aspose.pdf.AbsorbedCell cell : row.getCellList()) {
					for (com.aspose.pdf.TextFragment fragment : cell.getTextFragments()) {
						StringBuilder sb = new StringBuilder();
						for (com.aspose.pdf.TextSegment seg : fragment.getSegments()) {
							sb.append(seg.getText());
						}
						System.out.print(sb.toString() + "|");
					}
				}
				System.out.println();
			}
		}
	}

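The region test in the sample above is a rectangle containment check on lower-left and upper-right corners. The following sketch isolates that logic so it can be read and tried without the library. The classes RegionCheck and Box are hypothetical stand-ins for illustration, and the coordinates are made-up values in PDF points.

```java
public class RegionCheck {

    // Minimal axis-aligned rectangle: lower-left (llx, lly) and upper-right (urx, ury).
    // Hypothetical stand-in for a PDF rectangle so the sketch runs on its own.
    public static final class Box {
        final double llx, lly, urx, ury;

        public Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    // True when inner lies strictly inside outer: the same four comparisons
    // as the isInRegion expression in the sample.
    public static boolean contains(Box outer, Box inner) {
        return outer.llx < inner.llx
                && outer.lly < inner.lly
                && outer.urx > inner.urx
                && outer.ury > inner.ury;
    }

    public static void main(String[] args) {
        Box annotation = new Box(50, 400, 550, 700);   // marked region
        Box tableInside = new Box(72, 450, 520, 650);  // fully within the region
        Box tableBelow = new Box(72, 100, 520, 300);   // lower on the page

        System.out.println(contains(annotation, tableInside)); // prints true
        System.out.println(contains(annotation, tableBelow));  // prints false
    }
}
```

Because the comparisons are strict, a table whose edge coincides exactly with the annotation border is treated as outside the region. If that matters for your documents, drawing the annotation slightly larger than the table, or allowing a small tolerance in the comparison, are two possible adjustments.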

Free Java PDF Library to Extract Tables

You can get a free temporary license to extract tables from a PDF file without any limitations.

Explore Java PDF Library

You can learn more about the Java PDF library in the documentation. In addition, you can ask questions via our forum.

Conclusion

In this article, you have learned how to extract tables in a PDF using Java. Furthermore, the article covered how to extract tables from a specific region of the page in PDF. You can easily use our PDF library and the provided code samples in your application.
