Extract Data from Tables in PDF Files using C++

PDF files are a standard format for exchanging documents over the internet. Documents such as invoices and product guides are usually shared as PDFs. When you have many invoices containing tabular data that needs further processing, extracting that data programmatically is more efficient than copying it by hand. This article shows how to extract data from PDF tables using C++.

C++ API for Extracting Data from Tables in PDF Files
Extract Data from PDF Tables using C++
Extract Data from a Table in a Specific Area of a PDF Page

C++ API for Extracting Data from Tables in PDF Files

Aspose.PDF for C++ is a C++ library that allows you to create, read, and update PDF files. Furthermore, the API supports extracting data from tables in PDF files. You can either install the API through NuGet or download it directly from the downloads section.

PM> Install-Package Aspose.PDF.Cpp

Extract Data from PDF Tables using C++

The following are the steps to extract data from PDF tables.

Load the PDF document using the Document class.
Iterate through the pages of the document using the Document->get_Pages() method.
In each iteration, create an instance of the TableAbsorber class and specify the page for extracting tables using the TableAbsorber->Visit(System::SharedPtr<Page> page) method.
Get the tables using the TableAbsorber->get_TableList() method and iterate over them.
For each AbsorbedTable, iterate through the rows using the AbsorbedTable->get_RowList() method.
For each AbsorbedRow, iterate through the cells using the AbsorbedRow->get_CellList() method.
Get TextFragmentCollection for each AbsorbedCell using the AbsorbedCell->get_TextFragments() method and loop through it.
Get TextSegmentCollection for each TextFragment using the TextFragment->get_Segments() method and loop through it.
Retrieve the text from each TextSegment and print it.

The following sample code shows how to extract data from PDF tables using C++.

	// Load the PDF document
	auto pdfDocument = MakeObject<Document>(u"SourceDirectory\\PDF\\Table_input3.pdf");

	// Iterate through the pages of the document
	for (auto page : pdfDocument->get_Pages())
	{
		// Create an instance of the TableAbsorber class and find the tables on the page
		auto absorber = MakeObject<TableAbsorber>();
		absorber->Visit(page);

		// Iterate through the tables
		for (auto table : absorber->get_TableList())
		{
			Console::WriteLine(u"Table");

			// Iterate through the rows
			for (auto row : table->get_RowList())
			{
				// Iterate through the cells
				for (auto cell : row->get_CellList())
				{
					// Iterate through the text fragments
					for (auto fragment : cell->get_TextFragments())
					{
						String text = u"";

						// Iterate through the text segments
						for (auto seg : fragment->get_Segments())
						{
							// Append the text of the segment
							text = String::Concat(text, seg->get_Text());
						}

						// Print the text of the fragment
						Console::WriteLine(text);
					}
				}
				// Print a blank line after each row
				Console::WriteLine();
			}
		}
	}
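
The sample above prints each text fragment on its own line, so the column structure is not preserved in the output. If the extracted values need further processing, one option is to collect the text of each cell into rows and write them out as CSV. The following is a minimal sketch of that idea in standard C++ only. The Row and Table aliases and the EscapeCsvField and ToCsv helpers are hypothetical names introduced for illustration; they are not part of Aspose.PDF for C++, and filling the rows from the AbsorbedCell text is left to the loops shown above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical plain-data stand-ins for the text collected from a table:
// one string per cell, one vector of strings per row.
using Row = std::vector<std::string>;
using Table = std::vector<Row>;

// Quote a field using the common CSV convention: wrap it in double quotes
// when it contains a comma, a double quote, or a line break, and double
// any embedded double quotes.
std::string EscapeCsvField(const std::string& field)
{
    if (field.find_first_of(",\"\r\n") == std::string::npos)
    {
        return field;
    }

    std::string quoted = "\"";
    for (char c : field)
    {
        if (c == '"')
        {
            quoted += '"';
        }
        quoted += c;
    }
    quoted += '"';
    return quoted;
}

// Serialize the table as CSV: one line per row, cells separated by commas.
std::string ToCsv(const Table& table)
{
    std::string csv;
    for (const Row& row : table)
    {
        for (std::size_t i = 0; i < row.size(); ++i)
        {
            if (i > 0)
            {
                csv += ',';
            }
            csv += EscapeCsvField(row[i]);
        }
        csv += '\n';
    }
    return csv;
}
```

With this sketch, you would append one string per cell to a Row inside the cell loop, push the Row into a Table after each row, and call ToCsv once the table has been traversed.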


Extract Data from a Table in a Specific Area of a PDF Page

To extract data from a table in a specific area of a PDF page, follow the steps below.

Load the PDF document using the Document class.
Retrieve the page containing the table using the Document->get_Pages()->idx_get(int32_t index) method.
Loop through the annotations on the page and get the square annotation that marks the region.
Create an instance of the TableAbsorber class and specify the page for extracting tables using the TableAbsorber->Visit(System::SharedPtr<Page> page) method.
Get the tables using the TableAbsorber->get_TableList() method and iterate over them.
If the table is in the region, perform the following steps:
- Iterate through the rows of the AbsorbedTable using the AbsorbedTable->get_RowList() method.
- For each AbsorbedRow, iterate through the cells using the AbsorbedRow->get_CellList() method.
- Get TextFragmentCollection for each AbsorbedCell using the AbsorbedCell->get_TextFragments() method and loop through it.
- Get TextSegmentCollection for each TextFragment using the TextFragment->get_Segments() method and loop through it.
- Retrieve the text from each TextSegment and print it.

The following sample code demonstrates how to extract data from a table in a specific area of a PDF page using C++.

	// Load the PDF document
	auto pdfDocument = MakeObject<Document>(u"SourceDirectory\\PDF\\Table_input4.pdf");

	// Get the first page of the document
	auto page = pdfDocument->get_Pages()->idx_get(1);

	// Iterate through the annotations on the page
	for (auto annotation : page->get_Annotations())
	{
		// Check the annotation type
		if (annotation->get_AnnotationType() == Annotations::AnnotationType::Square)
		{
			System::SharedPtr<SquareAnnotation> squareAnnotation = DynamicCast<SquareAnnotation>(annotation);

			// Create an instance of the TableAbsorber class and find the tables on the page
			auto absorber = MakeObject<TableAbsorber>();
			absorber->Visit(page);

			// Iterate through the tables
			for (auto table : absorber->get_TableList())
			{
				// Check if the table lies inside the rectangle of the square annotation
				if ((squareAnnotation->get_Rect()->get_LLX() < table->get_Rectangle()->get_LLX()) &&
					(squareAnnotation->get_Rect()->get_LLY() < table->get_Rectangle()->get_LLY()) &&
					(squareAnnotation->get_Rect()->get_URX() > table->get_Rectangle()->get_URX()) &&
					(squareAnnotation->get_Rect()->get_URY() > table->get_Rectangle()->get_URY()))
				{
					// Iterate through the rows
					for (auto row : table->get_RowList())
					{
						// Iterate through the cells
						for (auto cell : row->get_CellList())
						{
							// Iterate through the text fragments
							for (auto fragment : cell->get_TextFragments())
							{
								String text = u"";

								// Iterate through the text segments
								for (auto seg : fragment->get_Segments())
								{
									// Append the text of the segment
									text = String::Concat(text, seg->get_Text());
								}

								// Print the text of the fragment
								Console::WriteLine(text);
							}
						}
						// Print a blank line after each row
						Console::WriteLine();
					}
				}
			}
			// Stop after the first square annotation
			break;
		}
	}
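
The region check in the sample above is a rectangle containment test: the table counts as inside the region only when all four edges of the annotation rectangle lie strictly outside the matching edges of the table rectangle. The following standalone sketch isolates that test so it is easier to read and reuse. The Rect struct and the StrictlyContains function are hypothetical names used for illustration, not Aspose.PDF types; the four fields correspond to the values returned by get_LLX(), get_LLY(), get_URX(), and get_URY().

```cpp
// Hypothetical stand-in for a PDF rectangle, described by its lower-left
// (llx, lly) and upper-right (urx, ury) corners in page coordinates.
struct Rect
{
    double llx;
    double lly;
    double urx;
    double ury;
};

// Returns true when `inner` lies strictly inside `region`. This mirrors the
// four comparisons in the sample: every edge of the region must lie outside
// the matching edge of the inner rectangle.
bool StrictlyContains(const Rect& region, const Rect& inner)
{
    return region.llx < inner.llx &&
           region.lly < inner.lly &&
           region.urx > inner.urx &&
           region.ury > inner.ury;
}
```

Because the comparisons are strict, a table whose edge coincides with the border of the annotation is treated as outside the region. If such tables should be included, the comparisons can be relaxed to <= and >=.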


Get a Free License

In order to try the API without evaluation limitations, you can request a free temporary license.

Conclusion

In this article, you have learned how to extract data from PDF tables using C++, and how to extract data from a table in a specific region of a PDF page. The Aspose.PDF for C++ API provides many additional features for working with PDF files. You can explore the API in detail by visiting the official documentation. If you have any questions, please feel free to reach out to us on our free support forum.
