PDF is a standard format for exchanging documents over the internet, and documents such as invoices and product guides are usually shared as PDFs. When you have many invoices containing tabular data that needs further processing, extracting that data programmatically is more efficient than copying it by hand. To that end, this article will teach you how to extract data from PDF tables using C++.

C++ API for Extracting Data from Tables in PDF Files

Aspose.PDF for C++ is a C++ library that allows you to create, read, and update PDF files. Furthermore, the API supports extracting data from tables in PDF files. You can either install the API through NuGet or download it directly from the downloads section.

PM> Install-Package Aspose.PDF.Cpp

Extract Data from PDF Tables using C++

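The following are the steps to extract data from PDF tables.

- Load the PDF file using the Document class.
- Iterate through the pages of the document.
- For each page, create an instance of the TableAbsorber class and call its Visit method with the page.
- Iterate through the tables returned by get_TableList(), then through each table's rows (get_RowList()) and each row's cells (get_CellList()).
- For each cell, iterate through its text fragments and concatenate the text of each fragment's segments.
- Print the extracted text.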

The following sample code shows how to extract data from PDF tables using C++.

// Assumes using-directives for the System, Aspose::Pdf, and Aspose::Pdf::Text namespaces
// Load the PDF document
auto pdfDocument = MakeObject<Document>(u"SourceDirectory\\PDF\\Table_input3.pdf");

// Iterate through the pages of the document
for (auto page : pdfDocument->get_Pages())
{
    // Create an instance of the TableAbsorber class and scan the page for tables
    auto absorber = MakeObject<TableAbsorber>();
    absorber->Visit(page);

    // Iterate through the tables found on the page
    for (auto table : absorber->get_TableList())
    {
        Console::WriteLine(u"Table");

        // Iterate through the rows
        for (auto row : table->get_RowList())
        {
            // Iterate through the cells
            for (auto cell : row->get_CellList())
            {
                // Iterate through the text fragments
                for (auto fragment : cell->get_TextFragments())
                {
                    String text = u"";

                    // Concatenate the text of all segments in the fragment
                    for (auto seg : fragment->get_Segments())
                    {
                        text = String::Concat(text, seg->get_Text());
                    }

                    // Print the text
                    Console::WriteLine(text);
                }
            }
            // Print a blank line after each row
            Console::WriteLine();
        }
    }
}
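The sample above prints every text fragment on its own line. If the extracted data is going to be processed further, it may be more convenient to collect the cell texts of each row and write them out as one CSV line. The following is a minimal sketch of that idea, assuming the cell texts have already been converted to std::string. EscapeCsvField and ToCsvRow are hypothetical helper names used for illustration and are not part of Aspose.PDF for C++.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Aspose.PDF): quotes a field only when it
// contains a comma, a double quote, or a line break.
std::string EscapeCsvField(const std::string& field)
{
    if (field.find_first_of(",\"\r\n") == std::string::npos)
    {
        return field;
    }

    std::string quoted = "\"";
    for (char c : field)
    {
        if (c == '"')
        {
            quoted += '"'; // An embedded quote is written as two quotes
        }
        quoted += c;
    }
    quoted += '"';
    return quoted;
}

// Hypothetical helper: joins the cell texts of one table row into a CSV line.
std::string ToCsvRow(const std::vector<std::string>& cells)
{
    std::string line;
    for (std::size_t i = 0; i < cells.size(); ++i)
    {
        if (i > 0)
        {
            line += ',';
        }
        line += EscapeCsvField(cells[i]);
    }
    return line;
}
```

In the extraction loop, you could fill a std::vector<std::string> with one entry per cell and call ToCsvRow once per row instead of printing each fragment separately.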

Extract Data from a Table in a Specific Area of a PDF Page

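To extract data from a table in a specific area of a PDF page, mark the area with a square annotation and follow the steps given below.

- Load the PDF file using the Document class.
- Get the page that contains the table.
- Iterate through the annotations on the page and find the square annotation that marks the area.
- Create an instance of the TableAbsorber class and call its Visit method with the page.
- For each table in get_TableList(), check whether the table's rectangle lies inside the annotation's rectangle.
- For each table inside the area, iterate through its rows, cells, text fragments, and text segments to read the text.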

The following sample code demonstrates how to extract data from a table in a specific area of a PDF page using C++.

// Assumes using-directives for the System, Aspose::Pdf, Aspose::Pdf::Text,
// and Aspose::Pdf::Annotations namespaces
// Load the PDF document
auto pdfDocument = MakeObject<Document>(u"SourceDirectory\\PDF\\Table_input4.pdf");

// Get the first page of the document
auto page = pdfDocument->get_Pages()->idx_get(1);

// Iterate through the annotations on the page
for (auto annotation : page->get_Annotations())
{
    // Check the annotation type
    if (annotation->get_AnnotationType() == Annotations::AnnotationType::Square)
    {
        System::SharedPtr<SquareAnnotation> squareAnnotation = DynamicCast<SquareAnnotation>(annotation);
        auto region = squareAnnotation->get_Rect();

        // Create an instance of the TableAbsorber class and scan the page for tables
        auto absorber = MakeObject<TableAbsorber>();
        absorber->Visit(page);

        // Iterate through the tables
        for (auto table : absorber->get_TableList())
        {
            auto tableRect = table->get_Rectangle();

            // Check if the table lies strictly inside the annotated region
            if ((region->get_LLX() < tableRect->get_LLX()) &&
                (region->get_LLY() < tableRect->get_LLY()) &&
                (region->get_URX() > tableRect->get_URX()) &&
                (region->get_URY() > tableRect->get_URY()))
            {
                // Iterate through the rows
                for (auto row : table->get_RowList())
                {
                    // Iterate through the cells
                    for (auto cell : row->get_CellList())
                    {
                        // Iterate through the text fragments
                        for (auto fragment : cell->get_TextFragments())
                        {
                            String text = u"";

                            // Concatenate the text of all segments in the fragment
                            for (auto seg : fragment->get_Segments())
                            {
                                text = String::Concat(text, seg->get_Text());
                            }

                            // Print the text
                            Console::WriteLine(text);
                        }
                    }
                    // Print a blank line after each row
                    Console::WriteLine();
                }
            }
        }
        // Only the first square annotation is processed
        break;
    }
}
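The region test in the sample above compares the four corner coordinates of two rectangles. To make that logic easier to follow in isolation, here is a small standalone sketch of the same comparison. Rect and StrictlyContains are hypothetical names introduced for illustration; the sketch uses a plain struct rather than the library's rectangle type and assumes coordinates that grow to the right and upward, with LLX/LLY as the lower-left corner and URX/URY as the upper-right corner.

```cpp
// Hypothetical stand-in for a page rectangle: lower-left (llx, lly) and
// upper-right (urx, ury) corners.
struct Rect
{
    double llx;
    double lly;
    double urx;
    double ury;
};

// Returns true only when `inner` lies strictly inside `outer` on all four
// sides. A rectangle that touches or crosses an edge of `outer` is rejected,
// which mirrors the strict < and > comparisons in the sample.
bool StrictlyContains(const Rect& outer, const Rect& inner)
{
    return outer.llx < inner.llx &&
           outer.lly < inner.lly &&
           outer.urx > inner.urx &&
           outer.ury > inner.ury;
}
```

Because the comparison is strict, a table whose border coincides with the annotation's border is skipped. Drawing the annotation slightly larger than the table avoids that; alternatively, the comparisons could be relaxed to <= and >= if tables touching the edge should count.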

Get a Free License

In order to try the API without evaluation limitations, you can request a free temporary license.

Conclusion

In this article, you have learned how to extract data from PDF tables using C++, both from every table in a document and from a table in a specific region of a PDF page. The Aspose.PDF for C++ API provides many additional features for working with PDF files, which you can explore in detail in the official documentation. If you have any questions, please feel free to reach out to us on our free support forum.
