PDF is a standard format for exchanging documents over the internet. Invoices and product guides, for example, are usually shared as PDFs. If you have many invoices containing tabular data that needs further processing, extracting that data programmatically is more efficient than copying it manually. This article shows how to extract data from PDF tables using C++.
- C++ API for Extracting Data from Tables in PDF Files
- Extract Data from PDF Tables using C++
- Extract Data from a Table in a Specific Area of a PDF Page
C++ API for Extracting Data from Tables in PDF Files
Aspose.PDF for C++ is a C++ library that allows you to create, read, and update PDF files. Furthermore, the API supports extracting data from tables in PDF files. You can either install the API through NuGet or download it directly from the downloads section.
PM> Install-Package Aspose.PDF.Cpp
Extract Data from PDF Tables using C++
The following are the steps to extract data from PDF tables.
- Load the PDF document using the Document class.
- Iterate through the pages of the document using the Document->get_Pages() method.
- In each iteration, create an instance of the TableAbsorber class and specify the page for extracting tables using the TableAbsorber->Visit(System::SharedPtr<Page> page) method.
- Get the tables using the TableAbsorber->get_TableList() method and iterate over them.
- For each AbsorbedTable, iterate through the rows using the AbsorbedTable->get_RowList() method.
- For each AbsorbedRow, iterate through the cells using the AbsorbedRow->get_CellList() method.
- Get TextFragmentCollection for each AbsorbedCell using the AbsorbedCell->get_TextFragments() method and loop through it.
- Get TextSegmentCollection for each TextFragment using the TextFragment->get_Segments() method and loop through it.
- Retrieve the text from each TextSegment and print it.
The following sample code shows how to extract data from PDF tables using C++.
// Assumes the Aspose.PDF headers are included and that using-directives for
// System, Aspose::Pdf, and Aspose::Pdf::Text are in effect.

// Load the PDF document
auto pdfDocument = MakeObject<Document>(u"SourceDirectory\\PDF\\Table_input3.pdf");

// Iterate through the pages of the document
for (auto page : pdfDocument->get_Pages())
{
    // Create an instance of the TableAbsorber class and find the tables on this page
    auto absorber = MakeObject<TableAbsorber>();
    absorber->Visit(page);

    // Iterate through the tables
    for (auto table : absorber->get_TableList())
    {
        Console::WriteLine(u"Table");

        // Iterate through the rows
        for (auto row : table->get_RowList())
        {
            // Iterate through the cells
            for (auto cell : row->get_CellList())
            {
                // Iterate through the text fragments
                for (auto fragment : cell->get_TextFragments())
                {
                    String text = u"";

                    // Concatenate the text of all segments in the fragment
                    for (auto seg : fragment->get_Segments())
                    {
                        text = String::Concat(text, seg->get_Text());
                    }

                    // Print the text
                    Console::WriteLine(text);
                }
            }
            // Print a blank line after each row
            Console::WriteLine();
        }
    }
}
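The sample above prints every text fragment on its own line. If the extracted data is meant for further processing, it can be more convenient to collect the cell texts of each row and write them as one delimited line. The following is a minimal sketch of that idea in standard C++. The helpers escapeCsvField and toCsvLine are hypothetical names introduced for illustration and are not part of Aspose.PDF; the sketch also assumes that each cell's text has already been converted to a std::string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Aspose.PDF): wrap a field in double quotes
// when it contains a comma, a double quote, or a line break.
std::string escapeCsvField(const std::string& field)
{
    if (field.find_first_of(",\"\r\n") == std::string::npos)
        return field;

    std::string quoted = "\"";
    for (char c : field)
    {
        if (c == '"')
            quoted += '"'; // an embedded quote is written as two quotes
        quoted += c;
    }
    quoted += '"';
    return quoted;
}

// Hypothetical helper: join the cell texts of one table row into a CSV line.
std::string toCsvLine(const std::vector<std::string>& cells)
{
    std::string line;
    for (std::size_t i = 0; i < cells.size(); ++i)
    {
        if (i > 0)
            line += ',';
        line += escapeCsvField(cells[i]);
    }
    return line;
}
```

In the extraction loop, you would push the concatenated text of each cell into a std::vector<std::string> and call toCsvLine once per row, instead of printing each fragment separately.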
Extract Data from a Table in a Specific Area of a PDF Page
To extract data from a table located in a specific area of a PDF page, follow the steps below. The sample relies on a square annotation on the page to mark the area of interest.
- Load the PDF document using the Document class.
- Retrieve the page containing the table using the Document->get_Pages()->idx_get(int32_t index) method.
- Loop through the annotations on the page and find the square annotation that marks the region.
- Create an instance of the TableAbsorber class and specify the page for extracting tables using the TableAbsorber->Visit(System::SharedPtr<Page> page) method.
- Get the tables using the TableAbsorber->get_TableList() method and iterate over them.
- If the table is in the region, perform the following steps:
  - Iterate through the rows of the AbsorbedTable using the AbsorbedTable->get_RowList() method.
  - For each AbsorbedRow, iterate through the cells using the AbsorbedRow->get_CellList() method.
  - Get TextFragmentCollection for each AbsorbedCell using the AbsorbedCell->get_TextFragments() method and loop through it.
  - Get TextSegmentCollection for each TextFragment using the TextFragment->get_Segments() method and loop through it.
  - Retrieve the text from each TextSegment and print it.
The following sample code demonstrates how to extract data from a table in a specific area of a PDF page using C++.
// Assumes the Aspose.PDF headers are included and that using-directives for
// System, Aspose::Pdf, Aspose::Pdf::Text, and Aspose::Pdf::Annotations are in effect.

// Load the PDF document
auto pdfDocument = MakeObject<Document>(u"SourceDirectory\\PDF\\Table_input4.pdf");

// Get the first page of the document (page indexes start at 1)
auto page = pdfDocument->get_Pages()->idx_get(1);

// Iterate through the annotations on the page
for (auto annotation : page->get_Annotations())
{
    // Only a square annotation marks the region of interest
    if (annotation->get_AnnotationType() == Annotations::AnnotationType::Square)
    {
        System::SharedPtr<SquareAnnotation> squareAnnotation = DynamicCast<SquareAnnotation>(annotation);
        auto region = squareAnnotation->get_Rect();

        // Create an instance of the TableAbsorber class and find the tables on the page
        auto absorber = MakeObject<TableAbsorber>();
        absorber->Visit(page);

        // Iterate through the tables
        for (auto table : absorber->get_TableList())
        {
            auto bounds = table->get_Rectangle();

            // Check that the table lies strictly inside the annotated region
            if ((region->get_LLX() < bounds->get_LLX()) &&
                (region->get_LLY() < bounds->get_LLY()) &&
                (region->get_URX() > bounds->get_URX()) &&
                (region->get_URY() > bounds->get_URY()))
            {
                // Iterate through the rows
                for (auto row : table->get_RowList())
                {
                    // Iterate through the cells
                    for (auto cell : row->get_CellList())
                    {
                        // Iterate through the text fragments
                        for (auto fragment : cell->get_TextFragments())
                        {
                            String text = u"";

                            // Concatenate the text of all segments in the fragment
                            for (auto seg : fragment->get_Segments())
                            {
                                text = String::Concat(text, seg->get_Text());
                            }

                            // Print the text
                            Console::WriteLine(text);
                        }
                    }
                    // Print a blank line after each row
                    Console::WriteLine();
                }
            }
        }
        // Stop after the first square annotation
        break;
    }
}
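The region check in the sample compares the lower-left (LLX, LLY) and upper-right (URX, URY) corners of the annotation rectangle with those of the table rectangle. The sketch below isolates that comparison in standard C++ so the logic is easier to follow and test. The Rect struct and the isStrictlyInside function are hypothetical stand-ins for illustration, not Aspose.PDF types; the sketch assumes the default PDF coordinate system, where the origin is at the bottom-left of the page and Y increases upward.

```cpp
// Hypothetical stand-in for a PDF rectangle, described by its
// lower-left (llx, lly) and upper-right (urx, ury) corners.
struct Rect
{
    double llx;
    double lly;
    double urx;
    double ury;
};

// Returns true when `table` lies strictly inside `region`, using the same
// four comparisons as the sample: a table that touches or crosses the
// border of the region is rejected.
bool isStrictlyInside(const Rect& region, const Rect& table)
{
    return region.llx < table.llx &&
           region.lly < table.lly &&
           region.urx > table.urx &&
           region.ury > table.ury;
}
```

Because the comparisons are strict, a table whose edge coincides exactly with the annotation border is not treated as inside the region. If that is too restrictive for your documents, one option is to use <= and >= or to draw the annotation with a small margin around the table.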
Get a Free License
To try the API without evaluation limitations, you can request a free temporary license.
Conclusion
In this article, you have learned how to extract data from PDF tables using C++, both from every table in a document and from a table in a specific region of a page. Aspose.PDF for C++ provides many additional features for working with PDF files, which you can explore in the official documentation. If you have any questions, feel free to reach out to us on our free support forum.