PDF is a standard format for exchanging documents over the internet, and documents such as invoices and product guides are usually shared as PDFs. You may have multiple invoices containing tabular data that you need to extract and process further, and extracting this data programmatically is more efficient than copying it by hand. To that end, this article will teach you how to extract data from PDF tables using C++.
- C++ API for Extracting Data from Tables in PDF Files
- Extract Data from PDF Tables using C++
- Extract Data from a Table in a Specific Area of a PDF Page
C++ API for Extracting Data from Tables in PDF Files
Aspose.PDF for C++ is a C++ library that allows you to create, read, and update PDF files. Furthermore, the API supports extracting data from tables in PDF files. You can either install the API through NuGet or download it directly from the downloads section.
PM> Install-Package Aspose.PDF.Cpp
Extract Data from PDF Tables using C++
The following are the steps to extract data from PDF tables.
- Load the PDF document using the Document class.
- Iterate through the pages of the document using the Document->get_Pages() method.
- In each iteration, create an instance of the TableAbsorber class and specify the page for extracting tables using the TableAbsorber->Visit(System::SharedPtr<Page> page) method.
- Get the tables using the TableAbsorber->get_TableList() method and iterate over them.
- For each AbsorbedTable, iterate through the rows using the AbsorbedTable->get_RowList() method.
- For each AbsorbedRow, iterate through the cells using the AbsorbedRow->get_CellList() method.
- Get TextFragmentCollection for each AbsorbedCell using the AbsorbedCell->get_TextFragments() method and loop through it.
- Get TextSegmentCollection for each TextFragment using the TextFragment->get_Segments() method and loop through it.
- Retrieve the text from each TextSegment and print it.
The following sample code shows how to extract data from PDF tables using C++.
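To make the nesting in the steps above concrete, here is a minimal sketch of the traversal order only. It does not call Aspose.PDF: the structs `Segment`, `Fragment`, `Cell`, `Row`, and `Table` and the helpers `cellText` and `tableToLines` are hypothetical stand-ins that mirror the table, row, cell, text fragment, and text segment hierarchy, so the loop structure can be read and compiled on its own. With the real library, each inner vector would be replaced by the collection returned from the corresponding getter named in the steps.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the hierarchy described in the steps.
// These are NOT the Aspose.PDF types; they only mirror the nesting.
struct Segment  { std::string text; };
struct Fragment { std::vector<Segment> segments; };
struct Cell     { std::vector<Fragment> fragments; };
struct Row      { std::vector<Cell> cells; };
struct Table    { std::vector<Row> rows; };

// Concatenate the text of every segment of every fragment in one cell.
std::string cellText(const Cell& cell) {
    std::string text;
    for (const Fragment& fragment : cell.fragments) {
        for (const Segment& segment : fragment.segments) {
            text += segment.text;
        }
    }
    return text;
}

// Render each row of the table as one line, with cells separated by '|'.
std::vector<std::string> tableToLines(const Table& table) {
    std::vector<std::string> lines;
    for (const Row& row : table.rows) {
        std::string line;
        for (std::size_t i = 0; i < row.cells.size(); ++i) {
            if (i > 0) {
                line += '|';
            }
            line += cellText(row.cells[i]);
        }
        lines.push_back(line);
    }
    return lines;
}
```

Joining the segments before printing matters because a single cell can be split into several fragments and segments; printing each segment separately would scatter one cell's text across the output.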
Extract Data from a Table in a Specific Area of a PDF Page
To extract data from a table in a specific area of a PDF page, follow the steps below.
- Load the PDF document using the Document class.
- Retrieve the page containing the table using the Document->get_Pages()->idx_get(int32_t index) method.
- Loop through the page's annotations and get the square annotation that marks the region of interest.
- Create an instance of the TableAbsorber class and specify the page for extracting tables using the TableAbsorber->Visit(System::SharedPtr<Page> page) method.
- Get the tables using the TableAbsorber->get_TableList() method and iterate over them.
- If the table is in the region, perform the following steps:
  - Iterate through the rows of the AbsorbedTable using the AbsorbedTable->get_RowList() method.
  - For each AbsorbedRow, iterate through the cells using the AbsorbedRow->get_CellList() method.
  - Get TextFragmentCollection for each AbsorbedCell using the AbsorbedCell->get_TextFragments() method and loop through it.
  - Get TextSegmentCollection for each TextFragment using the TextFragment->get_Segments() method and loop through it.
  - Retrieve the text from each TextSegment and print it.
The following sample code demonstrates how to extract data from a table in a specific area of a PDF page using C++.
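The "is the table in the region" step comes down to a rectangle containment check, which can be sketched without the library. Everything below is an assumption made for illustration: `Rect`, `AnnotationKind`, `Annotation`, `findSquareAnnotation`, and `isInRegion` are hypothetical stand-ins, not Aspose.PDF types or methods. Coordinates follow the usual PDF convention of a lower-left corner (`llx`, `lly`) and an upper-right corner (`urx`, `ury`).

```cpp
#include <vector>

// Hypothetical stand-in for a PDF rectangle; NOT an Aspose.PDF type.
struct Rect {
    double llx;  // lower-left x
    double lly;  // lower-left y
    double urx;  // upper-right x
    double ury;  // upper-right y
};

// Hypothetical stand-in for a page annotation.
enum class AnnotationKind { Square, Other };
struct Annotation {
    AnnotationKind kind;
    Rect rect;
};

// Return the first square annotation, or nullptr if the page has none.
const Annotation* findSquareAnnotation(const std::vector<Annotation>& annotations) {
    for (const Annotation& annotation : annotations) {
        if (annotation.kind == AnnotationKind::Square) {
            return &annotation;
        }
    }
    return nullptr;
}

// True when the table rectangle lies strictly inside the region.
// Strict comparison is an assumption of this sketch: a table that
// touches the region's border is treated as outside it.
bool isInRegion(const Rect& region, const Rect& table) {
    return region.llx < table.llx && region.lly < table.lly &&
           region.urx > table.urx && region.ury > table.ury;
}
```

In practice, the region would come from the square annotation's rectangle and the candidate from each table's bounding rectangle; only tables for which `isInRegion` returns true are then traversed row by row. If tables drawn flush against the annotation border should also count, the comparisons can be relaxed to `<=` and `>=`.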
Get a Free License
To try the API without evaluation limitations, you can request a free temporary license.
Conclusion
In this article, you have learned how to extract data from PDF tables using C++, including how to extract data from a table in a specific region of a PDF page. The Aspose.PDF for C++ API provides many additional features for working with PDF files, which you can explore in detail in the official documentation. If you have any questions, please feel free to reach out to us on our free support forum.