Extract Data from Tables in PDF using C#

PDF is one of the most widely used document formats across many fields. It is often used for invoices and similar documents in which data is laid out in tables. In such cases, you may need to parse the PDF and read the table data programmatically. This article covers how to extract data from PDF tables using C#.

C# Library to Extract PDF Tables
Extract Data from PDF Tables in C#
Extract Table from a Specific Area of Page

C# Library to Extract Tables from PDF

To extract data from the tables in PDF files, we will use Aspose.PDF for .NET. It is a powerful library that provides a wide range of PDF manipulation features. You can either download the library or install it using NuGet.

PM> Install-Package Aspose.PDF

Extract Data from PDF Tables in C#

The following are the steps to extract data from tables in a PDF using C#.

Load the PDF document using the Document class.
Loop through the pages of the PDF using the Document.Pages collection.
In each iteration, initialize a TableAbsorber object and visit the current page using the TableAbsorber.Visit(Page) method.
In a nested loop, iterate through the tables in the TableAbsorber.TableList collection.
For each AbsorbedTable in the collection, iterate through the rows in AbsorbedTable.RowList.
For each AbsorbedRow in the collection, iterate through the cells in AbsorbedRow.CellList.
Finally, loop through the TextFragments collection of each AbsorbedCell and print the text.

The following code sample shows how to extract text from PDF tables in C#.
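The steps above can be sketched roughly as follows. This is a minimal example rather than a definitive implementation: the file name input.pdf is a placeholder, and the class name ExtractPdfTables is chosen only for illustration. The text of each cell is assembled from the segments of its text fragments and printed on its own line.

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractPdfTables
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption); point it at your own file.
        Document pdfDocument = new Document("input.pdf");

        foreach (Page page in pdfDocument.Pages)
        {
            // Use a fresh absorber per page so TableList holds only this page's tables.
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        // A cell may contain several fragments, each made of segments.
                        StringBuilder cellText = new StringBuilder();
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            foreach (TextSegment segment in fragment.Segments)
                            {
                                cellText.Append(segment.Text);
                            }
                        }
                        Console.WriteLine(cellText.ToString());
                    }
                }
            }
        }
    }
}
```

If you need the data in a structured form instead of console output, you could collect the cell strings into a list per row inside the same loops.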

Extract Table from a Specific Area of PDF

The following are the steps to extract a table from a specific part of the page in a PDF using C#.

Load the PDF document using the Document class.
Select the desired Page from the Document.Pages collection.
Get the Square annotation from the page; its rectangle defines the region of interest.
Initialize a TableAbsorber object and visit the page using the TableAbsorber.Visit(Page) method.
In a loop, iterate through the tables in the TableAbsorber.TableList collection.
If the table lies within the region, perform the following steps.
- Iterate through the rows in AbsorbedTable.RowList.
- For each AbsorbedRow in the collection, iterate through the cells in AbsorbedRow.CellList.
- Finally, loop through the TextFragments collection of each AbsorbedCell and print the text.

The following code sample shows how to extract a table from a specific region of a PDF page.
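A possible sketch of these steps is shown below. Several details are assumptions made for illustration: the file name input.pdf, the choice of the first page, the use of the first Square annotation found on that page as the region marker, and the strict containment test used to decide whether a table is "in the region". The class name ExtractTableFromRegion and the helper PrintTable are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ExtractTableFromRegion
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        Document pdfDocument = new Document("input.pdf");

        // Page indexes in Document.Pages start at 1; the first page is assumed here.
        Page page = pdfDocument.Pages[1];

        // Assumption: the region of interest is marked by the first Square annotation.
        SquareAnnotation square = page.Annotations.OfType<SquareAnnotation>().FirstOrDefault();
        if (square == null)
        {
            Console.WriteLine("No Square annotation found on the page.");
            return;
        }

        TableAbsorber absorber = new TableAbsorber();
        absorber.Visit(page);

        foreach (AbsorbedTable table in absorber.TableList)
        {
            // One way to test the region: the table rectangle must lie
            // entirely inside the annotation rectangle.
            bool isInRegion =
                square.Rect.LLX < table.Rectangle.LLX &&
                square.Rect.LLY < table.Rectangle.LLY &&
                square.Rect.URX > table.Rectangle.URX &&
                square.Rect.URY > table.Rectangle.URY;

            if (isInRegion)
            {
                PrintTable(table);
            }
        }
    }

    // Hypothetical helper: prints the text of every cell, one cell per line.
    static void PrintTable(AbsorbedTable table)
    {
        foreach (AbsorbedRow row in table.RowList)
        {
            foreach (AbsorbedCell cell in row.CellList)
            {
                StringBuilder cellText = new StringBuilder();
                foreach (TextFragment fragment in cell.TextFragments)
                {
                    foreach (TextSegment segment in fragment.Segments)
                    {
                        cellText.Append(segment.Text);
                    }
                }
                Console.WriteLine(cellText.ToString());
            }
        }
    }
}
```

The containment check is deliberately strict. If your annotation is drawn tightly around the table, you may prefer an overlap test or a small tolerance instead.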

Free C# PDF Library to Extract Tables

You can get a free temporary license and extract data from tables in PDF files without any limitations.

Explore C# PDF Library

You can learn more about the C# PDF library in the documentation. You can also post your questions on our forum.

Conclusion

In this article, you have learned how to extract data from tables in a PDF using C#. You have also seen how to extract a table from a specific region of a PDF page. The step-by-step guides and code samples show how to process a PDF table and extract its data.
