The PDF format is widely used for documents such as invoices and business reports. When processing PDF files programmatically, you may need to extract their text. This article demonstrates a simple and fast solution for PDF text extraction. So let’s learn how to extract text from PDF files in C#.
The following is a list of features we will be discussing in this blog:
- Extract Text from a Whole PDF Document using C#
- Search and Extract Text from a Specific Page in PDF Document using C#
- Search and Extract Text from Specific Region of PDF Page using C#
- Search or Extract Specific Text from PDF file with Regular Expressions using C#
- Extract Text from Table Cells in PDF Document using C#
- Search & Extract Highlighted Text from PDF Document using C#
- Extract PDF Text with Low Memory Consumption using C#
C# PDF Text Extraction Library
To extract text from PDF files, we will use Aspose.PDF for .NET. It is a feature-rich library for generating and processing PDF files, and you can use it for a wide range of PDF manipulation tasks. You can also evaluate its text extraction features for free. You can either download its DLL or install it from NuGet.
PM> Install-Package Aspose.PDF
How to Extract Text from a PDF in C#
Reading the text content of a PDF document is a common requirement. You can extract the text from all pages of the document in one go. The following are the steps to extract text from a PDF file in C#.
- Load PDF document
- Create TextAbsorber object
- Accept the absorber for all pages
- Write extracted text to a file and save it
The following code snippet follows these steps and shows how to extract text from a whole PDF document in C#.
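A minimal sketch of these steps, assuming the Aspose.PDF NuGet package is installed. The file names `input.pdf` and `extracted-text.txt` are placeholders chosen for illustration.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractAllText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // TextAbsorber collects the text of every page it visits.
            TextAbsorber textAbsorber = new TextAbsorber();

            // Accept the absorber for all pages of the document.
            pdfDocument.Pages.Accept(textAbsorber);

            // Write the extracted text to a file (placeholder output path).
            File.WriteAllText("extracted-text.txt", textAbsorber.Text);
        }
    }
}
```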
C#: Search and Extract Text from a Specific Page in PDF
After exploring text extraction at the document level, let us extract text from a specific page of a PDF document. Follow the steps below to extract text from a particular PDF page.
- Load PDF document
- Create TextAbsorber object
- Accept the absorber for a specific page
- Write extracted text to a file and save it
The following code snippet reads and extracts text from a page in a PDF file using C#:
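A minimal sketch of the page-level variant, under the same assumptions as before (Aspose.PDF installed, placeholder file names). The page number 1 is only an example.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractPageText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextAbsorber textAbsorber = new TextAbsorber();

            // Accept the absorber for a single page only.
            // The Pages collection is 1-based, so Pages[1] is the first page.
            pdfDocument.Pages[1].Accept(textAbsorber);

            // Write the text of that page to a file (placeholder output path).
            File.WriteAllText("page-1-text.txt", textAbsorber.Text);
        }
    }
}
```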
Extract Text from Specific Region of PDF Page
Let us take text extraction a step further and extract text from a specific area of a page. It is important to understand that the basic unit of measurement in Aspose.PDF is the point, where 72 points equal 1 inch. Moreover, the bottom-left corner of the page is treated as the origin (0,0).
The following are the steps to extract text from a specific area or section of a page in a PDF.
- Load source PDF document
- Create TextAbsorber object
- Set TextSearchOptions and specify rectangular coordinates
- Accept the absorber for a specific page
- Write extracted text to a file
The code snippet below demonstrates how to extract text from a specific region of a page in C#.
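A minimal sketch of region-based extraction. The rectangle coordinates are arbitrary example values in points, measured from the bottom-left corner of the page, and the file names are placeholders. The available `Rectangle` constructor overloads may differ between library versions, so treat the four-argument form as an assumption to check against your version.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractRegionText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextAbsorber absorber = new TextAbsorber();

            // Restrict the search to a rectangle given as
            // (lower-left x, lower-left y, upper-right x, upper-right y) in points.
            // Example: a region from 1 inch (72 pt) to 4 inches (288 pt) horizontally
            // and from 1 inch (72 pt) to 3 inches (216 pt) vertically.
            absorber.TextSearchOptions.LimitToPageBounds = true;
            absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(72, 72, 288, 216);

            // Accept the absorber for the first page.
            pdfDocument.Pages[1].Accept(absorber);

            // Write the text found inside the region to a file (placeholder output path).
            File.WriteAllText("region-text.txt", absorber.Text);
        }
    }
}
```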
C# Extract Text from PDF with Regular Expressions
You can also extract text from a PDF document that matches some pattern. For instance, you might want to extract some specific words or numbers. For this purpose, you would need to design a regular expression. The library will utilize that regex to find matching text in the PDF document.
Below are the steps to search and extract specific text from PDF files using regex.
- Load input PDF document
- Create TextFragmentAbsorber object
- Design and set a regular expression to match some text
- Loop through each extracted text fragment
The following C# code snippet uses a regex that searches for text containing 4 digits, for instance, 1999, 2000, etc., across all pages of the document.
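A minimal sketch of a regex search, assuming `TextFragmentAbsorber` is used so that each match is available as a separate text fragment. The pattern `\d{4}` and the file name are illustrative assumptions.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTextWithRegex
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Search phrase: any run of four digits, such as 1999 or 2000.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(@"\d{4}");

            // Passing true tells the absorber to treat the phrase as a regular expression.
            absorber.TextSearchOptions = new TextSearchOptions(true);

            // Search all pages of the document.
            pdfDocument.Pages.Accept(absorber);

            // Loop through each matching text fragment.
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                Console.WriteLine(fragment.Text);
            }
        }
    }
}
```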
Extract Text from a PDF Table in C#
We have been working with the TextAbsorber class in the previous examples, but extracting text from a table is a bit different. Follow the steps below to extract text from a table in a PDF:
- Load the source PDF file
- Create an object of TableAbsorber class
- Iterate through AbsorbedTable
- Loop through each cell in each row of Table
The following code snippet extracts the text from a table’s cells in a PDF using C#.
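A minimal sketch of table extraction, assuming a fresh `TableAbsorber` is created for each page. The file name is a placeholder, and printing cells to the console is just one way to consume the text.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTableText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                // Detect the tables on the current page.
                TableAbsorber absorber = new TableAbsorber();
                absorber.Visit(page);

                // Walk tables, then rows, then cells.
                foreach (AbsorbedTable table in absorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        foreach (AbsorbedCell cell in row.CellList)
                        {
                            // A cell may hold several text fragments.
                            foreach (TextFragment fragment in cell.TextFragments)
                            {
                                Console.Write(fragment.Text);
                            }
                            Console.Write("\t"); // separate cells with a tab
                        }
                        Console.WriteLine(); // one output line per table row
                    }
                }
            }
        }
    }
}
```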
Extract Highlighted Text from a PDF in C#
Highlighted text is stored as annotations in PDF files. These annotations contain marked text, which sets it apart from the regular text in a document. The steps below describe how to extract highlighted text using C#:
- Load source document
- Loop through all the annotations
- Filter TextMarkupAnnotation
- Retrieve highlighted text fragments
The following code snippet is based on the steps above and can be used to get highlighted text from PDF files in C#.
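A minimal sketch of these steps. It loops over every page, which is an assumption made for illustration, and the file name is a placeholder.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ExtractHighlightedText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                // Loop through all annotations on the page.
                foreach (Annotation annotation in page.Annotations)
                {
                    // Keep only text markup annotations, which include highlights.
                    TextMarkupAnnotation markup = annotation as TextMarkupAnnotation;
                    if (markup == null)
                    {
                        continue;
                    }

                    // Retrieve the text fragments covered by the markup.
                    TextFragmentCollection fragments = markup.GetMarkedTextFragments();
                    foreach (TextFragment fragment in fragments)
                    {
                        Console.WriteLine(fragment.Text);
                    }
                }
            }
        }
    }
}
```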
C# PDF Text Extraction with Low Memory Consumption
The following are two approaches to reducing memory consumption while extracting text from PDF documents using C#.
i) Using Reset() and FreeMemory() Methods
Text extraction can consume a lot of memory and processor time, particularly when the input file is large and contains a lot of text. This happens because the TextFragmentAbsorber object keeps all found text fragments in memory.
Therefore, the solution we recommend is to call the absorber.Reset() method after processing each page. Moreover, if you are performing read operations only, you can free the memory held by page objects with the page.FreeMemory() method.
The following are the steps to minimize memory consumption in PDF text extraction in C#.
- Load source PDF document
- Specify TextEditOptions
- Read TextFragment
- Call Reset() method
- Call FreeMemory() method
The following code snippet demonstrates how to lower the memory consumption in text extraction from a PDF in C#:
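A minimal sketch of the page-by-page approach. The file name is a placeholder, and the choice of `FontReplace.RemoveUnusedFonts` for TextEditOptions is an assumption made for illustration.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTextLowMemory
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Specify TextEditOptions when creating the absorber.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(
                new TextEditOptions(TextEditOptions.FontReplace.RemoveUnusedFonts));

            foreach (Page page in pdfDocument.Pages)
            {
                // Process one page at a time instead of the whole document.
                page.Accept(absorber);

                // Read the fragments found on this page.
                foreach (TextFragment fragment in absorber.TextFragments)
                {
                    Console.WriteLine(fragment.Text);
                }

                // Drop the fragments collected so far.
                absorber.Reset();

                // Release data cached by the page (suitable for read-only use).
                page.FreeMemory();
            }
        }
    }
}
```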
ii) Using MemorySaving enumeration
The Aspose.PDF for .NET API supports a memory saving mode for reading text from a PDF file. The TextExtractionOptions.TextFormattingMode enumeration provides this mode to keep resource consumption low. The following are the steps for this approach:
- Load source PDF file
- Create TextDevice object
- Set TextExtractionOptions
- Extract text and write text to a file
The following code snippet demonstrates the memory-saving approach using C#:
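A minimal sketch of this approach, assuming the text produced by TextDevice is decoded as UTF-16 (`Encoding.Unicode`). The file names are placeholders.

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

class ExtractTextMemorySaving
{
    static void Main()
    {
        StringBuilder builder = new StringBuilder();

        // "input.pdf" is a placeholder path (assumption).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                using (MemoryStream textStream = new MemoryStream())
                {
                    // Create the text device and switch it to memory saving mode.
                    TextDevice textDevice = new TextDevice();
                    textDevice.ExtractionOptions = new TextExtractionOptions(
                        TextExtractionOptions.TextFormattingMode.MemorySaving);

                    // Extract the text of the current page into the stream.
                    textDevice.Process(page, textStream);

                    // Decode the bytes written by the device (assumed UTF-16).
                    builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
                }
            }
        }

        // Write the collected text to a file (placeholder output path).
        File.WriteAllText("extracted-text.txt", builder.ToString());
    }
}
```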
Free C# PDF Text Extraction Library
You can get a free temporary license and extract text from PDF files without any limitations.
Explore C# PDF Library
You can learn more about the C# PDF library in its documentation. If you have any questions, please reach out to us on the Free Support Forums.
Conclusion
In this article, you have learned how to extract text from PDF files in C#. With the help of code samples, you have seen how to extract text from a whole PDF, from a single page, from highlighted annotations, and much more. You can easily integrate the provided code samples and the C# PDF library into your .NET applications.