PDF is one of the most common formats for business documents. In certain cases, we may need to read scanned PDF documents programmatically. Because the pages of a scanned PDF are images, their text cannot be copied or searched directly and must first be recognized with optical character recognition (OCR). In this article, we will learn how to OCR PDF documents and extract text from PDF in C#.
The following topics will be covered in this article:
- OCR PDF to Text C# API
- OCR PDF and Extract Text from PDF
- Perform OCR on PDF and Save Text
- OCR PDF to Word File
- OCR PDF to JSON
OCR PDF to Text C# API
We will be using the Aspose.OCR for .NET API to perform OCR on PDF documents. It can recognize scanned images, smartphone photos, screenshots, and areas of images. The API returns the recognized text in the most popular document and data exchange formats. In addition to converting images to text, the API can create searchable PDFs from scans. It can also autocorrect spelling mistakes in the recognized text.
The AsposeOcr class of the API provides various methods for OCR operations, including the RecognizePdf(string, DocumentRecognitionSettings) method, which recognizes text in the provided PDF document. The DocumentRecognitionSettings class holds the settings for the PDF recognition process, and the RecognitionResult class represents the recognition results.
Please either download the DLL of the API or install it using NuGet.
PM> Install-Package Aspose.OCR
OCR PDF and Extract Text from PDF in C#
We can perform OCR on PDF documents and extract the recognized text by following the steps given below:
- Firstly, create an instance of the AsposeOcr class.
- Next, initialize an object of the DocumentRecognitionSettings class.
- Then, specify the language to be used for OCR.
- After that, get the RecognitionResult list by calling the RecognizePdf() method. It takes the PDF file path and the DocumentRecognitionSettings object as arguments.
- Finally, loop through the RecognitionResult list and show the identified text.
The following sample code shows how to OCR PDF documents and extract the recognized text in C#.
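The steps above can be sketched as follows. This is a minimal example rather than a definitive implementation: `AsposeOcr`, `DocumentRecognitionSettings`, `RecognizePdf`, and `RecognitionResult` come from the API description earlier in this article, while `Language.Eng` and the `RecognitionText` property are assumed member names that should be checked against the version of Aspose.OCR you install. The file name `sample.pdf` and the `FormatPage` helper are placeholders introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using Aspose.OCR;

class OcrPdfToConsole
{
    // Hypothetical helper: label the text of one page for display.
    public static string FormatPage(int pageNumber, string text)
        => $"Page {pageNumber}: {text}";

    static void Main()
    {
        // Placeholder path to a scanned PDF; replace with your own file.
        string pathToPdf = "sample.pdf";

        // Create the OCR engine.
        AsposeOcr api = new AsposeOcr();

        // Configure document recognition and set the OCR language.
        // Language.Eng is assumed to be the enum value for English.
        DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
        settings.Language = Language.Eng;

        // Recognize the PDF and collect the list of results.
        List<RecognitionResult> results = api.RecognizePdf(pathToPdf, settings);

        // Print the recognized text of each result.
        // RecognitionText is an assumed property name.
        int pageNumber = 1;
        foreach (RecognitionResult page in results)
        {
            Console.WriteLine(FormatPage(pageNumber++, page.RecognitionText));
        }
    }
}
```

Page numbering starts at 1 here purely for readability; the API itself only returns the list of results.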
Perform OCR on PDF and Save Text in C#
We can perform OCR on PDF documents and save the recognized text by following the steps given below:
- Firstly, create an instance of the AsposeOcr class.
- Next, initialize an object of the DocumentRecognitionSettings class.
- Then, specify the language to be used for OCR.
- After that, call the RecognizePdf() method to get the RecognitionResult list. It takes the PDF file path and the DocumentRecognitionSettings object as arguments.
- Finally, save the text using the SaveMultipageDocument() method. It takes the output file path, the SaveFormat, and the RecognitionResult list as arguments.
The following sample code shows how to OCR PDF documents and save the recognized text in C#.
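A minimal sketch of these steps is shown here, under a few assumptions: `SaveMultipageDocument` is called as a static method of `AsposeOcr`, and `SaveFormat.Text` and `Language.Eng` are the enum values for plain text and English. These member names are not confirmed by this article, so verify them against your installed version. The `OutputPathFor` helper and the `sample.pdf` path are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.OCR;

class OcrPdfToTextFile
{
    // Hypothetical helper: derive the output path from the input path,
    // e.g. "scan.pdf" with ".txt" gives "scan.txt".
    public static string OutputPathFor(string pdfPath, string extension)
        => Path.ChangeExtension(pdfPath, extension);

    static void Main()
    {
        // Placeholder path to a scanned PDF; replace with your own file.
        string pathToPdf = "sample.pdf";

        AsposeOcr api = new AsposeOcr();

        DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
        settings.Language = Language.Eng; // assumed enum value for English

        // Recognize the PDF and collect the list of results.
        List<RecognitionResult> results = api.RecognizePdf(pathToPdf, settings);

        // Write all results into a single plain-text file.
        // SaveFormat.Text and the static call form are assumptions.
        AsposeOcr.SaveMultipageDocument(
            OutputPathFor(pathToPdf, ".txt"), SaveFormat.Text, results);
    }
}
```

Deriving the output name from the input keeps the text file next to the source PDF, which is convenient when processing several files in a loop.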
OCR PDF and Convert Scanned PDF to Word in C#
We can perform OCR on scanned PDF documents and save the recognized text in a Word document by following the steps mentioned earlier. The only difference is that we specify SaveFormat.Docx in the last step.
The following sample code shows how to OCR PDF and save the recognized text as a Word document in C#.
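As a sketch, the only change from saving plain text is the `SaveFormat.Docx` value named above. The static call to `SaveMultipageDocument` and the `Language.Eng` value are assumptions to verify against your installed version; `DocxPathFor` and `sample.pdf` are hypothetical names used for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.OCR;

class OcrPdfToWord
{
    // Hypothetical helper: "scan.pdf" becomes "scan.docx".
    public static string DocxPathFor(string pdfPath)
        => Path.ChangeExtension(pdfPath, ".docx");

    static void Main()
    {
        // Placeholder path to a scanned PDF; replace with your own file.
        string pathToPdf = "sample.pdf";

        AsposeOcr api = new AsposeOcr();

        DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
        settings.Language = Language.Eng; // assumed enum value for English

        // Recognize the PDF and collect the list of results.
        List<RecognitionResult> results = api.RecognizePdf(pathToPdf, settings);

        // Save the recognized text as a Word document.
        // The static call form is an assumption.
        AsposeOcr.SaveMultipageDocument(
            DocxPathFor(pathToPdf), SaveFormat.Docx, results);
    }
}
```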
OCR PDF and Convert PDF to JSON in C#
We can perform OCR on PDF documents and save the recognized text in a JSON file by following the steps mentioned earlier. The only difference is that we specify SaveFormat.Json in the last step.
The following sample code shows how to OCR PDF and save the recognized text as a JSON file in C#.
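A sketch of the JSON variant follows, using the `SaveFormat.Json` value named above. As before, the static call to `SaveMultipageDocument` and the `Language.Eng` value are assumptions, and `JsonPathFor` and `sample.pdf` are hypothetical. The layout of the generated JSON is not described in this article, so inspect the output file before writing code that parses it.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.OCR;

class OcrPdfToJson
{
    // Hypothetical helper: "scan.pdf" becomes "scan.json".
    public static string JsonPathFor(string pdfPath)
        => Path.ChangeExtension(pdfPath, ".json");

    static void Main()
    {
        // Placeholder path to a scanned PDF; replace with your own file.
        string pathToPdf = "sample.pdf";

        AsposeOcr api = new AsposeOcr();

        DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
        settings.Language = Language.Eng; // assumed enum value for English

        // Recognize the PDF and collect the list of results.
        List<RecognitionResult> results = api.RecognizePdf(pathToPdf, settings);

        // Save the recognized text as a JSON file.
        // The static call form is an assumption.
        AsposeOcr.SaveMultipageDocument(
            JsonPathFor(pathToPdf), SaveFormat.Json, results);
    }
}
```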
Get a Free Evaluation License
You can get a free temporary license to try the library without evaluation limitations.
Conclusion
In this article, we have learned how to perform OCR on PDF documents and extract text from PDF in C#. We have also seen how to save the recognized text as a TXT, DOCX, or JSON file. You can learn more about the Aspose.OCR for .NET API in the documentation. If you have any questions, please feel free to contact us on our forum.