OCR PDF and Extract Text from PDF in C#

PDF is one of the most common formats for business documents. In some cases, we need to read scanned PDF documents programmatically. Because the pages of a scanned PDF are stored as images, their text cannot be selected or searched directly, which has led to the development of OCR (optical character recognition) tools that read and retrieve text from such documents. In this article, we will learn how to OCR PDF documents and extract text from PDF in C#.

The following topics will be covered in this article:

OCR PDF to Text C# API
OCR PDF and Extract Text from PDF
Perform OCR on PDF and Save Text
OCR PDF to Word File
OCR PDF to JSON

OCR PDF to Text C# API

We will be using the Aspose.OCR for .NET API to perform OCR on PDF documents. It can recognize scanned images, smartphone photos, screenshots, and areas of images. The API returns the recognized text in the most popular document and data exchange formats. In addition to converting images to text, the API can create searchable PDFs from scans. It can also autocorrect spelling mistakes in the recognized text.

The AsposeOcr class of the API provides various methods to perform OCR operations. Its RecognizePdf(string, DocumentRecognitionSettings) method recognizes the text in the provided PDF document. The DocumentRecognitionSettings class provides settings for the PDF recognition process. The RecognitionResult class represents the results of the recognition.

Please either download the DLL of the API or install it using NuGet.

PM> Install-Package Aspose.OCR

OCR PDF and Extract Text from PDF in C#

We can perform OCR on PDF documents and extract the recognized text by following the steps given below:

Firstly, create an instance of the AsposeOcr class.
Next, initialize an object of the DocumentRecognitionSettings class.
Then, specify the language to be used for OCR.
After that, get the list of RecognitionResult objects by calling the RecognizePdf() method. It takes the PDF file path and the DocumentRecognitionSettings object as arguments.
Finally, loop through the RecognitionResult list and show the identified text.

The following sample code shows how to OCR PDF documents and extract the recognized text in C#.

	// This code example demonstrates how to OCR PDF documents and extract the recognized text.
	using System;
	using System.Collections.Generic;
	using Aspose.OCR;

	// Initialize the OCR engine
	AsposeOcr recognitionEngine = new AsposeOcr();

	// Initialize recognition settings
	DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

	// Specify language for OCR. Multi-language by default
	recognitionSettings.Language = Language.Eng;

	// Recognize text from PDF
	List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

	// Show the recognized text
	foreach (RecognitionResult result in results)
	{
		Console.WriteLine(result.RecognitionText);
	}
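
When the text of several pages is combined, it can help to keep the page boundaries visible. The following is a minimal sketch under the assumption that each RecognitionResult in the list corresponds to one PDF page. PageTextJoiner and JoinPages are hypothetical names and not part of Aspose.OCR, and the sample strings stand in for result.RecognitionText values.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextJoiner
{
    // Hypothetical helper (not part of Aspose.OCR):
    // merges per-page text into one string with a marker before each page.
    public static string JoinPages(IEnumerable<string> pageTexts)
    {
        var builder = new StringBuilder();
        int pageNumber = 1;
        foreach (string pageText in pageTexts)
        {
            builder.AppendLine("--- Page " + pageNumber + " ---");
            // Treat a missing page text as empty and trim surrounding whitespace.
            builder.AppendLine((pageText ?? string.Empty).Trim());
            pageNumber++;
        }
        return builder.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in strings; in practice these would come from result.RecognitionText.
        var pages = new List<string> { "First page text", "Second page text" };
        Console.Write(PageTextJoiner.JoinPages(pages));
    }
}
```

With the results list from the sample code, the helper could be called as PageTextJoiner.JoinPages(results.Select(r => r.RecognitionText)), which additionally requires using System.Linq.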

Perform OCR on PDF and Save Text in C#

We can perform OCR on PDF documents and save the recognized text by following the steps given below:

Firstly, create an instance of the AsposeOcr class.
Next, initialize an object of the DocumentRecognitionSettings class.
Then, specify the language to be used for OCR.
After that, call the RecognizePdf() method to get the list of RecognitionResult objects. It takes the PDF file path and the DocumentRecognitionSettings object as arguments.
Finally, save the text using the SaveMultipageDocument() method. It takes the output file path, the SaveFormat, and the list of RecognitionResult objects as arguments.

The following sample code shows how to OCR PDF documents and save the recognized text in C#.

	// This code example demonstrates how to OCR PDF documents and save the recognized text.
	using System.Collections.Generic;
	using Aspose.OCR;

	// Initialize the OCR engine
	AsposeOcr recognitionEngine = new AsposeOcr();

	// Initialize recognition settings
	DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

	// Specify language for OCR. Multi-language by default
	recognitionSettings.Language = Language.Eng;

	// Recognize text from PDF
	List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

	// Save the recognized text
	AsposeOcr.SaveMultipageDocument("C:\\Files\\OCR_result.txt", SaveFormat.Text, results);
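
After saving, a quick way to check the result is to read the first few lines of the output file back. The sketch below uses only the .NET standard library. OcrOutputPreview and FirstLines are hypothetical names introduced for illustration, and the path is the one used in the sample code.

```csharp
using System;
using System.IO;
using System.Linq;

public static class OcrOutputPreview
{
    // Hypothetical helper: returns at most maxLines lines from the file,
    // or an empty array if the file does not exist.
    public static string[] FirstLines(string path, int maxLines)
    {
        if (!File.Exists(path))
        {
            return Array.Empty<string>();
        }
        return File.ReadLines(path).Take(maxLines).ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        // Same output path as in the sample code.
        string outputPath = "C:\\Files\\OCR_result.txt";
        string[] lines = OcrOutputPreview.FirstLines(outputPath, 5);
        if (lines.Length == 0)
        {
            Console.WriteLine("No text found in " + outputPath);
        }
        foreach (string line in lines)
        {
            Console.WriteLine(line);
        }
    }
}
```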

OCR PDF and Convert Scanned PDF to Word in C#

We can perform OCR on scanned PDF documents and save the recognized text as a Word document by following the steps mentioned earlier. The only difference is that we specify SaveFormat.Docx in the last step.

The following sample code shows how to OCR PDF and save the recognized text as a Word document in C#.

	// This code example demonstrates how to OCR PDF documents and save the recognized text as DOCX.
	using System.Collections.Generic;
	using Aspose.OCR;

	// Initialize the OCR engine
	AsposeOcr recognitionEngine = new AsposeOcr();

	// Initialize recognition settings
	DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

	// Specify language for OCR. Multi-language by default
	recognitionSettings.Language = Language.Eng;

	// Recognize text from PDF
	List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

	// Save the recognized text as DOCX
	AsposeOcr.SaveMultipageDocument("C:\\Files\\OCR_result.docx", SaveFormat.Docx, results);

OCR PDF and Convert PDF to JSON in C#

We can perform OCR on PDF documents and save the recognized text in a JSON file by following the steps mentioned earlier. The only difference is that we specify SaveFormat.Json in the last step.

The following sample code shows how to OCR PDF and save the recognized text as a JSON file in C#.

	// This code example demonstrates how to OCR PDF documents and save the recognized text as JSON.
	using System.Collections.Generic;
	using Aspose.OCR;

	// Initialize the OCR engine
	AsposeOcr recognitionEngine = new AsposeOcr();

	// Initialize recognition settings
	DocumentRecognitionSettings recognitionSettings = new DocumentRecognitionSettings();

	// Specify language for OCR. Multi-language by default
	recognitionSettings.Language = Language.Eng;

	// Recognize text from PDF
	List<RecognitionResult> results = recognitionEngine.RecognizePdf("C:\\Files\\sample.pdf", recognitionSettings);

	// Save the recognized text as JSON
	AsposeOcr.SaveMultipageDocument("C:\\Files\\OCR_result.json", SaveFormat.Json, results);
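
Before passing the saved file to other tools, it may be worth confirming that it parses as JSON. The sketch below makes no assumption about the structure of the output and only checks that it is well-formed. JsonOutputCheck and IsValidJson are hypothetical names, and System.Text.Json is assumed to be available (it ships with .NET Core 3.0 and later and is a separate NuGet package on .NET Framework).

```csharp
using System;
using System.IO;
using System.Text.Json;

public static class JsonOutputCheck
{
    // Hypothetical helper: returns true if the text parses as well-formed JSON.
    public static bool IsValidJson(string text)
    {
        try
        {
            using (JsonDocument.Parse(text))
            {
                return true;
            }
        }
        catch (JsonException)
        {
            return false;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Same output path as in the sample code.
        string outputPath = "C:\\Files\\OCR_result.json";
        if (!File.Exists(outputPath))
        {
            Console.WriteLine("File not found: " + outputPath);
            return;
        }
        bool valid = JsonOutputCheck.IsValidJson(File.ReadAllText(outputPath));
        Console.WriteLine(valid ? "Output is well-formed JSON." : "Output is not valid JSON.");
    }
}
```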

Get a Free Evaluation License

You can get a free temporary license to try the library without evaluation limitations.
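
Once a license file has been obtained, it is typically applied before any recognition calls. The following is a minimal sketch, assuming the License class and its SetLicense(string) method from the Aspose.OCR namespace. The file name Aspose.OCR.lic is a placeholder and not a path given in this article.

```csharp
using System;
using Aspose.OCR;

try
{
    // Placeholder file name (assumption): replace with the path to your own license file.
    License license = new License();
    license.SetLicense("Aspose.OCR.lic");
    Console.WriteLine("License applied.");
}
catch (Exception ex)
{
    // Report the problem instead of stopping the program.
    Console.WriteLine("Could not apply the license: " + ex.Message);
}
```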

Conclusion

In this article, we have learned how to perform OCR on PDF documents and extract text from PDF in C#. We have also seen how to save the recognized text as TXT, DOCX, and JSON files. You can learn more about the Aspose.OCR for .NET API in the documentation. If you have any questions, please feel free to contact us on our forum.
