Extract Highlighted Text from PDF Documents and Remove all Text from PDF Pages


As part of our regular monthly update process, we are pleased to announce a new release of Aspose.PDF for .NET. Aspose.PDF for .NET 18.6 has been published and is available on the NuGet Gallery for use in .NET applications. This release introduces new features related to text manipulation and PDF/UA validation, and it fixes bugs reported in earlier versions of the API. An overview of the public API changes and improvements in this release is given on the release notes page of Aspose.PDF for .NET 18.6. If you plan to use the latest version of the API, we strongly recommend checking the release notes before downloading it.

Extract Highlighted Text from PDF Documents

Extracting highlighted text from PDF documents has been a frequently requested feature. Earlier, it was possible to extract text from PDF documents by matching a regular expression or by searching for a specified string. The TextFragmentAbsorber and TextAbsorber classes of the API have been widely and effectively used for that purpose.

To support the extraction of highlighted text specifically, we have investigated the feature and introduced the TextMarkupAnnotation.GetMarkedText() and TextMarkupAnnotation.GetMarkedTextFragments() methods in the API. You can extract highlighted text from a PDF document by filtering the page annotations for TextMarkupAnnotation and calling these methods. An example demonstrating the feature is also available in the API documentation.
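The filtering step described above can be sketched roughly as follows. This is a minimal illustration rather than the official documentation sample: the file name input.pdf is a placeholder, and the sketch assumes that annotations are exposed through each page's Annotations collection and that the types live in the Aspose.Pdf.Annotations and Aspose.Pdf.Text namespaces.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ExtractHighlightedTextSketch
{
    static void Main()
    {
        // "input.pdf" is a placeholder path chosen for this sketch.
        Document doc = new Document("input.pdf");

        foreach (Page page in doc.Pages)
        {
            foreach (Annotation annotation in page.Annotations)
            {
                // Keep only text markup annotations, as the text above describes.
                TextMarkupAnnotation markup = annotation as TextMarkupAnnotation;
                if (markup == null)
                {
                    continue;
                }

                // GetMarkedTextFragments() returns the fragments covered by the markup.
                foreach (TextFragment fragment in markup.GetMarkedTextFragments())
                {
                    Console.WriteLine(fragment.Text);
                }
            }
        }
    }
}
```

If only the plain string is needed, GetMarkedText() can presumably be called on the same annotation instead of iterating over the fragments.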

Remove All Text from PDF Document

To remove text from PDF documents with earlier versions of the API, you had to set each found text fragment to an empty string. This approach invokes a number of checks and text position adjustments for every fragment, which is why several performance issues were observed during such operations. We could not reduce the number of checks and adjustments, because they are essential in text editing scenarios. Moreover, you cannot determine in advance how many text fragments will be removed and adjusted when they are processed in a loop.

In Aspose.PDF for .NET 18.6, the new Aspose.Pdf.Operators.TextShowOperator() has been introduced in order to remove all text from PDF pages. We recommend using it whenever you need to remove all text from a PDF document, as it takes much less time than the earlier approach. A code sample for this feature is available in our API documentation.
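One way this could look in practice is sketched below. The sketch assumes that TextShowOperator is passed to an OperatorSelector, which collects the text-showing operators of a page so that they can be deleted from the page contents in one call. The OperatorSelector usage, the Accept and Delete calls, and the file names are assumptions made for this illustration and should be checked against the API reference.

```csharp
using Aspose.Pdf;

class RemoveAllTextSketch
{
    static void Main()
    {
        // "input.pdf" and "output.pdf" are placeholder paths chosen for this sketch.
        Document doc = new Document("input.pdf");

        // Page numbering in Aspose.PDF starts at 1.
        for (int i = 1; i <= doc.Pages.Count; i++)
        {
            Page page = doc.Pages[i];

            // Assumption: a selector built on TextShowOperator matches every
            // text-showing operator in the page content stream.
            OperatorSelector selector =
                new OperatorSelector(new Aspose.Pdf.Operators.TextShowOperator());

            // Collect the matching operators, then delete them together.
            page.Contents.Accept(selector);
            page.Contents.Delete(selector.Selected);
        }

        doc.Save("output.pdf");
    }
}
```

Because the operators are removed directly from the content stream, no fragment is edited individually, which is consistent with the performance benefit described above.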

Important API Changes

In the latest release of Aspose.PDF for .NET, all descendants of Aspose.Pdf.Operator have been moved into the Aspose.Pdf.Operators namespace. Thus ‘new Aspose.Pdf.Operators.GSave()’ should be used instead of ‘new Aspose.Pdf.Operator.GSave()’. When upgrading to the latest version of the API, you will need to update any existing code that uses the previous Aspose.Pdf.Operator naming.
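As a rough before-and-after illustration of this change, the sketch below shows the GSave case from the paragraph above. It assumes that the base type Aspose.Pdf.Operator itself is unchanged and can still be used as the variable type.

```csharp
using Aspose.Pdf;

class OperatorNamespaceSketch
{
    static void Main()
    {
        // Before 18.6 (old naming, expected to fail to compile after upgrading):
        // Operator save = new Aspose.Pdf.Operator.GSave();

        // From 18.6 onward: the descendant is created from the Operators namespace.
        Operator save = new Aspose.Pdf.Operators.GSave();
    }
}
```

Adding a using Aspose.Pdf.Operators; directive would allow the shorter form new GSave() in code that creates many operators.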

Miscellaneous

In Aspose.PDF for .NET 18.6, we have also started introducing accessibility features as part of our work on Section 508 compliance (WCAG):

  • PDF/UA validation feature was added.
  • Tagged PDF support was added.

The features listed above are still under development and are available in Aspose.PDF for .NET 18.6 as a beta version.

We always recommend using the latest release of our APIs, so we suggest downloading Aspose.PDF for .NET 18.6 and checking the following resources, which will help you work with the API:


To keep up with our news, you can follow us on Twitter or follow our Facebook page.