Convert HTML to Plain Text or Extract Text from HTML in C#

Convert Extract HTML Text

HTML is a markup language that is popularly used in websites, web applications, etc. Contents in HTML are differentiated using different tags. You can easily convert HTML to plain text using C# language while ignoring the opening and closing tags. Likewise, there could be some cascading style sheets specifications, JavaScript, or any other section of the HTML as per your requirements. Therefore, it is like Extracting text from an HTML document. Let us walk through the following section for detailed information:

HTML to Plain Text Converter in C#

You can convert HTML to Plain Text with Aspose.HTML for .NET API. It can easily be configured in .NET Framework-based environment using the following NuGet installation command:

PM> Install-Package Aspose.Html

Convert HTML to Text File using INodeIterator in C#

Aspose.HTML for .NET API is based on DOM (Document Object Model) so you can iterate through different nodes of the API using the INodeIterator interface. Likewise, it gives you the control to define a NodeFilter and override methods for filtering contents. You can ignore the style, script, or any other element in the HTML file to pick only the text string out of it. Below are the steps to convert HTML to Plain TXT File using C#:

  1. Read input HTML file
  2. Initialize the instance of node iterator
  3. Create INodeIterator instance
  4. Check for Style Filter
  5. Read Node value in a string
  6. Write Text contents of HTML as TXT file

The code below shows how to convert HTML to Plain Text file using C#:

Extract Text from HTML with Different Approaches using C#

We have learned how to convert HTML to Text using INodeIterator. Likewise, there are several other approaches to achieve these requirements. You can create a custom method to extract text, or work with TextContent property as explained in the steps below:

  1. Load input HTML document
  2. Define a user-defined method
  3. Check each NodeType to see if it’s an element node or text node
  4. Get text elements using TextContent property
  5. Save output TXT file

The code snippet below explains these two different approaches for converting HTML to Plain Text in C#:

Convert URL Webpage HTML to Text using C#

We have already covered the offline webpage HTML to text conversion where the HTML file is required to exist by downloading it on the disk. Let us move another step further where you do not even need to save the HTML file on the disk. You can directly convert a webpage by specifying its URL, to a TXT file using C#. For instance, let us convert the product page of Aspose.HTML for .NET API to a TXT file with below steps:

  1. Initialize HTMLDocument object and specify the URL
  2. Read the text contents of the HTML format
  3. Write the TXT file with extracted text from webpage

The code below shows how to convert URL Webpage HTML to Text using C#:

Conclusion

In this article, we have discussed how to convert HTML files to Plain Text, extract text from HTML, as well as an online webpage URL to TXT file format using C#. You can use different approaches and feel free to get in touch with us via the Free Support Forum if you need any further assistance.

See Also

Convert HTML Webpage to Word File (DOCX/DOC) using C#