Convert PDF to HTML with Embedded Resources in .NET

PDF is one of the most popular document formats and is used by a wide variety of applications as their final output. Because it supports a wide range of data types and is portable, it is the format of choice for creating and sharing content. As a .NET developer building document management applications, you may want to add processing features that read PDF documents and convert them to other file formats such as HTML.

In this post, we’ll explore the conversion feature of the Aspose.PDF for .NET API and demonstrate how to read and convert a PDF file to HTML with several options.

Convert PDF to HTML using C#

Aspose.PDF for .NET API lets you read and convert PDF files to HTML in your .NET applications. It is simple to use: the basic conversion takes just two lines of code, one to load the document and one to save it.

// The path to the documents directory.
string dataDir = RunExamples.GetDataDir_AsposePdf_DocumentConversion();

// Open the source PDF document
Document pdfDocument = new Document(dataDir + "PDFToHTML.pdf");

// Save the document in HTML format
pdfDocument.Save(dataDir + "output_out.html", SaveFormat.Html);

It is that simple to convert PDF to HTML in your C# applications. The API takes care of reading all the internal details of the PDF file format and converting them to HTML. You don’t need a PDF reader installed on your machine or on any other computer where your application will eventually run.

Convert PDF to HTML with Embedded Resources

You can also convert PDF to HTML with all the resources included in the output HTML. This embeds all the elements of the PDF file (images, CSS, and fonts) into a single HTML file. You do this by setting the HtmlSaveOptions.PartsEmbeddingMode property to a value of the HtmlSaveOptions.PartsEmbeddingModes enumeration, as shown in the following code sample.

// Load source PDF file
Document doc = new Document("input.pdf");
// Instantiate HTML Save options object
HtmlSaveOptions newOptions = new HtmlSaveOptions();

// Enable option to embed all resources inside the HTML
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;

// This is just optimization for IE and can be omitted 
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
// Output file path 
string outHtmlFile = "SingleHTML_out.html";
doc.Save(outHtmlFile, newOptions);
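
If you want a quick sanity check that resources really ended up inside the HTML, one option is to count Base64 data URIs in the output. The sketch below is not part of the Aspose.PDF API; it uses only the .NET base class library, and the class and method names (EmbeddedResourceCheck, CountDataUris) are made up for illustration. It assumes embedded resources appear as "data:...;base64," URIs, which is the usual way HTML inlines binary content; the exact markup the converter emits may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmbeddedResourceCheck
{
    // Counts Base64 data URIs (for example "data:image/png;base64,...") in an HTML string.
    // Hypothetical helper, not part of Aspose.PDF.
    public static int CountDataUris(string html)
    {
        // Match "data:", then a media type with optional parameters, then ";base64,".
        return Regex.Matches(html, @"data:[^,""'\s]*;base64,", RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        // Stand-in markup for illustration; in practice you would read the converted
        // file instead, e.g. System.IO.File.ReadAllText("SingleHTML_out.html").
        string html = "<img src=\"data:image/png;base64,iVBORw0KGgo=\">" +
                      "<img src=\"images/logo.png\">";

        // One image is embedded, the other is a file reference, so this prints 1.
        Console.WriteLine(CountDataUris(html));
    }
}
```

A count of zero on a real output file would suggest that resources were written next to the HTML rather than embedded in it.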

Saving Images to Specific Folder

A PDF document can contain images in addition to text. An HTML file can either contain images that are Base64-encoded inside the HTML or reference images from a folder where they reside. The Aspose.PDF API can save images to a user-specified folder on disk. The following code sample shows how to save images to a specific folder during conversion of PDF to HTML.

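// Open the source PDF document (dataDir is the documents directory from the earlier sample)
Document pdfDocument = new Document(dataDir + "PDFToHTML.pdf");

// Instantiate HTML SaveOptions object
HtmlSaveOptions newOptions = new HtmlSaveOptions();
// Specify the separate folder to save images
newOptions.SpecialFolderForAllImages = dataDir;

// Save the document (the output file name here is illustrative)
pdfDocument.Save(dataDir + "ImagesInFolder_out.html", newOptions);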

Convert PDF to Multipage HTML

The API has many more options to control the resultant HTML. For example, you can split the output HTML into multiple pages during PDF to HTML conversion using the following sample code.

// The path to the documents directory.
string dataDir = RunExamples.GetDataDir_AsposePdf_DocumentConversion();

// Open the source PDF document
Document pdfDocument = new Document(dataDir + "PDFToHTML.pdf");

// Instantiate HTML SaveOptions object
HtmlSaveOptions htmlOptions = new HtmlSaveOptions();

// Specify to split the output into multiple pages
htmlOptions.SplitIntoPages = true;

// Save the document
pdfDocument.Save(@"MultiPageHTML_out.html", htmlOptions);

Setting the SplitIntoPages flag to true is all that is needed: the output HTML then consists of multiple pages instead of a single page.
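
To see what a split conversion actually produced, you can list the HTML files in the output folder. This post does not state how the individual page files are named, so the sketch below makes no assumption about that and simply lists every .html file in a folder. It uses only the .NET base class library; SplitOutputListing and ListHtmlFiles are hypothetical names, not Aspose.PDF members.

```csharp
using System;
using System.IO;
using System.Linq;

public static class SplitOutputListing
{
    // Returns the .html files found directly in a folder, sorted by name.
    // Hypothetical helper, not part of Aspose.PDF.
    public static string[] ListHtmlFiles(string folder)
    {
        return Directory.GetFiles(folder, "*.html")
                        .OrderBy(path => path, StringComparer.Ordinal)
                        .ToArray();
    }

    public static void Main()
    {
        // Assumes the conversion wrote its output to the current directory;
        // pass a different folder if you saved elsewhere.
        foreach (string file in ListHtmlFiles(Directory.GetCurrentDirectory()))
        {
            Console.WriteLine(Path.GetFileName(file));
        }
    }
}
```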

Still want more? Head over to the PDF to HTML section of the API documentation, which lists some advanced features for applying more options during conversion. Download your free copy of Aspose.PDF for .NET and you can get started in no time by following the API documentation. If you have any questions, feel free to post them to the Aspose.PDF forum. We’ll be glad to assist you.