Extract text based on columns, Identify Image colorspace, Create multi-layer PDF, disable file compression when adding as embedded resources with Aspose.Pdf for Java 10.0.0


The new release of Aspose.Pdf for Java 10.0.0 has been published. Like previous versions, this release contains new features that are already supported in its sibling product, Aspose.Pdf for .NET. The details of the new features and enhancements introduced with this release follow.

Identify if image in PDF is Colored or Black & White

A PDF file may consist of text, images, attachments, graphs, annotations and other elements, and Aspose.Pdf for Java provides features to add images to an existing PDF file as well as manipulate them. Different types of compression can be applied to images to reduce their size. The type of compression to apply depends upon the ColorSpace of the source image: if the image is Color (RGB), apply JPEG2000 compression, and if it is Black & White, apply JBIG2/JBIG2000 compression. Identifying each image type and using an appropriate type of compression therefore produces the best optimized output. You may come across a requirement to determine the image color space and apply the appropriate compression to reduce PDF file size. For related information, please see Identify if image inside PDF is Colored or Black & White.

The following code snippet shows how to identify whether an image inside a PDF is colored (RGB) or grayscale.

// read source PDF file
com.aspose.pdf.Document document = new com.aspose.pdf.Document("c:/pdftest/test4.pdf");
try
{
    // iterate through all pages of PDF file
    for (com.aspose.pdf.Page page : (Iterable<com.aspose.pdf.Page>) document.getPages())
    {
        // create Image Placement Absorber instance and collect the images on this page
        com.aspose.pdf.ImagePlacementAbsorber abs = new com.aspose.pdf.ImagePlacementAbsorber();
        page.accept(abs);
        for (com.aspose.pdf.ImagePlacement ia : (Iterable<com.aspose.pdf.ImagePlacement>) abs.getImagePlacements())
        {
            // ColorType value reported for the image
            int colorType = ia.getImage().getColorType();
            switch (colorType)
            {
                case ColorType.Grayscale:
                    System.out.println("Grayscale Image");
                    break;
                case ColorType.Rgb:
                    System.out.println("Colored Image");
                    break;
            }
        }
    }
}
catch (Exception ex)
{
    System.out.println("Error reading file = " + document.getFileName());
}
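
Once the color type is known, the compression rule described above can be written as a small lookup. The following is only a minimal sketch: the `CompressionAdvisor` class, its `ImageColor` enum and the returned strings are hypothetical names for illustration and are not part of the Aspose.Pdf API. The article gives no recommendation for grayscale images, so the sketch returns "unspecified" for them.

```java
// Hypothetical helper, NOT part of Aspose.Pdf: maps a detected color type
// to the compression family suggested in the text above.
final class CompressionAdvisor {

    /** Simplified stand-in for the color types an image can report (assumption). */
    enum ImageColor { RGB, GRAYSCALE, BLACK_AND_WHITE, UNDEFINED }

    /** Returns the suggested compression name, or "unspecified" when the text gives no rule. */
    static String suggestCompression(ImageColor color) {
        switch (color) {
            case RGB:
                return "JPEG2000";   // color images
            case BLACK_AND_WHITE:
                return "JBIG2";      // bi-level images
            default:
                return "unspecified"; // no recommendation stated for other types
        }
    }

    public static void main(String[] args) {
        // Print the suggestion for every color type known to this sketch
        for (ImageColor c : ImageColor.values()) {
            System.out.println(c + " -> " + suggestCompression(c));
        }
    }
}
```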

Extract text based on columns

If you have a multi-column PDF document and need to extract the page contents while honoring the same layout, Aspose.Pdf for Java is the right choice to accomplish this requirement. One approach is to reduce the font size of the contents inside the PDF document and then perform text extraction. For further details, please visit Extract text based on columns.

The following code snippet can be used to fulfill this requirement.

String path = "D:\\Temp\\";
// instantiate Document instance with path of input file as argument
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(path + "net_New-age NED's.pdf");
// create TextFragment Absorber instance
com.aspose.pdf.TextFragmentAbsorber tfa = new com.aspose.pdf.TextFragmentAbsorber();
pdfDocument.getPages().accept(tfa);
// create TextFragment Collection instance
com.aspose.pdf.TextFragmentCollection tfc = tfa.getTextFragments();
for (com.aspose.pdf.TextFragment tf : (Iterable<com.aspose.pdf.TextFragment>) tfc)
{
    // scale the font size down to 70% of its original value
    tf.getTextState().setFontSize(tf.getTextState().getFontSize() * 0.7f);
}
// temporarily save the file, then reload it
pdfDocument.save("c:/pdftest/TempOutput.pdf");
pdfDocument = new com.aspose.pdf.Document("c:/pdftest/TempOutput.pdf");

// extract the text from the reloaded document
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();

// create a writer and open the output file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File("c:/pdftest/Extracted_text.txt"));
// write the extracted text to the file
writer.write(extractedText);
// close the stream
writer.close();

In this new release, we have also introduced several improvements to TextAbsorber and to the internal text formatting mechanism. During text extraction in 'Pure' mode, you can now call the setScaleFactor(..) method, which offers another way to extract text from a multi-column PDF document besides the approach stated above. The scale factor adjusts the grid used by the internal text formatting mechanism during text extraction. Specifying a ScaleFactor value between 1 and 0.1 (including 0.1) has the same effect as reducing the font size.

A ScaleFactor value between 0.1 and -0.1 is treated as zero, which makes the algorithm calculate the required scale factor automatically during text extraction. The calculation is based on the average glyph width of the most popular font on the page, but we cannot guarantee that no string in one column of the extracted text reaches the start of the next column. Please note that if no ScaleFactor value is specified, the default value of 1.0 is used, which means no scaling is carried out. If the specified ScaleFactor value is more than 10 or less than -0.1, the default value of 1.0 is used as well.

We recommend auto-scaling (ScaleFactor = 0) when processing a large number of PDF files for text content extraction, or manually setting a redundant reduction of the grid width (about ScaleFactor = 0.5). Either way, you do not need to determine whether scaling is necessary for a particular document: if you set a redundant reduction of the grid width for a document that does not need it, the extracted text content remains fully adequate. Please take a look at the following code snippet.

Document pdfDocument = new Document("inputFile.pdf");
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
// Setting the scale factor to 0.5 is enough to split columns in the majority of documents
// Setting it to zero lets the algorithm choose the scale factor automatically
textAbsorber.getExtractionOptions().setScaleFactor(0.5);
pdfDocument.getPages().accept(textAbsorber);
String extractedText = textAbsorber.getText();
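
The ScaleFactor ranges described above can be summarized in a small stand-alone sketch. It is not Aspose.Pdf code: the `ScaleFactorRules` class and its `Mode` enum are hypothetical names that only restate the documented rules. Two points are assumptions, because the text does not spell them out: a value of exactly -0.1 is treated as part of the auto-scaling range, and values above 1 and up to 10 are passed through as given.

```java
// Hypothetical illustration, NOT part of Aspose.Pdf: restates how a requested
// ScaleFactor value is interpreted according to the rules described in the text.
final class ScaleFactorRules {

    enum Mode { AUTO, AS_GIVEN, DEFAULT }

    static final double DEFAULT_SCALE = 1.0;

    /** Classifies a requested value; null means "not specified". */
    static Mode classify(Double requested) {
        if (requested == null || requested.isNaN()) {
            return Mode.DEFAULT;              // nothing usable specified: 1.0, no scaling
        }
        double v = requested;
        if (v > 10.0 || v < -0.1) {
            return Mode.DEFAULT;              // out of range: falls back to 1.0
        }
        if (v < 0.1) {
            return Mode.AUTO;                 // treated as zero: automatic scale factor
        }
        return Mode.AS_GIVEN;                 // 0.1 and above (assumption for values above 1)
    }

    /** Returns the value the rules would apply: 0.0 stands for automatic scaling. */
    static double effectiveScale(Double requested) {
        switch (classify(requested)) {
            case AUTO:     return 0.0;
            case AS_GIVEN: return requested;
            default:       return DEFAULT_SCALE;
        }
    }

    public static void main(String[] args) {
        Double[] samples = { null, 0.5, 0.0, 0.1, -0.1, 11.0, -0.2 };
        for (Double s : samples) {
            System.out.println(s + " -> " + classify(s) + " (" + effectiveScale(s) + ")");
        }
    }
}
```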

Create Multi-layer PDF file

Layers can be used in PDF documents in many ways. You might have a multi-lingual file that you want to distribute, with the text in each language appearing on a different layer and the background design on a separate layer. You might also create documents with animation that appears on a separate layer. Another example is a license agreement added to your file, where you do not want users to view the content until they agree to the terms of the agreement. The following code snippet creates a PDF file with three layers, each containing a line of a different color.

Document doc = new Document();
Page page = doc.getPages().add();
Layer layer = new Layer("oc1", "Red Line");
layer.getContents().add(new Operator.SetRGBColorStroke(1, 0, 0));
layer.getContents().add(new Operator.MoveTo(500, 700));
layer.getContents().add(new Operator.LineTo(400, 700));
layer.getContents().add(new Operator.Stroke());
page.setLayers(new ArrayList<Layer>());
page.getLayers().add(layer);
layer = new Layer("oc2", "Green Line");
layer.getContents().add(new Operator.SetRGBColorStroke(0, 1, 0));
layer.getContents().add(new Operator.MoveTo(500, 750));
layer.getContents().add(new Operator.LineTo(400, 750));
layer.getContents().add(new Operator.Stroke());
page.getLayers().add(layer);
layer = new Layer("oc3", "Blue Line");
layer.getContents().add(new Operator.SetRGBColorStroke(0, 0, 1));
layer.getContents().add(new Operator.MoveTo(500, 800));
layer.getContents().add(new Operator.LineTo(400, 800));
layer.getContents().add(new Operator.Stroke());
page.getLayers().add(layer);
doc.save("output.pdf");

As well as the enhancements and features discussed above, there have been numerous fixes related to the recently introduced PDF to DOC conversion, PDF to Excel conversion, PDF to HTML conversion, PDF to PDF/A conversion, XPS to PDF conversion, PDF to TIFF conversion, text replacement, text extraction, rendering PDF files to XPS and creating TOCs in PDF files. Please download and try the latest Aspose.Pdf for Java 10.0.0 release.