Extract Text from Word Documents in Java

You may need to extract text from Word documents programmatically, for example for text analysis or to pull out particular sections. This article provides a quick and easy-to-implement method to extract text from Word documents in Java. It also shows how to extract the text between different sections or elements of a Word document.

Java Library to Extract Text from Word Documents

Aspose.Words for Java is a powerful library that allows you to create MS Word documents from scratch. It also lets you manipulate existing Word documents for encryption, conversion, text extraction, etc. We will use this library to extract text from Word DOCX or DOC documents. You can download the API's JAR or install it using the following Maven configuration.

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>22.6</version>
    <type>pom</type>
</dependency>

Understanding a Word Document’s Structure

An MS Word document consists of various elements, which include paragraphs, tables, images, etc. Therefore, the requirements of text extraction could vary from one scenario to another. For example, you may need to extract text between paragraphs, bookmarks, comments, etc.

Each type of element in a Word DOC/DOCX is represented as a node. Therefore, to process a document, you work with its nodes. So let's begin and see how to extract text from Word documents in different scenarios.
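To make the node idea concrete, here is a minimal plain-Java sketch of a document modeled as a tree of nodes. The Node class and the collectText method are hypothetical stand-ins written for this illustration; they are not the Aspose.Words types, and the real document model is richer than this.

```java
import java.util.ArrayList;
import java.util.List;

public class NodeTreeSketch {
    /** Hypothetical stand-in for a document node; NOT the Aspose.Words Node class. */
    static class Node {
        final String type;   // e.g. "Body", "Paragraph", "Table", "Run"
        final String text;   // only leaf nodes carry text in this model
        final List<Node> children = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        /** Appends a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Depth-first walk that concatenates leaf text in document order. */
    static String collectText(Node node) {
        if (node.children.isEmpty()) {
            return node.text == null ? "" : node.text;
        }
        StringBuilder sb = new StringBuilder();
        for (Node child : node.children) {
            sb.append(collectText(child));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node body = new Node("Body", null)
            .add(new Node("Paragraph", null)
                .add(new Node("Run", "Hello, "))
                .add(new Node("Run", "world.")))
            .add(new Node("Table", null)
                .add(new Node("Paragraph", null)
                    .add(new Node("Run", " Cell text"))));
        System.out.println(collectText(body));
    }
}
```

The point of the sketch is only that paragraphs, tables, and runs all sit in one tree, so extracting text amounts to choosing which part of the tree to walk.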

Extract Text from a Word DOC in Java

In this section, we will implement a Java text extractor for Word documents. The workflow of text extraction is as follows:

First, we will define the nodes that we want to include in the text extraction process.
Then, we will extract the content between the specified nodes (including or excluding the starting and ending nodes).
Finally, we will use the cloned nodes, e.g. to create a new Word document consisting of the extracted content.

Let’s now write a method named extractContent to which we will pass the nodes and some other parameters to perform the text extraction. This method will parse the document and clone the nodes. The following are the parameters that we will pass to this method.

startNode and endNode: the starting and ending points of the extraction, respectively. These can be block-level nodes (Paragraph, Table) or inline-level nodes (e.g. Run, FieldStart, BookmarkStart).
1. To pass a field, pass the corresponding FieldStart object.
2. To pass bookmarks, pass the BookmarkStart and BookmarkEnd nodes.
3. For comments, use the CommentRangeStart and CommentRangeEnd nodes.
isInclusive: defines whether the markers are included in the extraction. If this option is set to false and the same node or consecutive nodes are passed, an empty list is returned.
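The effect of isInclusive is easiest to see on a simplified model. The sketch below is not the extractContent method itself: it assumes the nodes are siblings in a flat list and skips cloning entirely, and the extractBetween name is hypothetical. It only illustrates how the flag changes the result, including the empty-list case described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ExtractRangeSketch {
    /**
     * Simplified model (an assumption for illustration): returns the items
     * between start and end in a flat list. The real method works on a node
     * tree and clones nodes; this version only mirrors the isInclusive rule.
     */
    static <T> List<T> extractBetween(List<T> nodes, T start, T end, boolean isInclusive) {
        int from = nodes.indexOf(start);
        int to = nodes.indexOf(end);
        if (from < 0 || to < 0) {
            throw new IllegalArgumentException("start and end must both be in the list");
        }
        if (from > to) {
            throw new IllegalArgumentException("end must not come before start");
        }
        if (!isInclusive) {
            from++; // skip the start marker
            to--;   // skip the end marker
        }
        if (from > to) {
            // Same node or consecutive nodes with isInclusive == false
            return Collections.emptyList();
        }
        return new ArrayList<>(nodes.subList(from, to + 1));
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("P1", "P2", "P3", "P4");
        System.out.println(extractBetween(nodes, "P1", "P3", true));  // markers included
        System.out.println(extractBetween(nodes, "P1", "P3", false)); // markers excluded
        System.out.println(extractBetween(nodes, "P2", "P3", false)); // consecutive nodes
        System.out.println(extractBetween(nodes, "P2", "P2", false)); // same node
    }
}
```

In this model the last two calls both produce an empty list, which matches the behavior stated for isInclusive set to false.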

The following is the complete implementation of the extractContent method that extracts the content between the nodes that are passed.

Some helper methods are also required by the extractContent method to accomplish the text extraction operation, which are given below.

Now we are ready to utilize these methods and extract text from a Word document.

Extract Text between Paragraphs of a Word Document

Let’s see how to extract content between two paragraphs in a Word DOCX document. The following are the steps to perform this operation in Java.

First, load the Word document using the Document class.
Get references to the starting and ending paragraphs using the Document.getFirstSection().getChild(NodeType.PARAGRAPH, int, boolean) method.
Call the extractContent(startPara, endPara, true) method to extract the nodes into an object.
Call the generateDocument(Document, extractedNodes) helper method to create a document consisting of the extracted content.
Finally, save the returned document using the Document.save(String) method.

The following code sample shows how to extract text between the 7th and 11th paragraphs (zero-based indices 6 and 10) in a Word DOCX in Java.

Extract Text between Different Nodes of Word Document

You can also extract content between different types of nodes. For demonstration, let’s extract content between a paragraph and a table and save it into a new Word document. The following are the steps to extract text between different nodes in a Word document in Java.

Load the Word document using the Document class.
Get references to the starting and ending nodes using the Document.getFirstSection().getChild(NodeType, int, boolean) method.
Call the extractContent(startNode, endNode, true) method to extract the nodes into an object.
Call the generateDocument(Document, extractedNodes) helper method to create a document consisting of the extracted content.
Save the returned document using the Document.save(String) method.

The following code sample shows how to extract text between a paragraph and a table in a Word DOCX using Java.

Get Text Between Paragraphs based on Styles

Let’s now check out how to extract content between paragraphs based on styles. For demonstration, we are going to extract content between the first “Heading 1” and the first “Heading 3” in the Word document. The following steps demonstrate how to achieve this in Java.

First, load the Word document using the Document class.
Then, extract paragraphs into an object using the paragraphsByStyleName(Document, "Heading 1") helper method.
Extract paragraphs into another object using the paragraphsByStyleName(Document, "Heading 3") helper method.
Call the extractContent(startPara, endPara, true) method and pass the first elements of both paragraph lists as the first and second parameters.
Call the generateDocument(Document, extractedNodes) helper method to create a document consisting of the extracted content.
Finally, save the returned document using the Document.save(String) method.
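The selection logic in the steps above can be modeled in plain Java. This is a sketch under stated assumptions: Para is a hypothetical record of a style name and text, the paragraphsByStyleName method here takes a plain list rather than a Document, and textsBetweenFirst is an invented name. None of it is the library's API; it only shows how picking the first match of each style defines the range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StyleRangeSketch {
    /** Hypothetical paragraph model: a style name plus text (not an Aspose.Words type). */
    static class Para {
        final String styleName;
        final String text;

        Para(String styleName, String text) {
            this.styleName = styleName;
            this.text = text;
        }
    }

    /** Plays the role of the paragraphsByStyleName helper: keeps paragraphs whose style matches. */
    static List<Para> paragraphsByStyleName(List<Para> doc, String styleName) {
        List<Para> matches = new ArrayList<>();
        for (Para p : doc) {
            if (p.styleName.equals(styleName)) {
                matches.add(p);
            }
        }
        return matches;
    }

    /**
     * Returns the text of every paragraph from the first startStyle paragraph
     * to the first endStyle paragraph, both included. Throws
     * IndexOutOfBoundsException if either style is absent, and
     * IllegalArgumentException if the end paragraph precedes the start.
     */
    static List<String> textsBetweenFirst(List<Para> doc, String startStyle, String endStyle) {
        Para start = paragraphsByStyleName(doc, startStyle).get(0);
        Para end = paragraphsByStyleName(doc, endStyle).get(0);
        int from = doc.indexOf(start);
        int to = doc.indexOf(end);
        if (from > to) {
            throw new IllegalArgumentException("end paragraph comes before start paragraph");
        }
        List<String> texts = new ArrayList<>();
        for (Para p : doc.subList(from, to + 1)) {
            texts.add(p.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Para> doc = Arrays.asList(
            new Para("Normal", "Intro"),
            new Para("Heading 1", "Chapter"),
            new Para("Normal", "Body A"),
            new Para("Heading 3", "Detail"),
            new Para("Normal", "Body B"),
            new Para("Heading 1", "Next chapter"));
        System.out.println(textsBetweenFirst(doc, "Heading 1", "Heading 3"));
    }
}
```

Only the first paragraph of each style is used, so the later "Heading 1" paragraph in the sample data does not affect the extracted range.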

The following code sample shows how to extract content between paragraphs based on styles.

Free Word Text Extractor for Java

You can get a free temporary license to extract text from Word documents without evaluation limitations.

Explore Java DOCX Library

You can explore other scenarios of extracting text from Word documents using this documentation article. You can also explore other features of Aspose.Words for Java using the documentation. If you have any questions, feel free to let us know via our forum.

Conclusion

In this article, you have learned how to extract text from Word documents in Java. Moreover, you have seen how to extract content between similar or different types of nodes in a DOC or DOCX programmatically. Thus, you can build your own MS Word text extractor in Java.
