1. Overview
In this tutorial, we’ll replace a pattern in various locations of a Word document. We’ll work with both .doc and .docx files.
2. The Apache POI Library
The Apache POI library provides Java APIs for manipulating various file formats used by Microsoft Office applications, such as Excel spreadsheets, Word documents, and PowerPoint presentations. It lets us read, write, and modify such files programmatically.
To edit .docx files, we’ll add the latest version of poi-ooxml to our pom.xml:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.5</version>
</dependency>
We’ll also need the latest version of poi-scratchpad to deal with .doc files:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.2.5</version>
</dependency>
3. File Handling
We want to create example files, read them, replace some text in them, and then write the resulting file. Let’s cover everything that concerns file handling first.
3.1. Example Files
Let’s create a Word document. We’ll want to replace the word Baeldung in it with the word Hello. Thus, we’ll write Baeldung in multiple locations of the file, notably in a table, in various document sections, and in paragraphs. We also want to use diverse formatting styles, including one occurrence with a format change inside the word. We’ll save the same document once as a .doc file and once as a .docx file.
3.2. Reading the Input File
First, we need to read the file. We’ll put it in the resources folder to make it available in the classpath. This way, we’ll get an InputStream. For a .doc document, we’ll create a POIFSFileSystem object based on this InputStream. Lastly, we can retrieve the HWPFDocument object we’ll modify. We’ll use a try-with-resources so that the InputStream and POIFSFileSystem objects are closed automatically. However, as we’ll make modifications to the HWPFDocument, we’ll close it manually:
public void replaceText() throws IOException {
    String filePath = getClass().getClassLoader()
      .getResource("baeldung.doc")
      .getPath();
    try (InputStream inputStream = new FileInputStream(filePath);
         POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream)) {
        HWPFDocument doc = new HWPFDocument(fileSystem);
        // replace text in doc and save changes
        doc.close();
    }
}
When dealing with a .docx document, it’s slightly more straightforward, as we can directly derive an XWPFDocument object from the InputStream:
public void replaceText() throws IOException {
    String filePath = getClass().getClassLoader()
      .getResource("baeldung.docx")
      .getPath();
    try (InputStream inputStream = new FileInputStream(filePath)) {
        XWPFDocument doc = new XWPFDocument(inputStream);
        // replace text in doc and save changes
        doc.close();
    }
}
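One caveat with URL.getPath() is that it returns the path in its URL-encoded form, so a project directory containing a space shows up as %20 and FileInputStream can’t find the file. As a minimal sketch, assuming the resource lives on the file system rather than inside a JAR, we could convert the URL through a URI instead. The ResourcePaths class and its method names are hypothetical helpers, not part of Apache POI:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ResourcePaths {

    // Converts a file: URL (for example, one returned by ClassLoader.getResource())
    // into a Path, decoding escapes such as %20 along the way.
    static Path toPath(URL resourceUrl) throws URISyntaxException {
        return Paths.get(resourceUrl.toURI());
    }

    // Opens the resource for reading; the caller is responsible for closing
    // the stream, typically with try-with-resources.
    static InputStream open(URL resourceUrl) throws IOException, URISyntaxException {
        return Files.newInputStream(toPath(resourceUrl));
    }
}
```

With such a helper, the stream in replaceText() could come from ResourcePaths.open(getClass().getClassLoader().getResource("baeldung.docx")), and the rest of the method would stay the same.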
3.3. Writing the Output File
We’ll write the output document into the same file. Since we resolved its path through the classpath, the modified file will be located in the target folder. The HWPFDocument and XWPFDocument classes both expose a write() method to write the document to an OutputStream. For instance, for a .doc document, it all boils down to:
private void saveFile(String filePath, HWPFDocument doc) throws IOException {
    try (FileOutputStream out = new FileOutputStream(filePath)) {
        doc.write(out);
    }
}
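Because we overwrite the file we read from, an exception thrown halfway through write() could leave a truncated document behind. If that matters for our use case, one option is to write to a temporary file first and move it over the target only once the write has succeeded. The following is a sketch under that assumption. SafeSave and StreamWriter are hypothetical names, and the code relies only on the JDK:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SafeSave {

    // Hypothetical callback type: anything that can write itself to a stream.
    // A method reference such as doc::write is expected to fit this shape.
    @FunctionalInterface
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    // Writes to a temporary file in the target's directory, then moves it over
    // the target, so a failed write leaves the original file untouched.
    static void save(Path target, StreamWriter writer) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "docsave", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                writer.writeTo(out);
            }
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up after a failed write; does nothing after a successful move
            Files.deleteIfExists(tmp);
        }
    }
}
```

In this setup, saveFile() could delegate to SafeSave.save(Paths.get(filePath), doc::write) for either document type, since both write() methods take an OutputStream and may throw an IOException.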
4. Replacing Text in a .docx Document
Let’s try to replace the occurrences of the word Baeldung in the .docx document and see what challenges we face in the process.
4.1. Naive Implementation
We’ve already parsed the document into an XWPFDocument object. An XWPFDocument is divided into various paragraphs. The paragraphs in the body of the document are available directly. However, to access the ones inside a table, we need to loop over all the rows and cells of the tables. We’ll write the replaceTextInParagraph() method later on. For now, here is how we’ll apply it repeatedly to all the paragraphs:
private XWPFDocument replaceText(XWPFDocument doc, String originalText, String updatedText) {
    replaceTextInParagraphs(doc.getParagraphs(), originalText, updatedText);
    for (XWPFTable tbl : doc.getTables()) {
        for (XWPFTableRow row : tbl.getRows()) {
            for (XWPFTableCell cell : row.getTableCells()) {
                replaceTextInParagraphs(cell.getParagraphs(), originalText, updatedText);
            }
        }
    }
    return doc;
}

private void replaceTextInParagraphs(List<XWPFParagraph> paragraphs, String originalText, String updatedText) {
    paragraphs.forEach(paragraph -> replaceTextInParagraph(paragraph, originalText, updatedText));
}
In Apache POI, paragraphs are divided into XWPFRun objects. As a first attempt, let’s iterate over all the runs: if we detect the text we want to replace inside a run, we’ll update the content of that run:
private void replaceTextInParagraph(XWPFParagraph paragraph, String originalText, String updatedText) {
    List<XWPFRun> runs = paragraph.getRuns();
    for (XWPFRun run : runs) {
        String text = run.getText(0);
        if (text != null && text.contains(originalText)) {
            String updatedRunText = text.replace(originalText, updatedText);
            run.setText(updatedRunText, 0);
        }
    }
}
To conclude, we’ll update replaceText() to include all the steps:
public void replaceText() throws IOException {
    String filePath = getClass().getClassLoader()
      .getResource("baeldung-copy.docx")
      .getPath();
    try (InputStream inputStream = new FileInputStream(filePath)) {
        XWPFDocument doc = new XWPFDocument(inputStream);
        doc = replaceText(doc, "Baeldung", "Hello");
        saveFile(filePath, doc);
        doc.close();
    }
}
Let’s now run this code, for instance, through a unit test. We can have a look at a screenshot of the updated document:
4.2. Limitations
As we can see in the screenshot, most occurrences of the word Baeldung have been replaced with the word Hello. However, two occurrences of Baeldung remain.
Let’s take a closer look at what an XWPFRun is. Each run represents a continuous sequence of text with a common set of formatting properties. The formatting properties include font style, size, color, boldness, italics, underlining, etc. Whenever there is a format change, there is a new run. This is why the occurrence with mixed formatting in the table is not replaced: its content is spread over multiple runs.
However, the bottom blue Baeldung occurrence wasn’t replaced either. Indeed, Apache POI doesn’t guarantee that characters with the same formatting properties are part of the same run. In a nutshell, the naive implementation is good enough for the simplest cases. It’s worth using in such cases because it doesn’t involve any complex decisions. However, if we’re confronted with this limitation, we’ll need to move toward another solution.
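To make the limitation concrete, here is a small sketch that models the runs of a paragraph as plain strings. It doesn’t use Apache POI at all, and RunSplitDemo and its method names are made up for illustration. It contrasts replacing inside each run independently, as the naive implementation does, with replacing in the concatenated text:

```java
import java.util.ArrayList;
import java.util.List;

class RunSplitDemo {

    // Mirrors the naive approach: each run is searched and updated on its own,
    // so a word split across two runs is never seen as a whole.
    static List<String> replacePerRun(List<String> runs, String original, String updated) {
        List<String> result = new ArrayList<>();
        for (String run : runs) {
            result.add(run.replace(original, updated));
        }
        return result;
    }

    // Joins the runs first, so the search sees the paragraph text as a whole.
    static String replaceInJoinedText(List<String> runs, String original, String updated) {
        return String.join("", runs).replace(original, updated);
    }
}
```

For the runs "Visit ", "Bael", "dung", and " today", replacePerRun() leaves every run unchanged when asked to replace Baeldung with Hello, whereas replaceInJoinedText() returns Visit Hello today.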
4.3. Dealing With Text Spread Over Multiple Character Runs
For the sake of simplicity, we’ll make the following assumption: it’s acceptable for us to lose the formatting of a paragraph when we find the word Baeldung inside it. Thus, we can remove all existing runs inside the paragraph and replace them with a single new one. Let’s rewrite replaceTextInParagraph():
private void replaceTextInParagraph(XWPFParagraph paragraph, String originalText, String updatedText) {
    String paragraphText = paragraph.getParagraphText();
    if (paragraphText.contains(originalText)) {
        String updatedParagraphText = paragraphText.replace(originalText, updatedText);
        while (!paragraph.getRuns().isEmpty()) {
            paragraph.removeRun(0);
        }
        XWPFRun newRun = paragraph.createRun();
        newRun.setText(updatedParagraphText);
    }
}
Let’s have a look at the result file:
As expected, every occurrence is now replaced. However, most of the formatting is lost. The formatting of the last occurrence is preserved, though: in this case, it seems that Apache POI handles the formatting properties differently.
As a last remark, let’s note that depending on our use case, we could also decide to keep some formatting of the original paragraph. We’d then need to iterate over all the runs and keep or update properties as we like.
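One possible way to keep the formatting is to leave the runs in place and only rewrite their text: the replacement goes into the first run that the match touches, and the matched characters are trimmed from the following runs. The sketch below shows this idea on plain strings standing in for the text of each run. It doesn’t call Apache POI, and RunPreservingReplace is a hypothetical name. Mapping it back onto the getText() and setText() methods of XWPFRun is left as an assumption:

```java
import java.util.ArrayList;
import java.util.List;

class RunPreservingReplace {

    // Returns a copy of the runs in which every occurrence of 'original' in the
    // concatenated text is replaced by 'updated'. The number of runs never
    // changes: the replacement lands in the first run of each match.
    static List<String> replaceAcrossRuns(List<String> runs, String original, String updated) {
        List<String> result = new ArrayList<>(runs);
        if (original.isEmpty()) {
            return result;
        }
        int from = 0;
        while (true) {
            String joined = String.join("", result);
            int start = joined.indexOf(original, from);
            if (start < 0) {
                return result;
            }
            int end = start + original.length();
            int offset = 0; // index of the current run's first character in 'joined'
            boolean inserted = false;
            for (int i = 0; i < result.size(); i++) {
                String run = result.get(i);
                int runStart = offset;
                int runEnd = offset + run.length();
                offset = runEnd;
                if (runEnd <= start || runStart >= end) {
                    continue; // this run doesn't overlap the match
                }
                int cutFrom = Math.max(start, runStart) - runStart;
                int cutTo = Math.min(end, runEnd) - runStart;
                String replacement = inserted ? "" : updated;
                inserted = true;
                result.set(i, run.substring(0, cutFrom) + replacement + run.substring(cutTo));
            }
            from = start + updated.length(); // resume the search after the replacement
        }
    }
}
```

For the runs "Visit ", "Bael", "dung", and " today", replacing Baeldung with Hello yields "Visit ", "Hello", an empty run, and " today", so the new word inherits whatever formatting the run that held Bael had.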
5. Replacing Text in a .doc Document
Things are much more straightforward for .doc files. We can access a Range object that covers the whole document. We’re then able to modify the content of the range via its replaceText() method:
private HWPFDocument replaceText(HWPFDocument doc, String originalText, String updatedText) {
    Range range = doc.getRange();
    range.replaceText(originalText, updatedText);
    return doc;
}
Running this code leads to the following updated file:
As we can see, the replacement took place all over the file. We can also notice that the default behavior for texts spread over multiple runs is to keep the formatting of the first run.
6. Conclusion
In this article, we replaced a pattern in a Word document. In a .doc document, it was pretty straightforward. However, in a .docx document, we ran into some limitations with the naive implementation. We showed one way of overcoming them by making a simplifying assumption.
As always, the code is available over on GitHub.