1
votes

I want to do replacements in MS Word (.docx) document using regular expression (java RegEx):

Example: 
 …, с одной стороны, и %SOME_TEXT% именуемое в дальнейшем «Заказчик», в 
 лице  %SOME_TEXT%   действующего на основании %SOME_TEXT% с другой стороны, 
 заключили настоящий Договор о нижеследующем: …

I tried to get text templates (like %SOME_TEXT%) use Apache POI - XWPF and replace text, but replacement is not guaranteed, because POI separates runs => I get something like this(System.out.println(run.getText(0))):

…
, с одной стороны, и 
%
SOME_TEXT
%

именуемое 
в дальнейшем «Заказчик», в лице

%
SOME
_
TEXT
%

code example:

FileInputStream fis = new FileInputStream(new File("document.docx"));
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
paragraphs.forEach(para -> {
    para.getRuns().forEach(run -> {
        String text = run.getText(0);
        if (text != null) {
           System.out.println(text);
           // text replacement process
           // run.setText(newText,0);
        }
    });
});

I have found many similar questions (like this "Replacing a text in Apache POI XWPF "), but did not found answer to my problem (answer here "Seperated text line in Apache POI XWPFRun object" offer inconvenient solution).

I tried to use docx4j and this example => "docx4j find and replace", but docx4j works similar.

For docx4j, see stackoverflow.com/questions/17093781/… – JasonPlutext

I tried to use docx4j => documentPart.variableReplace(mappings);, but replacement not guaranteed(plutext/docx4j).

Did you use VariablePrepare? stackoverflow.com/a/17143488/1031689 – JasonPlutext

Yes, no results:

WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File("test.docx"));
HashMap<String, String> mappings = new HashMap<>();
VariablePrepare.prepare(wordMLPackage);//see notes
mappings.put("SOME_TEXT", "XXXX");
wordMLPackage.getMainDocumentPart().variableReplace(mappings);
wordMLPackage.save(new File("out.docx"));

Input\output text:

Input:
…, с одной стороны, и ${SOME_TEXT} именуемое в дальнейшем «Заказчик» ...
Output:
…, с одной стороны, и SOME_TEXT именуемое в дальнейшем «Заказчик» ...

To see your runs after VariablePrepare, turn on INFO level logging for VariablePrepare, or just System.out.println(wordMLPackage.getMainDocumentPart().getXML())

I understand that templates were separated to different Runs, but main question of the topic, how not to separate template to different Runs. I use System.out.println(wordMLPackage.getMainDocumentPart().getXML()) and saw:

<w:r>
   <w:t xml:space="preserve">, с одной стороны, и </w:t>
</w:r>
<w:r><w:t>$</w:t></w:r>
<w:r><w:t>{</w:t></w:r>
<w:r>
    <w:rPr>
       <w:rFonts w:eastAsia="Times-Roman"/>
          <w:color w:val="000000" w:themeColor="text1"/>
          <w:lang w:val="en-US"/>
    </w:rPr>
    <w:t>SOME</w:t>        <!-- First part of template: "SOME" -->
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:eastAsia="Times-Roman"/>
        <w:color w:val="000000" w:themeColor="text1"/>
    </w:rPr>
    <w:t>_</w:t>           <!-- Second part of template: "_"   -->
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:eastAsia="Times-Roman"/>
        <w:color w:val="000000" w:themeColor="text1"/>
        <w:lang w:val="en-US"/>
    </w:rPr>
    <w:t>TEXT</w:t>        <!-- Third part of template: "TEXT" -->
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:eastAsia="Times-Roman"/>
        <w:color w:val="000000" w:themeColor="text1"/>
    </w:rPr>
    <w:t>}</w:t>
</w:r>

, that template located in different xml tags and I do not understand WHY...

Please help me to find convenient approach to replace text.....

1
I tried to use docx4j => documentPart.variableReplace(mappings);, but replacement not guaranteed(see question updates).kozmo
Did you use VariablePrepare? stackoverflow.com/a/17143488/1031689JasonPlutext
First POI doesn't separate the runs. If the text is all in a single run, then POI will find it that way. If the text is broken up into multiple runs (lots of things you can do in Microsoft Word will cause this). Then any solution that retrieves text by run will suffer from the same issue. You have to consolidate the text from the runs into a single string. before comparing.jmarkmurphy
Thx, @jmarkmurphy, I know, but I would like to find a way to simply change text in .docx/doc file and do not afraid of separation template's runs.kozmo

1 Answers

4
votes

As you see, the approach "to do replacements in MS Word (.docx) document using regular expression (java RegEx)" is not really good since you never can be sure that the text to replace will be together in one text-run. Better approach is using fields (merge fields or form fields) or content controls in Word.

My favourites for such requirements are still the good old form fields in Word.

First advantage is that even without document protection it will not be possible formatting parts of form field content different and so tearing apart the form field content into different runs (but see note 1). Second advantage is that because of the gray background the form fields are good visible in document content. And another advantage is the possibility applying a document protection so that only filling the form fields will be possibly, even in Word' s GUI. This is really good for preserving such contractual documents from unwanted changings.

(Note 1): At least Word prevents formatting parts of form field content different and so tearing apart the form field content into different runs. Other word-processing software (Writer for example) may not respecting this restriction though.

So I would have the Word template like so:

enter image description here

The grey fields are the good old form Textfields in Word, named Text1, Text2 and Text3. Textfields blocks look like:

<xml-fragment w:rsidR="00833656" 
  ...
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
 ... >
  <w:rPr>
    <w:rFonts w:eastAsia="Times-Roman"/>
    <w:color w:themeColor="text1" w:val="000000"/>
    <w:lang w:val="en-US"/>
  </w:rPr>
    <w:fldChar w:fldCharType="begin">
      <w:ffData>
        <w:name w:val="Text1"/>
        <w:enabled w:val="0"/>
        <w:calcOnExit w:val="0"/>
        <w:textInput>
          <w:default w:val="<введите заказчика>"/>
        </w:textInput>
      </w:ffData>
    </w:fldChar>
  </xml-fragment>
</xml-fragment>

Then the following code:

import java.io.FileOutputStream;
import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;

import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.SimpleValue;
import javax.xml.namespace.QName;

public class WordReplaceTextInFormFields {

 private static void replaceFormFieldText(XWPFDocument document, String ffname, String text) {
  boolean foundformfield = false;
  for (XWPFParagraph paragraph : document.getParagraphs()) {
   for (XWPFRun run : paragraph.getRuns()) {
    XmlCursor cursor = run.getCTR().newCursor();
    cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:fldChar/@w:fldCharType");
    while(cursor.hasNextSelection()) {
     cursor.toNextSelection();
     XmlObject obj = cursor.getObject();
     if ("begin".equals(((SimpleValue)obj).getStringValue())) {
      cursor.toParent();
      obj = cursor.getObject();
      obj = obj.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:ffData/w:name/@w:val")[0];
      if (ffname.equals(((SimpleValue)obj).getStringValue())) {
       foundformfield = true;
      } else {
       foundformfield = false;
      }
     } else if ("end".equals(((SimpleValue)obj).getStringValue())) {
      if (foundformfield) return;
      foundformfield = false;
     }
    }
    if (foundformfield && run.getCTR().getTList().size() > 0) {
     run.getCTR().getTList().get(0).setStringValue(text);
//System.out.println(run.getCTR());
    }
   }
  }
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("WordTemplate.docx"));

  replaceFormFieldText(document, "Text1", "Моя Компания");
  replaceFormFieldText(document, "Text2", "Аксель Джоачимович Рихтер");
  replaceFormFieldText(document, "Text3", "Доверенность");

  FileOutputStream out = new FileOutputStream("WordReplaceTextInFormFields.docx");
  document.write(out);
  out.close();
  document.close();
 }
}

This code needs the full jar of all of the schemas ooxml-schemas-1.3.jar as mentioned in FAQ-N10025.

Produces:

enter image description here