1
votes

Is there any Windows app that will search for a string of text within fields in a Word (DOCX) document? Apps like Agent Ransack and its big brother FileLocator Pro can find strings in the Word docs but seem incapable of searching within fields.

For example, I would like to be able to find all occurrences of the string "getProposalTranslations" within a collection of Word documents that have fields with syntax like this:

{ AUTOTEXTLIST  \t "<wr:out select='$.shared_quote_info' datasource='getProposalTranslations'/>" }

Note that the string doesn't appear within the text of the document itself but only within a field. A DOCX file is essentially just a zip file, I believe, so a tool that can grep within archives might work. Note also that I need to search across hundreds or perhaps thousands of files in many directories, so unzipping the files by hand one by one isn't feasible. I haven't found anything on my own and thought I'd ask here. Thanks in advance.

1
Try this to get you started: unzip -p [filename] | egrep 'search|terms|here' - Dustin Nieffenegger
Thanks, Dustin. I updated my question to better emphasize that I need to do this across many files/directories, and unzipping all of them isn't feasible. Appreciate the suggestion, though! - B.Rossow
I don't think you can search the content of those files without unzipping all of them. A script can accomplish this quickly and remove the temporary files after they are done being used. I'll update you with a possible solution. - Dustin Nieffenegger

1 Answer

3
votes

This script should accomplish what you are trying to do. Let me know if that isn't the case. I don't usually write entire scripts because it can hurt the learning process, so I have commented each command so that you might learn from it.

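#!/bin/sh

# Remember where the script was started so relative file names still resolve
START_DIR=$(pwd)

# Create ~/tmp/WORDXML folder if it doesn't exist already
mkdir -p ~/tmp/WORDXML

# Change directory to ~/tmp/WORDXML (give up if that fails)
cd ~/tmp/WORDXML || exit 1

# Iterate through each file passed to this script
# "$@" is quoted so file names containing spaces stay in one piece
for FILE in "$@"; do

    # Turn a relative file name into an absolute one, because we changed directory
    case "$FILE" in
        /*) ;;
        *) FILE="$START_DIR/$FILE" ;;
    esac

    # unzip it into ~/tmp/WORDXML
    # > /dev/null 2>&1 discards all output to the terminal
    # -o overwrites without prompting
    unzip -o "$FILE" > /dev/null 2>&1

    # find all of the xml files and
    # open them in xmllint to make them pretty. Discard errors.
    find . -type f -name '*.xml' -exec xmllint --recover --format {} + 2> /dev/null |

    # search for and report if found
    grep 'getProposalTranslations' && echo " [^ found in file '$FILE']"

    # remove the temporary contents
    rm -rf ~/tmp/WORDXML/*

done

# Leave the temporary folder before removing it
cd "$START_DIR" || exit 1

# remove the temporary folder
rm -rf ~/tmp/WORDXML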

Save the script wherever you like. Name it whatever you like. I'll name it docxfind. Make it executable by running chmod +x docxfind. Then you can run the script like this (assuming your terminal is running in the same directory): ./docxfind filenames...