1
votes

I want to iterate through a list of ID numbers, match each one against the ID numbers in an XML file, and print the line below each match using Bash (and awk), either to the shell or redirected to a third file (output.txt).

Here is the breakdown:

ID_list.txt (shortened for this example - it has 100 IDs)

4414
4561
2132
999
1231
34
489
3213
7941

XML_example.txt (thousands of entries)

<book>
  <ID>4414</ID>
  <name>Name of first book</name>
</book>
<book>
  <ID>4561</ID>
  <name>Name of second book</name>
</book>

I'd like the output of the script to be the names of the 100 IDs from the first file:

Name of first book
Name of second book
etc

I believe it's possible to do this using Bash and awk with a for loop (for each ID in file 1, find the corresponding name in file 2). I think you can recursively grep for the ID number and then print the line below it using awk. Even if the output looked like this, I could remove the XML tags afterwards:

<name>Name of first book</name>
<name>Name of second book</name>

It's on a Linux server but I can port it over to PowerShell on Windows. I think BASH/GREP and AWK are the way to go.

Can someone help me script this?

4
Show us what you tried and what specifically you're having problems with - otherwise it looks like you want us to write it for you. - user2062950
Shell and/or awk is not the right choice for parsing XML. - chepner
@user2062950, you are right, apologies for not posting my version prior to asking. I was using while read; do and a for i in ID_list.txt solution, but Dogbane's solution(s) below were cleaner. - Mike J
It really isn't that terrible using BASH_REMATCH, though still obviously simpler in a language that includes a package to do it for you. - Reinstate Monica Please

4 Answers

3
votes

Given an ID, you can get the name using XPath expressions and the xmllint command, like this:

id=4414
name=$(xmllint --xpath "string(//book[ID[text()='$id']]/name)" books.xml)

So with this, you could write something like:

while read id; do
    name=$(xmllint --xpath "string(//book[ID[text()='$id']]/name)" books.xml)
    echo "$name"
done < id_list.txt

Unlike solutions involving awk, grep, and friends, this is using an actual XML parsing tool. This means that while most other solutions might break if they encountered:

<book><ID>4561</ID><name>Name of second book</name></book>

...this would work just fine.

xmllint is part of the libxml2 package, and is available on most distributions.

Note also that GNU awk can parse XML through the gawk-xml extension (formerly xgawk); it is not built into awk itself.

1
votes
$ awk '
NR==FNR{ ids["<ID>" $0 "</ID>"]; next }
found { gsub(/^.*<name>|<[/]name>.*$/,""); print; found=0 }
$1 in ids { found=1 }
' ID_list.txt XML_example.txt
Name of first book
Name of second book
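
A variant, in case the names are wanted in the order of ID_list.txt rather than the order of the XML file: read the XML first, build an ID-to-name map, then walk the ID list. This is only a sketch, not a real XML parser. It assumes each `<ID>` appears before its `<name>`, that neither value contains `<`, and it happens to cope with a whole `<book>` on one line. The function name `names_in_list_order` is made up for the example.

```shell
#!/bin/bash
# names_in_list_order IDS_FILE XML_FILE   (hypothetical helper name)
names_in_list_order() {
    awk '
    # First file (the XML): remember the most recent ID, then map it to its name.
    NR == FNR {
        if (match($0, /<ID>[^<]*<\/ID>/))
            id = substr($0, RSTART + 4, RLENGTH - 9)        # text between <ID> and </ID>
        if (id != "" && match($0, /<name>[^<]*<\/name>/))
            name[id] = substr($0, RSTART + 6, RLENGTH - 13) # text between <name> and </name>
        next
    }
    # Second file (the ID list): print the name for every ID that was seen.
    ($1 in name) { print name[$1] }
    ' "$2" "$1"
}
```

Called as `names_in_list_order ID_list.txt XML_example.txt`, it silently skips IDs that have no matching book.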
1
votes

Here's one way:

while IFS= read -r id
do
    grep -A1 "<ID>$id</ID>" XML_example.txt | grep "<name>"
done < ID_list.txt

Here's another way (one-liner). This is more efficient because it uses a single grep to extract all the IDs instead of looping. Note that the names come out in the order they appear in the XML file, not in the order of ID_list.txt:

grep -E -A1 "$(sed -e 's/^/<ID>/' -e 's/$/<\/ID>/' ID_list.txt | sed -e :a -e '$!N;s/\n/|/;ta')" XML_example.txt | grep "<name>"

Output:

<name>Name of first book</name>
<name>Name of second book</name>
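
The single-grep idea can also be written without building one long alternation, by feeding the patterns to grep as a list of fixed strings. A rough sketch follows, assuming GNU grep (for `-f -`, which reads patterns from stdin) and the same layout as above, with `<name>` on the line right after `<ID>`. The function name `lookup_names` is invented for the example, and the final sed strips the tags as well.

```shell
#!/bin/bash
# lookup_names IDS_FILE XML_FILE   (hypothetical helper name)
lookup_names() {
    local ids_file=$1 xml_file=$2
    # Turn each ID into the literal string <ID>...</ID>, one pattern per line.
    sed 's/.*/<ID>&<\/ID>/' "$ids_file" |
        # -F: fixed strings, -f -: patterns from stdin, -A1: also print the next line.
        grep -F -A1 -f - "$xml_file" |
        # Keep only the <name> lines and drop the surrounding tags.
        sed -n 's/.*<name>\(.*\)<\/name>.*/\1/p'
}
```

Usage would be `lookup_names ID_list.txt XML_example.txt > output.txt`. As with the one-liner, the output follows the order of the XML file.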
0
votes

I would go the BASH_REMATCH route if I had to do it in bash

 BASH_REMATCH
          An  array  variable  whose members are assigned by the =~ binary
          operator to the [[ conditional command.  The element with  index
          0  is  the  portion  of  the  string matching the entire regular
          expression.  The element with index n  is  the  portion  of  the
          string matching the nth parenthesized subexpression.  This vari‐
          able is read-only.

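So something like this (requires bash 4+ for the associative array):

#!/bin/bash

# Load the wanted IDs into an associative array.
declare -A wanted
while IFS= read -r id; do
  [[ $id ]] && wanted[$id]=1
done < "ID_list.txt"

print=
while read -r line; do
  # Print the name only if the previous line held a wanted ID.
  [[ $print ]] && [[ $line =~ "<name>"(.*)"</name>" ]] && echo "${BASH_REMATCH[1]}"

  if [[ $line =~ ^"<ID>"(.+)"</ID>"$ ]] && [[ ${wanted[${BASH_REMATCH[1]}]} ]]; then
    print=:
  else
    print=
  fi
done < "XML_example.txt"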

Example output

> abovescript
Name of first book
Name of second book