I have some HTML stripped out of an online dictionary that I want to convert to XML, for eventual conversion into a word list for a BK-tree. The online dictionary records variant spellings, but sometimes does so by putting a vowel or ending that may or may not appear in parentheses (which are not always closed), like so:
<td>
<span class="FORM">
<span class="HDORTH">a</span>
<span class="POS"> indef. art. </span> Also
<span class="ORTH">an</span>. Early forms: as subj.,
<span class="ORTH">ane</span>,
<span class="ORTH">on</span>,
<span class="ORTH">o</span>; as obj.,
<span class="ORTH">ane</span>,
<span class="ORTH">on(e</span>,
<span class="ORTH">o</span>, & (chiefly masc.)
<span class="ORTH">an(n)e</span>,
<span class="ORTH">æn(n)e</span>,
<span class="ORTH">en(n)e</span>,
<span class="ORTH">en</span>; after prep.,chiefly
<span class="ORTH">ane</span>,
<span class="ORTH">on(e</span>, masc. also
<span class="ORTH">anne</span>,
<span class="ORTH">æn(n)e</span>, fem. also
<span class="ORTH">anre</span>,
<span class="ORTH">are</span>,
<span class="ORTH">hare</span>,
<span class="ORTH">ore</span>; gen.
<span class="ORTH">anes</span>,
<span class="ORTH">æn(n)es</span>,
<span class="ORTH">en(n)es</span>.</span>
</td>
I've written the following XQuery to convert the HTML to XML, dropping any text that isn't inside a span and choosing the output element based on the class of each span:
declare function local:node-change($nodes as node()*) as node()* {
  for $span in $nodes
  return
    if ($span/@class = "HDORTH") then <headword>{$span/text()}</headword>
    else if ($span/@class = "POS") then <part_of_speech>{$span/text()}</part_of_speech>
    else if ($span/@class = "ORTH") then <variant>{$span/text()}</variant>
    else $span
};
<list>
{
  (: $collection is bound outside this snippet (e.g. as an external variable) :)
  let $collection := concat($collection, '?select=*.xml')
  let $q := collection($collection)
  for $y in $q
  let $s := $y/td/span/*
  let $c := local:node-change($s)
  (:let $l := local:stripleftparen($c):)
  (: numeric part of the file name, used for ordering and as the entry reference :)
  let $ref := number(substring(substring-before(tokenize(document-uri($y), "/")[last()], "."), 4))
  order by $ref
  return
    <entry ref="{$ref}">{$c}</entry>
}
</list>
This returns the following XML:
<entry ref="3">
<headword>a</headword>
<part_of_speech> indef. art. </part_of_speech>
<variant>an</variant>
<variant>ane</variant>
<variant>on</variant>
<variant>o</variant>
<variant>ane</variant>
<variant>on(e</variant>
<variant>o</variant>
<variant>an(n)e</variant>
<variant>æn(n)e</variant>
<variant>en(n)e</variant>
<variant>en</variant>
<variant>ane</variant>
<variant>on(e</variant>
<variant>anne</variant>
<variant>æn(n)e</variant>
<variant>anre</variant>
<variant>are</variant>
<variant>hare</variant>
<variant>ore</variant>
<variant>anes</variant>
<variant>æn(n)es</variant>
<variant>en(n)es</variant>
</entry>
What I need to do now is clone the nodes that have parens, so that I can modify the clone and end up with one variant without the optional letters and one with them. The result should look like the following, but I'm not sure how to get there:
<entry ref="3">
<headword>a</headword>
<part_of_speech> indef. art. </part_of_speech>
<variant>an</variant>
<variant>ane</variant>
<variant>on</variant>
<variant>o</variant>
<variant>ane</variant>
<variant>on</variant>
<variant>one</variant>
<variant>o</variant>
<variant>ane</variant>
<variant>anne</variant>
<variant>æne</variant>
<variant>ænne</variant>
<variant>ene</variant>
<variant>enne</variant>
<variant>en</variant>
<variant>ane</variant>
<variant>on</variant>
<variant>one</variant>
<variant>anne</variant>
<variant>æne</variant>
<variant>ænne</variant>
<variant>anre</variant>
<variant>are</variant>
<variant>hare</variant>
<variant>ore</variant>
<variant>anes</variant>
<variant>ænes</variant>
<variant>ænnes</variant>
<variant>enes</variant>
<variant>ennes</variant>
</entry>
I know I need to use substring, substring-before, or substring-after to actually modify the text of the node; where I'm having problems is the cloning itself. Using copy doesn't work within the for/return loop, and everything I've found online either suggests exactly that for copying nodes or talks about de-duplicating data (which I'll need to do eventually, but only once the output is otherwise exactly as I want it). How can I copy a node, modify the copy, and output both versions so that I get the result shown above?
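To make the goal concrete, here is a minimal sketch of the transformation described above. It relies on the fact that an XQuery function can return a sequence of newly constructed elements, so an explicit node copy may not be needed at all. The helper name local:expand is hypothetical (it is not part of my query), and the sketch assumes each form contains at most one parenthesized group, which holds for the sample entry but may not hold for the whole dictionary.

```xquery
(: Hypothetical helper: for a form such as "an(n)e" or "on(e", return the
   spelling without the optional letters, then the spelling with them.
   Forms without "(" are returned unchanged.
   Assumption: at most one parenthesized group per form. :)
declare function local:expand($form as xs:string) as xs:string* {
  if (contains($form, "(")) then
    let $before := substring-before($form, "(")
    let $rest := substring-after($form, "(")
    (: "an(n)e" has a closing paren; "on(e" does not :)
    let $closed := contains($rest, ")")
    let $optional := if ($closed) then substring-before($rest, ")") else $rest
    let $after := if ($closed) then substring-after($rest, ")") else ""
    return (concat($before, $after), concat($before, $optional, $after))
  else
    $form
};

(: Build one <variant> per expanded spelling. :)
<entry>
{
  for $form in ("an", "on(e", "an(n)e")
  for $spelling in local:expand($form)
  return <variant>{$spelling}</variant>
}
</entry>
```

For the three sample forms this should yield the variants an, on, one, ane, anne, in that order. In the query above, the same idea would mean changing the ORTH branch of local:node-change to loop over local:expand(string($span)) and wrap each result in a variant element, rather than trying to copy and edit an existing node.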