
I'm using Lucene to search through an index of XML documents. I'm supposed to look for documents that have certain words inside certain tags. What would be the best way to go about this?

I tried to use RegexQuery with something like "tag.*?word.*?tag", but that returned no results.

To clarify, here is an example of one of the XML documents:

<?xml version="1.0" encoding="utf-8"?>
<Legislation>
    <ENTRY COLNAME="COL1">
    <LegBody_1_1 ID="KEY_3">
        <ParagraphNum REFID="284:1" JUMP_LINK_KEY="0">1. </ParagraphNum>In the following paragraphs - </LegBody_1_1>
        <LegBody_1_2 ID="KEY_4">
            <Term>"Legal Guardian" </Term>
            <Definition> - a person to whom legal title to property is entrusted to use for another's benefit; </Definition>
        </LegBody_1_2>
        <LegBody_1_2 ID="KEY_5">
            <Term>"Authority" </Term>
            <Definition> - Any civil servant appointed by the department head or minister; </Definition>
        </LegBody_1_2>

.... more tags..

</Legislation>

A search looking for the word "legal" in the tag "definition" ("definition.*?legal.*?definition") should return this document.

Any ideas?


2 Answers


I'd also explore native XML databases. eXist-db (http://exist-db.org) has Lucene built in, so you can keep your XML intact and query the structure with XQuery while applying Lucene indexes.
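
To show what "querying the structure" buys you over running a regex across the raw text, here is a minimal Python sketch that uses only the standard library. It is not eXist-db, XQuery, or Lucene, just an illustration of tag-scoped matching. The function name `tag_contains_word` and the trimmed, well-formed `SAMPLE` are my own inventions for this example, based on the document in the question.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the question's document (hypothetical sample).
SAMPLE = """<Legislation>
    <LegBody_1_2 ID="KEY_4">
        <Term>"Legal Guardian" </Term>
        <Definition> - a person to whom legal title to property is entrusted to use for another's benefit; </Definition>
    </LegBody_1_2>
    <LegBody_1_2 ID="KEY_5">
        <Term>"Authority" </Term>
        <Definition> - Any civil servant appointed by the department head or minister; </Definition>
    </LegBody_1_2>
</Legislation>"""


def tag_contains_word(xml_text, tag, word):
    """Return True if any element named `tag` contains `word` as a whole word.

    Both the tag name and the word are compared case-insensitively.
    """
    root = ET.fromstring(xml_text)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for element in root.iter():
        if element.tag.lower() == tag.lower():
            # itertext() also collects text from nested child elements.
            text = "".join(element.itertext())
            if pattern.search(text):
                return True
    return False


if __name__ == "__main__":
    print(tag_contains_word(SAMPLE, "definition", "legal"))
    print(tag_contains_word(SAMPLE, "definition", "guardian"))
```

The first call matches because "legal" appears inside a Definition element. The second does not, because "Guardian" only appears inside a Term element. A structure-aware store lets you express the same scoping in the query itself instead of scanning every document like this.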