2
votes

I am new in lucene I want to indexing with lucene of large xml files(15GB) that contain plain text as well as attribute and so many xml tags. how to parse and indexing this xml file using lucene with any sample and if we use lucene we need any database

How to parse and index huge xml file using lucene ? Any sample or links would be helpful to me to understand the process. Another one, if I use lucene, will I need any database, as I have seen and done indexing with Databases..

2

2 Answers

1
votes

Your indexing would be build as you would have done using a database, just iterate through all data you want to index and write it to the index. Just go with the XmlReader class to parse your xml in a forward-only fashion. You will, just as with a database, need to index some kind of primary-key so you know what the search result represents.

A database helps when it comes to looking up the indexed data from the primary-key. It will be messy to read the data for a primary-key if you need to iterate a 15 GiB xml file at every request.

A database is not required, but it helps a lot. I would build this as an import tool that reads your xml, dumps it into your database, and then use your "normal" database indexing code you've built before.

0
votes

You might like to look at Michael Sokolov's Lux product, which combines Lucene and Saxon:

http://www.mail-archive.com/[email protected]/msg84102.html

I haven't used it myself and can't claim to fully understand its capabilities.