Lucene indexing html documents

Question

I would like to index 1 million HTML documents in Lucene. I need to index several HTML files in one Lucene document. Later, in the search response, I would like to know which original HTML document matched.

So, for example I have:

1.home.html
2.history.html
3.about.html

4.home2.html
...

I want to index 1, 2 and 3 in the same Lucene document. Then, when I search for any text, I want to know which original document (home, history or about) matched.

I have been searching the Internet and found Lucene payloads, so I have been thinking about storing the URL of the original document in the payload of every term. Is this a good solution? Would the performance be all right?

Thanks very much for your help.

Are you storing only the names of the HTML files, or their whole content? — SSaikia_JtheRocker
Payloads might provide an acceptable solution. A good solution would be to store the pages as separate documents. Why do you want to index these three pages in the same document? — femtoRgon
I am storing the whole content of the documents, and I would also like to store their names. I have already implemented the separate-pages solution and it works perfectly, but I need to search by group (e.g. home, history and about), as I said before, and the only way I have found is using payloads. — Hibernator
What about payloads and highlighting the paragraphs? Would that be OK? — Hibernator

Hibernator · Accepted Answer · 2013-07-09T12:45:38

I have been working on this problem for two days and I think I have found the solution.

I index each HTML page as its own Lucene document and add a field holding an ID that is shared by all the pages of the same group, for example:

1.home.html     id1  htmlcontent
2.history.html  id1  htmlcontent
3.about.html    id1  htmlcontent

4.home2.html    id2  htmlcontent
...

Later I can use the org.apache.lucene.search.grouping package to group the results by this ID.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html
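To make the layout above concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not use the grouping module's API: it is only an in-memory model of the idea, where each page is its own record carrying a shared group ID, and the hits are grouped by that ID after matching. The class name GroupedSearchSketch, the Page fields, the sample contents and the naive substring match are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class GroupedSearchSketch {

    /** Stands in for one Lucene document: one HTML page plus its group ID. */
    public static final class Page {
        final String fileName; // original HTML file, e.g. "home.html"
        final String groupId;  // ID shared by the pages that belong together
        final String content;  // text of the page (hypothetical sample text)

        Page(String fileName, String groupId, String content) {
            this.fileName = fileName;
            this.groupId = groupId;
            this.content = content;
        }
    }

    /** Hypothetical sample data mirroring the table in the answer. */
    public static List<Page> sampleIndex() {
        return Arrays.asList(
            new Page("home.html", "id1", "Welcome to our site."),
            new Page("history.html", "id1", "The company was founded in 1999."),
            new Page("about.html", "id1", "About the team. Welcome aboard."),
            new Page("home2.html", "id2", "Welcome to the second site."));
    }

    /**
     * Finds the pages whose content contains the term (case-insensitive
     * substring match, a stand-in for a real query) and groups the matching
     * file names by group ID, in order of first appearance.
     */
    public static Map<String, List<String>> searchGrouped(List<Page> index, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Page page : index) {
            if (page.content.toLowerCase(Locale.ROOT).contains(needle)) {
                groups.computeIfAbsent(page.groupId, k -> new ArrayList<>())
                      .add(page.fileName);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Each group lists the original pages that matched the search term.
        Map<String, List<String>> groups = searchGrouped(sampleIndex(), "welcome");
        for (Map.Entry<String, List<String>> entry : groups.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Because every page stays a separate record, each hit still tells you which original file matched, while the group ID gives you the "home, history and about" bundle. In a real index the grouping package linked above would do the grouping step on the ID field.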

Hope this helps somebody :)
