0
votes

I would like to index 1 million HTML documents in Lucene. I need to index several HTML files in a single Lucene document. Later, in the search response, I would like to know which original HTML document matched.

So, for example I have:

1.home.html
2.history.html
3.about.html

4.home2.html
...

I want to index 1, 2 and 3 in the same Lucene document. Then, if I search for any text, I want to know which original document it came from (home, history or about).

I have been searching on the Internet and I found Lucene payloads, so I have been thinking about storing the URL of the original document as a payload on every term. Is this a good solution? Would the performance be all right?

Thanks very much for your help.

3
Are you storing only the names of the HTML files, or their whole content? - SSaikia_JtheRocker
Payloads might provide an acceptable solution. A good solution would be to store the pages as separate documents. Why do you want to index these three pages in the same document? - femtoRgon
I am storing the whole content of the documents, and I would also like to store their names. I have already implemented the separate-pages solution and it works perfectly, but I need to search by group (e.g. home, history and about), as I said before, and the only way I have found is to use payloads. - Hibernator
What about payloads and highlighting the paragraphs? Would that be OK? - Hibernator

3 Answers

1
votes

I have been working on this problem for two days and I think I have found the solution.

I index every HTML page as its own Lucene document and give the pages that belong together the same ID, for example:

1.home.html     id1  htmlcontent
2.history.html  id1  htmlcontent
3.about.html    id1  htmlcontent

4.home2.html    id2  htmlcontent
...

Later, I can use org.apache.lucene.search.grouping to group the results by this ID.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html
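To make the idea concrete, here is a minimal sketch in plain Java of what grouping by a shared ID gives you. It only models the concept in memory and is not the Lucene grouping API; the names `GroupingSketch`, `Hit`, `fileName`, `groupId` and `groupByGroupId` are made up for illustration. The point is that each hit keeps its own file name, so the original page is still known after the hits are grouped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of grouping search hits by a shared group ID.
// NOT the Lucene API: every name in this class is hypothetical.
class GroupingSketch {

    // One search hit: the page that matched and the group it belongs to.
    static final class Hit {
        final String fileName;
        final String groupId;

        Hit(String fileName, String groupId) {
            this.fileName = fileName;
            this.groupId = groupId;
        }
    }

    // Collects the file names of the hits under their group ID,
    // keeping the order in which groups and hits were first seen.
    static Map<String, List<String>> groupByGroupId(List<Hit> hits) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Hit hit : hits) {
            groups.computeIfAbsent(hit.groupId, id -> new ArrayList<>())
                  .add(hit.fileName);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("home.html", "id1"),
                new Hit("home2.html", "id2"),
                new Hit("about.html", "id1"));
        // prints {id1=[home.html, about.html], id2=[home2.html]}
        System.out.println(groupByGroupId(hits));
    }
}
```

In a real index the group ID and the file name would be fields on each Lucene document, and the grouping module would do this collection step for you at search time.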

Hope this helps somebody :)

0
votes

I think what you need is Apache Solr (http://lucene.apache.org/solr/). It uses Lucene as its indexing engine and provides a query/web interface for searching.

Have a look at the tutorial on the site: http://lucene.apache.org/solr/4_3_1/tutorial.html

0
votes

Grouping and faceting are two different Lucene features:

1. Grouping: this lets you group search results by a specified field. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group. The output is a kind of tree.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html

2. Faceting: this feature doesn't group documents; it just tells you how many documents have each value of a facet. For example, if you have a facet based on the author field, you will receive a list of all your authors, and for each author you will know how many documents belong to that author. Afterwards, if you want to see those documents, you have to query again with a specific filter added (author=whatever). Faceted search is in fact based on browsing documents by applying multiple filters to progressively reach the documents you're really interested in.
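The difference between the two can be sketched in plain Java. This is only an in-memory illustration under my own assumptions, not the Lucene grouping or facet API; `FacetVsGroupingSketch`, `Doc`, `groupByAuthor`, `facetCounts` and `drillDown` are hypothetical names. Grouping returns the documents themselves under each author, faceting returns only a count per author, and a second filtered lookup is needed to get at the documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java contrast between grouping and faceting on an "author" field.
// NOT the Lucene API: every name in this class is hypothetical.
class FacetVsGroupingSketch {

    static final class Doc {
        final String title;
        final String author;

        Doc(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    // Grouping: the documents themselves, arranged under each author.
    static Map<String, List<String>> groupByAuthor(List<Doc> docs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Doc doc : docs) {
            groups.computeIfAbsent(doc.author, a -> new ArrayList<>()).add(doc.title);
        }
        return groups;
    }

    // Faceting: only the number of documents per author, no documents.
    static Map<String, Integer> facetCounts(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            counts.merge(doc.author, 1, Integer::sum);
        }
        return counts;
    }

    // The follow-up "query with a filter" step used after picking a facet value.
    static List<String> drillDown(List<Doc> docs, String author) {
        List<String> titles = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.author.equals(author)) {
                titles.add(doc.title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("Intro", "alice"),
                new Doc("Guide", "bob"),
                new Doc("FAQ", "alice"));
        System.out.println(groupByAuthor(docs));
        System.out.println(facetCounts(docs));
        System.out.println(drillDown(docs, "alice"));
    }
}
```

For the question above, grouping is the closer fit, because the search response needs the matching pages themselves and not just a count per group.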

Here are some tutorials:

http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/doc-files/userguide.html

http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/search/package-summary.html

Go through them and adapt them to your needs.