0
votes

I am using Apache Lucene to index HTML files, and I store the path of each HTML file in the Lucene index. The index is being written correctly; I have checked it all in Luke. But when I search for the path of a file, the number of documents returned is far too high. I want the search to match the exact path as it was stored in the Lucene index. I am using the following code.

Index creation:


    try {
        File indexDir = new File("d:/abc/");
        // 'true' creates a new index, replacing any existing one in indexDir
        IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
        indexWriter.setUseCompoundFile(false);
        Document doc = new Document();
        // f is defined elsewhere (presumably the file being indexed)
        String path = f.getCanonicalPath();
        // the path is stored and also analyzed (tokenized) by SimpleAnalyzer
        doc.add(new Field("fpath", path,
            Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
        indexWriter.optimize();
        indexWriter.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }



The following is the code for searching the file path:

    File indexDir = new File("d:/abc/");
    int maxhits = 10000000;
    int len = 0;
    try {
        Directory directory = FSDirectory.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(directory, true);  // true = read-only
        QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
        Query query = parser.parse(path);
        query.setBoost(1.5f);
        TopDocs topDocs = searcher.search(query, maxhits);
        ScoreDoc[] hits = topDocs.scoreDocs;
        len = hits.length;
        JOptionPane.showMessageDialog(null, "items found" + len);
    } catch (Exception ex) {
        ex.printStackTrace();
    }

It reports the number of documents found as the total number of documents in the index, although the file path I searched for exists only once.

1 Answer

1
votes

You are analyzing the path, which splits it into separate terms (SimpleAnalyzer breaks the text at every non-letter character and lowercases it). The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, and QueryParser combines terms with OR by default, so any search that includes catalog without forcing all terms to be mandatory will return all documents.

You need a search query like (using the example above):

+catalog +products +versions

to force all terms to be present.
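To see the effect without touching an index, here is a small stand-alone sketch in plain Java (no Lucene involved). It only imitates letter-based tokenization, roughly what SimpleAnalyzer does, and then counts matches with "any term" (OR) versus "all terms" (AND) semantics. The class name, method names, and the two extra sample paths are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PathTermsDemo {

    // Imitation of a letter-based analyzer: split at non-letters, lowercase.
    // This is an approximation for illustration, not Lucene's own code.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<String>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase());
            }
        }
        return out;
    }

    // Counts paths that share at least one term with the query (requireAll = false)
    // or that contain every query term (requireAll = true).
    static int countMatches(List<String> paths, String query, boolean requireAll) {
        Set<String> queryTerms = terms(query);
        int count = 0;
        for (String path : paths) {
            Set<String> docTerms = terms(path);
            boolean hit;
            if (requireAll) {
                hit = docTerms.containsAll(queryTerms);
            } else {
                Set<String> common = new HashSet<String>(docTerms);
                common.retainAll(queryTerms);
                hit = !common.isEmpty();
            }
            if (hit) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample paths; all of them share the root term "catalog".
        List<String> paths = Arrays.asList(
            "/catalog/products/versions",
            "/catalog/products/SKUs",
            "/catalog/orders/archive");
        String query = "/catalog/products/versions";

        System.out.println("OR:  " + countMatches(paths, query, false)); // prints "OR:  3"
        System.out.println("AND: " + countMatches(paths, query, true));  // prints "AND: 1"
    }
}
```

With OR semantics every path matches because they all contain catalog; requiring all terms narrows the result to the single intended path.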

Note that this gets more complicated if the same set of terms can occur in different orders, like:

/catalog/products/versions
/versions/catalog/products/SKUs

In that case, you need a different Lucene tokenizer than the one in the analyzer you are using (SimpleAnalyzer in the code above), for example one that keeps the whole path as a single term.
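One way to get an exact match, sketched here under the assumption that Lucene 3.6 is on the classpath, is to skip tokenization for the path field altogether: index it with Field.Index.NOT_ANALYZED and look it up with a TermQuery instead of going through QueryParser. The class name ExactPathSearch and the in-memory RAMDirectory are illustrative choices, not taken from the question.

```java
import java.io.IOException;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ExactPathSearch {

    // Indexes each path as ONE un-analyzed term, then counts exact matches for 'wanted'.
    static int countExactMatches(String[] storedPaths, String wanted) throws IOException {
        Directory dir = new RAMDirectory(); // in-memory index, for the sketch only
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, new SimpleAnalyzer(Version.LUCENE_36)));
        for (String path : storedPaths) {
            Document doc = new Document();
            // NOT_ANALYZED: the analyzer is bypassed and the whole path becomes a single term
            doc.add(new Field("fpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir, true); // true = read-only
        try {
            // TermQuery compares against the indexed term as-is; no parsing, no analysis
            TopDocs hits = searcher.search(new TermQuery(new Term("fpath", wanted)), 10);
            return hits.totalHits;
        } finally {
            searcher.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String[] paths = {
            "/catalog/products/versions",
            "/versions/catalog/products/SKUs"
        };
        System.out.println(countExactMatches(paths, "/catalog/products/versions"));
    }
}
```

With this approach the search string has to match the stored string character for character (same case, same separators), so the path should be produced the same way at index time and at search time, for example with getCanonicalPath() in both places.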