2 votes

I'm working with Lucene 7.4 and have indexed a sample of txt files. Some Fields, such as path and filename, are stored, while the content Field was unstored before the doc was passed to the IndexWriter. Consequently, my content Field contains the processed (e.g. tokenized, stemmed) content of the file, while my filename Field contains the unprocessed filename as a whole String.

try (InputStream stream = Files.newInputStream(file)) {

    // create empty document
    Document doc = new Document();

    // add the last modification time field (stored only)
    Field lastModField = new StoredField(LuceneConstants.LAST_MODIFICATION_TIME,
            Files.getAttribute(file, "lastModifiedTime", LinkOption.NOFOLLOW_LINKS).toString());
    doc.add(lastModField);

    // add the path Field (indexed as a single token, stored)
    Field pathField = new StringField(LuceneConstants.FILE_PATH, file.toString(), Field.Store.YES);
    doc.add(pathField);

    // add the name Field (indexed as a single token, stored)
    doc.add(new StringField(LuceneConstants.FILE_NAME, file.getFileName().toString(), Field.Store.YES));

    // add the content (tokenized and analyzed, not stored)
    doc.add(new TextField(LuceneConstants.CONTENTS, new BufferedReader(new InputStreamReader(stream))));

    System.out.println("adding " + file);
    writer.addDocument(doc);
}

Now, as far as I understand, I have to use 2 QueryParsers, since I need 2 different Analyzers, one for each of the two fields I'm searching over. I can't figure out how to combine them. What I want is a TopDocs in which the results are ordered by a relevance score that is some combination of the 2 relevance scores from the search over the filename Field and the search over the content Field. Does Lucene 7.4 provide the means for an easy solution to this?

PS: This is my first post in a long time, if not ever. Please point out any formatting or content issues.

EDIT: Analyzer used for indexing content Field and for searching content Field:

Analyzer myTxtAnalyzer = CustomAnalyzer.builder()
                    .withTokenizer("standard")
                    .addTokenFilter("lowercase")
                    .addTokenFilter("stop")
                    .addTokenFilter("porterstem")
                    .build();

And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.

My program is supposed to index files and search over that index, retrieving a list of the most relevant documents. If a searchString, which may contain whitespaces, exactly matches the fileName, I'd like that to heavily impact my search results.

I'm a computer science student, and this is my first project with Lucene. If there are no functions available for this, that's fine; it's not a requirement for my task. I'm just wondering whether a simple solution already exists, because I can't seem to find one.

EDIT 2: I had a misconception about what happens when using Field.Store.YES/.NO; my problem has nothing to do with it. The String wasn't tokenized because it was in a StringField, not because it was stored, as I had assumed. However, my question remains: is there a way to search over tokenized and untokenized Fields concurrently?

Can you clarify why you need 2 different analyzers? This is usually only the case if you have different tokenizing/filtering needs for different fields. But using these different field types (StringField and TextField) does not force you to do this, because the StringField is indexed "as-is" - tokenization is skipped. Maybe if you edit your question to show the analyzers and the query parsers you currently need to use, the problem will be clearer. – andrewJames

1 Answer

1 vote

As @andrewjames mentions, you don't need multiple analyzers in your example, because only the TextField gets analyzed; the StringFields do not. If you did have a situation where different fields need different analyzers, Lucene can accommodate that with a PerFieldAnalyzerWrapper, which lets you specify a default Analyzer plus as many field-specific analyzers as you like (passed to PerFieldAnalyzerWrapper as a Map of field name to Analyzer). When analyzing a doc, it uses the field-specific analyzer if one was specified for that field, and otherwise falls back to the default analyzer you gave the PerFieldAnalyzerWrapper.
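For example, a rough sketch of setting one up (the field names and myTxtAnalyzer are taken from your question; using a StandardAnalyzer as the default is just an illustrative choice):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// field-specific analyzers, keyed by field name
Map<String, Analyzer> perFieldAnalyzers = new HashMap<>();
perFieldAnalyzers.put(LuceneConstants.CONTENTS, myTxtAnalyzer);          // your CustomAnalyzer
perFieldAnalyzers.put(LuceneConstants.FILE_NAME, new KeywordAnalyzer());

// any field not listed above falls back to the default analyzer (StandardAnalyzer here)
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perFieldAnalyzers);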

Whether you use a single analyzer or several via PerFieldAnalyzerWrapper, you only need one QueryParser: you pass it either the single analyzer or the PerFieldAnalyzerWrapper, which is itself an Analyzer that wraps the others.
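Continuing the sketch above, it could look something like this (assuming you have an existing IndexSearcher named searcher; the query text is a placeholder):

import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// one parser is enough; the wrapper picks the right analyzer per field at query time
QueryParser parser = new QueryParser(LuceneConstants.CONTENTS, analyzer);
Query query = parser.parse("the user's search string");  // parse(...) can throw ParseException
TopDocs hits = searcher.search(query, 10);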

The fact that some of your fields are stored and some are not stored has no impact on searching them. The only thing that matters for the search is that the field is indexed, and both StringFields and TextFields are always indexed.

You mention the following:

And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.

Whether a field is stored or not has nothing to do with whether it's analyzed. For the filename field your code is using a StringField with Field.Store.YES. Because it's a StringField it will be indexed BUT not analyzed, and because you specified to store the field it will be stored. So since the field is NOT analyzed, it won't be using the KeywordAnalyzer or any other analyzer :-)

Is there a way to search over tokenized and untokenized Fields concurrently?

The real issue here isn't about searching tokenized and untokenized fields concurrently; it's really just about searching multiple fields concurrently. The fact that one is tokenized and one is not is of no consequence to Lucene. To search multiple fields at once you can use a BooleanQuery, to which you add one sub-query per field and specify an AND (i.e. Occur.MUST) or an OR (i.e. Occur.SHOULD) relationship between the sub-queries.
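For example, a rough sketch using your two fields (the boost factor of 5 and the query string are placeholders, and parser and searcher are assumed to exist as in the earlier sketches):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

String searchString = "the user's search string";

// analyzed query against the content field
Query contentQuery = parser.parse(searchString);

// exact, untokenized match against the filename field, boosted so exact filename hits rank high
Query nameQuery = new BoostQuery(
        new TermQuery(new Term(LuceneConstants.FILE_NAME, searchString)), 5.0f);

// SHOULD ~ OR: documents matching either clause are returned; matching both scores highest
BooleanQuery combined = new BooleanQuery.Builder()
        .add(contentQuery, BooleanClause.Occur.SHOULD)
        .add(nameQuery, BooleanClause.Occur.SHOULD)
        .build();

TopDocs results = searcher.search(combined, 10);

Because both clauses are SHOULD, a document that matches only the content still comes back, but one whose filename exactly equals the search string gets the extra boosted score, which addresses the "exact filename match should heavily impact the results" part of your question.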

I hope this helps clear things up for you.