I'm working with Lucene 7.4 and have indexed a sample of txt files. Some Fields are stored, such as path and filename; the content Field was left unstored before the doc was passed to the IndexWriter. Consequently, my content Field contains the processed (e.g. tokenized, stemmed) content of the file, while my filename Field contains the unprocessed filename as a single String.
try (InputStream stream = Files.newInputStream(file)) {
    // create an empty document
    Document doc = new Document();
    // add the last modification time Field (stored only, not searchable)
    Field lastModField = new StoredField(LuceneConstants.LAST_MODIFICATION_TIME, Files.getAttribute(file, "lastModifiedTime", LinkOption.NOFOLLOW_LINKS).toString());
    doc.add(lastModField);
    // add the path Field (indexed as a single token and stored)
    Field pathField = new StringField(LuceneConstants.FILE_PATH, file.toString(), Field.Store.YES);
    doc.add(pathField);
    // add the name Field (indexed as a single token and stored)
    doc.add(new StringField(LuceneConstants.FILE_NAME, file.getFileName().toString(), Field.Store.YES));
    // add the content (tokenized by the analyzer, not stored)
    doc.add(new TextField(LuceneConstants.CONTENTS, new BufferedReader(new InputStreamReader(stream))));
    System.out.println("adding " + file);
    writer.addDocument(doc);
}
Now, as far as I understand, I have to use 2 QueryParsers, since I need a different Analyzer for searching each of the two fields. I can't figure out how to combine them. What I want is a TopDocs in which the results are ordered by a relevance score that is some combination of the 2 relevance scores from the search over the filename Field and the search over the content Field. Does Lucene 7.4 provide an easy way to do this?
PS: This is my first post in a long time, if not ever. Please point out any formatting or content issues.
EDIT: Analyzer used for indexing content Field and for searching content Field:
Analyzer myTxtAnalyzer = CustomAnalyzer.builder()
    .withTokenizer("standard")
    .addTokenFilter("lowercase")
    .addTokenFilter("stop")
    .addTokenFilter("porterstem")
    .build();
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
My program is supposed to index files and search over that index, retrieving a list of the most relevant documents. If a searchString, which may contain whitespace, exactly matches the fileName, I'd like that to heavily impact my search results.
I'm a computer science student, and this is my first project with Lucene. If there are no functions available, it's all good. What I'm asking for is not a requirement for my task. I'm just pondering, and I feel like a simple solution for this might already exist. But I can't seem to find it, if it exists.
EDIT 2: I had a misconception about what happens when using Field.Store.YES/NO. My problem has nothing to do with it. The String wasn't tokenized because it was in a StringField, not because it was stored, as I had assumed. However, my question remains: is there a way to search over tokenized and untokenized Fields concurrently?
... (StringField and TextField) does not force you to do this, because the StringField is indexed "as-is" - tokenization is skipped. Maybe if you edit your question to show the analyzers and the query parsers you currently need to use, the problem will be clearer. – andrewJames
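
The point in the comment above can be sketched roughly as follows. This is only one possible approach, not a confirmed answer from the thread: because the StringField is indexed as a single untokenized term, a plain TermQuery holding the whole search string can match it without any analyzer, and it can be combined with a parsed query on the content Field in one BooleanQuery. The class name, the field names "filename" and "contents" (standing in for the LuceneConstants values), the sample documents, and the boost of 10 are all assumptions made for illustration. The sketch assumes lucene-core, lucene-analyzers-common and lucene-queryparser 7.4 are on the classpath.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class CombinedFieldSearch {

    // Hypothetical field names standing in for LuceneConstants.FILE_NAME / CONTENTS.
    static final String FILE_NAME = "filename";
    static final String CONTENTS = "contents";

    // Assumed boost: how much more an exact filename match counts than a content match.
    static final float FILE_NAME_BOOST = 10f;

    /** Combines an exact filename match and an analyzed content search into one query. */
    static Query buildQuery(String searchString, Analyzer contentAnalyzer) throws ParseException {
        // The StringField holds the filename as one term, so no analyzer is involved here.
        Query nameQuery = new BoostQuery(
                new TermQuery(new Term(FILE_NAME, searchString)), FILE_NAME_BOOST);
        // The content Field is searched with the same analyzer that was used for indexing.
        Query contentQuery = new QueryParser(CONTENTS, contentAnalyzer)
                .parse(QueryParser.escape(searchString));
        // SHOULD clauses: a document may match either one; matching scores are summed.
        return new BooleanQuery.Builder()
                .add(nameQuery, BooleanClause.Occur.SHOULD)
                .add(contentQuery, BooleanClause.Occur.SHOULD)
                .build();
    }

    static Document doc(String fileName, String contents) {
        Document doc = new Document();
        doc.add(new StringField(FILE_NAME, fileName, Field.Store.YES));
        doc.add(new TextField(CONTENTS, contents, Field.Store.NO));
        return doc;
    }

    /** Indexes two made-up documents in memory and returns the filename of the top hit, or null. */
    static String topFileName(String searchString) {
        try (Directory dir = new RAMDirectory()) {
            Analyzer analyzer = CustomAnalyzer.builder()
                    .withTokenizer("standard")
                    .addTokenFilter("lowercase")
                    .addTokenFilter("stop")
                    .addTokenFilter("porterstem")
                    .build();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                writer.addDocument(doc("my notes.txt", "shopping list for the weekend"));
                writer.addDocument(doc("report.txt", "my notes on the quarterly report"));
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(buildQuery(searchString, analyzer), 10);
                if (hits.scoreDocs.length == 0) {
                    return null;
                }
                return searcher.doc(hits.scoreDocs[0].doc).get(FILE_NAME);
            }
        } catch (IOException | ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(topFileName("my notes.txt"));
    }
}
```

With these sample documents, searching for "my notes.txt" matches the second file only through the word "my" in its content, while the first file matches the boosted filename clause, so the exact filename match should rank first. The size of the boost is a tuning choice; a single QueryParser backed by a PerFieldAnalyzerWrapper (KeywordAnalyzer for the filename Field, the custom analyzer for the content Field) may be an alternative worth checking if query syntax across both fields is needed.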