I have 1.2M lines to index; each line is added as a separate document through the Lucene IndexWriter. After the index has been built, I check the total number of documents that were indexed, and that number is less than 1.2M.
The documents are added as follows:
Directory fsDir = FSDirectory.open(this.indexLoc, NoLockFactory.INSTANCE);
IndexWriterConfig iwConf = new IndexWriterConfig(analyzer);
iwConf.setOpenMode(mode);
IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
int count=0;
FileInputStream input;
input = new FileInputStream(new File(String.valueOf(dir)));
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
InputStreamReader isr = new InputStreamReader(input,decoder);
BufferedReader reader = new BufferedReader(isr);
String line;
while ((line = reader.readLine()) != null) {
    // One document per input line, with a single tokenized, unstored field.
    Document d = new Document();
    d.add(new TextField(this.fieldName, line, Store.NO));
    indexWriter.addDocument(d);
    count++;
}
indexWriter.commit();
indexWriter.close();
reader.close(); // also closes the wrapped InputStreamReader and FileInputStream
I get the number of indexed documents as follows:
IndexReader reader = DirectoryReader.open(FSDirectory.open(this.indexLoc));
int docNum = reader.getDocCount(this.fieldName);
I traced the code and confirmed that all 1.2M lines are added as documents. Why, then, is docNum less than 1.2M?
When I test with a small input, say 1k lines, the two numbers match.
P.S. I'm using Lucene 5.0.
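One thing that may be worth checking: getDocCount(field) is documented to return the number of documents that have at least one term for that field, whereas IndexReader.numDocs() counts every live document. A line that is empty or contains only whitespace would produce no terms, so it could be added as a document and still be missing from docNum. The following is a minimal diagnostic sketch in plain Java (no Lucene), assuming the input is read with the same UTF-8 decoder settings as above. The class name LineStats is hypothetical, and the sketch only detects blank lines; it does not account for lines whose tokens are all removed by the analyzer (for example, stop words).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: counts the lines in an input
// and how many of them contain no non-whitespace characters.
public class LineStats {

    /** Returns a two-element array: {total lines, blank lines}. */
    public static long[] count(InputStream in) throws IOException {
        // Same decoding policy as the indexing code: drop malformed bytes.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE);
        long total = 0;
        long blank = 0;
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                total++;
                // A blank line cannot yield any term for the field.
                if (line.trim().isEmpty()) {
                    blank++;
                }
            }
        }
        return new long[] {total, blank};
    }
}
```

If total minus blank is close to docNum, blank lines would be a likely part of the difference; comparing docNum against reader.numDocs() would be another way to test the same hypothesis.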