I have 1.2M lines to index, and each line is added as a separate document through the Lucene IndexWriter. After the index has been built, I try to assert the total number of records that have been indexed, but this number is less than 1.2M.

The documents are added as follows:

    Directory fsDir = FSDirectory.open(this.indexLoc, NoLockFactory.INSTANCE);
    IndexWriterConfig iwConf = new IndexWriterConfig(analyzer);
    iwConf.setOpenMode(mode);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
    int count=0;
    FileInputStream input;
    input = new FileInputStream(new File(String.valueOf(dir)));
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    InputStreamReader isr = new InputStreamReader(input,decoder);
    BufferedReader reader = new BufferedReader(isr);
    StringBuilder content = new StringBuilder();
    String line;

    while ((line = reader.readLine()) != null) {
        Document d = new Document();
        d.add(new TextField(this.fieldName, line, Store.NO));
        indexWriter.addDocument(d);
        count++;
    }

    indexWriter.commit();
    indexWriter.close();
    reader.close();
    isr.close();
    input.close();

I get the number of indexed documents as follows:

    IndexReader reader = DirectoryReader.open(FSDirectory.open(this.indexLoc));
    int docNum = reader.getDocCount(this.fieldName);

I traced the code and confirmed that all 1.2M lines were passed to addDocument. So why is the value of docNum less than 1.2M?

When I test with a small input, say 1k lines, the two numbers match.

P.S. I'm using Lucene 5.0.

1 Answer

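IndexReader.getDocCount(String field) returns the number of documents that have at least one term for this field. So if a line is empty, addDocument still adds the document to the index, but the document has no terms for the field and is therefore not counted by getDocCount. To count every document regardless of field content, use IndexReader.numDocs() instead.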
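
One way to check this explanation against your input is to count the blank lines in the source file and compare that count with the gap between 1.2M and docNum. The following is a minimal sketch using only the JDK; the class name `BlankLineCheck` and its `count` helper are made up for illustration. It assumes a line yields no terms only when it is empty or whitespace-only. Depending on the analyzer, lines containing only punctuation or stop words may also produce no terms, so the real gap could be larger.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BlankLineCheck {

    /** Holds the total number of lines and how many of them are blank. */
    public static final class Counts {
        public final long total;
        public final long blank;

        Counts(long total, long blank) {
            this.total = total;
            this.blank = blank;
        }
    }

    /** Reads every line and counts those that are empty or whitespace-only. */
    public static Counts count(Reader in) throws IOException {
        long total = 0;
        long blank = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                total++;
                // A line with no visible characters gives the analyzer nothing to tokenize.
                if (line.trim().isEmpty()) {
                    blank++;
                }
            }
        }
        return new Counts(total, blank);
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory sample; replace with a reader over the real input file.
        Counts c = count(new StringReader("first line\n\nsecond line\n   \n"));
        System.out.println("total=" + c.total + " blank=" + c.blank
                + " nonBlank=" + (c.total - c.blank));
    }
}
```

If the non-blank count is close to the value returned by getDocCount while the total matches numDocs(), empty lines are the likely cause of the difference.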