I am creating a Lucene index for values fetched from a database. I have set the index OpenMode to OpenMode.CREATE_OR_APPEND.
The index creation step is part of a Spring Batch job.
My understanding is that the first run of the job might take a while, but rerunning it on the same unchanged source data should be fast, because the documents are already there and no update or insert has to be performed.
In my case, however, subsequent indexing runs on the same unchanged source data get slower and slower.
An answer to this question says that it will be handled automatically based on a term.
I am not sure how to define the term in my case to handle this.
Below is my sample code:
    public Integer createIndex(IndexWriter writer, String str, LuceneIndexerInputVO luceneInputVO) throws Exception {
        Integer count = 0;
        Document d = null;
        txtFieldType.setTokenized(false);
        strFieldType.setTokenized(false);
        List<IndexVO> indexVO = null;
        indexVO = jdbcTemplate.query(Constants.SELECT_FROM_TABLE1,
                new Object[] {luceneInputVO.getId1(), luceneInputVO.getId2(), str},
                new IndexRowMapper());
        while (!indexVO.isEmpty()) {
            d = new Document();
            d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
            .....
            ....
            writer.addDocument(d);
            indexVO.remove(indexVO.get(count));
            count++;
        }
        return count;
    }
What should I change in the above code so that no indexing is performed when there is no change in the source data?
I am a beginner with Lucene and not sure how to define the Term that would decide whether a document is a duplicate.
I don't want the indices to be recreated, and I want a new Document to be skipped (nothing done) if exactly the same Document already exists in the index.
EDIT - I asked a long question, but after reading a few Lucene-related questions on SO, I realize that I am simply asking for an incremental indexing approach that avoids duplicates, given that each document represents a row of an RDBMS table with a primary key. If a DB row has changed, update its document; if it has not, leave the document alone; and add documents for new rows.
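The incremental approach described in the EDIT can be sketched roughly as follows, assuming each row has a primary key and its indexed column values can be joined into one string. The class `RowChangeDetector`, the `Action` enum, and the in-memory fingerprint map are hypothetical names introduced for illustration; a real batch job would need to persist the fingerprints between runs (for example in a database column). The sketch only decides what to do with a row and does not touch Lucene itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: decides whether a DB row needs (re)indexing.
class RowChangeDetector {

    enum Action { ADD, UPDATE, SKIP }

    // Primary key -> fingerprint of the row content seen on the last run.
    // Assumption: kept in memory here; a real job would persist this.
    private final Map<String, String> fingerprints = new HashMap<>();

    // Classifies a row and remembers its current fingerprint.
    Action classify(String primaryKey, String rowContent) {
        String current = sha256(rowContent);
        String previous = fingerprints.put(primaryKey, current);
        if (previous == null) {
            return Action.ADD;      // never seen this primary key
        }
        return previous.equals(current) ? Action.SKIP : Action.UPDATE;
    }

    // Hex-encoded SHA-256 of the given text.
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With Lucene, ADD and UPDATE could both map to `IndexWriter.updateDocument(new Term(idField, id), doc)`, which deletes any document containing that term and then adds the new one, while SKIP would make no writer call at all. This relies on the id field being indexed untokenized so the term matches exactly.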
id field identifies a document. If there is no unique id, then you need to check with your own logic before import. – user218867
id of Document. In this case, all the Fields combined identify a Document. I guess I need to define something like new Term(...,...) – Sabir Khan
SolrInputDocument, and set the id field via doc.setField("id", ...); – user218867
SolrJ but Lucene Core, `<dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>6.0.0</version> </dependency>` – Sabir Khan
Lucene, you can define a field id as the unique key, which should come from your data model, e.g. the primary key in your database. When updating a document, you need to delete the old doc by id and insert a new doc. I suggest you use SolrJ, which has features like delta import that would save a lot of work. – user218867
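The difference between always appending and the delete-by-id-then-insert approach from the last comment can be modelled without Lucene. The following is a toy sketch only: `ToyIndex` and its methods are hypothetical and stand in for the behaviour of `IndexWriter.addDocument` (which appends without checking for duplicates) and `IndexWriter.updateDocument` (which deletes documents matching a term, then adds the new one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an index; not Lucene code.
class ToyIndex {

    private final List<Map<String, String>> docs = new ArrayList<>();

    // Models addDocument: always appends, never checks for duplicates.
    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Models updateDocument: drop every doc whose field equals the
    // given value, then append the new doc.
    void update(String field, String value, Map<String, String> doc) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }
}
```

Running the same import twice through `add` doubles the number of documents, which is consistent with reruns over unchanged data getting slower; running it through `update` keyed on the id keeps one document per id.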