I am creating a Lucene index for values fetched from a database. I have set the index OpenMode to OpenMode.CREATE_OR_APPEND.

The index creation step is part of a Spring Batch job.

My understanding is that the first run of the job might take a while, but a rerun over the same unchanged source data should be fast: the documents are already there, so no update or insert has to be performed.

In my case, however, subsequent indexing runs over the same unchanged source data get slower and slower.

The answer to this question says that it will be handled automatically based on a term.

I am not sure how to define the term in my case to handle this.

Below is my sample code:

        public Integer createIndex(IndexWriter writer, String str, LuceneIndexerInputVO luceneInputVO) throws Exception {
            Integer count = 0;
            Document d = null;
            txtFieldType.setTokenized(false);
            strFieldType.setTokenized(false);

            List<IndexVO> indexVO = null;

            indexVO = jdbcTemplate.
                    query(Constants.SELECT_FROM_TABLE1, 
                            new Object[] {luceneInputVO.getId1(), luceneInputVO.getId2(), str}, 
                            new IndexRowMapper());

            // Visit each fetched row once; count is the number of rows processed so far
            for (IndexVO vo : indexVO) {
                d = new Document();
                d.add(getStringField(Constants.ID, String.valueOf(luceneInputVO.getId())));
                .....
                ....
                writer.addDocument(d);
                count++;
            }
            return count;
        }

What should I change in the above code so that nothing is indexed when the source data has not changed?

I am a beginner with Lucene and not sure how to define the Term that would decide whether a document is a duplicate.

I don't want the index to be recreated, and I want a new Document to be skipped (do nothing) if exactly the same Document already exists in the index.

EDIT - I asked a long question, but after reading a few Lucene-related questions on SO, I realize that I am simply asking for an incremental indexing approach that avoids duplicates, where each document represents a row of an RDBMS table with a primary key. If a DB row has changed, update its document; otherwise leave it alone; and add documents for new rows.

Question 1, Question 2

The id field identifies a document. If there is no unique id, then you need to check with your own logic before import. - user218867
How do I code the id of a Document? In this case, all the fields combined identify a Document. I guess I need to define something like new Term(...,...). - Sabir Khan
No, if you are using SolrJ, to specify the id of a document, just create a SolrInputDocument and set the id field via doc.setField("id", ...); - user218867
I am not using SolrJ but Lucene Core: `<dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>6.0.0</version> </dependency>` - Sabir Khan
With bare Lucene, you can define an id field as the unique key, which should come from your data model, e.g. the primary key in your database. To update a document, you need to delete the old doc by id and insert a new doc. I suggest you use SolrJ, which has features like delta import that would save a lot of work. - user218867

1 Answer

I have verified that in Lucene 6.0.0, IndexWriter.updateDocument(Term term, Document doc) adds a new Document if no matching document exists, and replaces the existing Document if one is found for the given term.

For my requirement, I defined a key field which is basically a concatenation of all the other value fields of the Document. This way the key identifies content-wise duplicates: two documents with the same key have the same content.
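A minimal sketch of how such a key could be built, using only the JDK. The class name ContentKey and both method names are hypothetical and not part of my actual code. The length prefix is an assumption I add so that ("ab", "c") and ("a", "bc") do not collapse into the same key, and the SHA-256 variant is an optional substitute for the plain concatenation when rows are long, since Lucene rejects indexed terms longer than 32766 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public final class ContentKey {

    /** Concatenates the field values into one unambiguous key string. */
    static String concat(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            // Note: null and "" are treated as the same value here.
            String s = (v == null) ? "" : v;
            // The length prefix keeps the key unambiguous even if a value contains '|'.
            sb.append(s.length()).append(':').append(s).append('|');
        }
        return sb.toString();
    }

    /** Optional: fixed-length SHA-256 hex digest of the concatenated key. */
    static String digest(List<String> values) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(concat(values).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Either string can then be stored in the key field, as long as the same function is used on every run.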

I construct the Term passed to IndexWriter.updateDocument(Term term, Document doc) from this key value. Simply calling IndexWriter.updateDocument(Term term, Document doc) instead of IndexWriter.addDocument(Document doc) solves the issue.
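Roughly, the change looks like the sketch below. It assumes lucene-core 6.x on the classpath; the class name UpsertSketch, the method upsert and the field name contentKey are hypothetical names chosen for illustration. The key is stored as a StringField so that it is indexed as a single un-tokenized term and the Term matches it exactly.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public final class UpsertSketch {

    // Hypothetical name for the de-duplication key field.
    static final String KEY_FIELD = "contentKey";

    /**
     * Adds doc if no indexed document carries the same key; otherwise
     * deletes the matching document(s) and adds doc in their place.
     */
    static void upsert(IndexWriter writer, String key, Document doc) throws IOException {
        // StringField is indexed verbatim as one term, never tokenized.
        doc.add(new StringField(KEY_FIELD, key, Field.Store.YES));
        // Replaces writer.addDocument(doc), which always appends.
        writer.updateDocument(new Term(KEY_FIELD, key), doc);
    }
}
```

Note that updateDocument is a delete followed by an add, so an unchanged row is still rewritten rather than skipped. What it prevents is the index accumulating one more copy of every document on each rerun.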