Elasticsearch data grows and is duplicated at each restart

Question

I'm using Elasticsearch with AngularJS and Oracle on Windows 7, and it is working better and better (thanks to help from Stack Overflow). I have one problem with Elasticsearch: the number of documents in my index keeps increasing and I don't know why or how. The Oracle table indexed by Elasticsearch contains 12010 rows, but the index now holds 84070 documents (checked regularly with curl _count), so the data has been duplicated 7 times. I re-indexed the table a few days ago, after first deleting the Elasticsearch "data" folder.

The data seems to increase each time I restart Windows.

Thanks for your help.

This is how I install Elasticsearch and index my data:

I do this only the first time:

unzip Elasticsearch into the folder: D:\work\elasticsearch-1.3.1\
install the web interface: >plugin -install mobz/elasticsearch-head
install the JDBC plugin: >plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.0.0/elasticsearch-river-jdbc-1.3.0.0-plugin.zip
copy "ojdbc6-11.2.0.3.jar" to "D:\work\elasticsearch-1.3.1\plugins\jdbc"
service.bat install
service.bat start

creating index

curl -XPOST 'localhost:9200/donnees'

mapping :

curl -XPUT 'localhost:9200/donnees/specimens/_mapping' -d '{
"specimens" : {
    "_all" : {"enabled" : true},
    "_index" : {"enabled" : true},
    "_id" : {"index": "not_analyzed", "store" : false},
    "properties" : {
        "O_OCCURRENCEID"                                : {"type" : "string",   "store" : "no","index": "not_analyzed"  } ,
            .... 
        "I_INSTITUTIONCODE"                             : {"type" : "string",   "store" : "yes","index": "analyzed" } 
    }
}}'

query oracle and index data :

curl -XPUT 'localhost:9200/_river/donnees_s/_meta' -d '{
 "type" : "jdbc",
 "jdbc" : {
    "index" : "donnees",
    "type" : "specimens",
    "url" : "jdbc:oracle:thin:@localhost:1523:recolnat",
     "user" : "user",
     "password" : "password",
     "sql" : "select * from all_specimens_data"
   }
}'

(Is this correct? It doesn't work if I replace "curl -XPUT 'localhost:9200/_river/donnees_s/_meta'" with "curl -XPUT 'localhost:9200/donnees/specimens/_meta'", which is the path I use for queries.)

test :

curl -XGET 'http://localhost:9200/donnees/specimens/_count?q=*'
    => 12010
curl -XGET 'http://localhost:9200/donnees/specimens/_search?q=P00009359'
    => return data ok
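
A count alone does not show whether the extra documents are copies of the same rows. One way to check is a terms aggregation on the unique column that only returns values occurring at least twice. This is a sketch, not part of the original post: it assumes O_OCCURRENCEID is mapped as not_analyzed (as in the mapping above), the aggregation name duplicated_ids and the variable DUP_QUERY are made up for illustration, and the fallback message is only there so the command fails visibly when the node is down.

```shell
# Ask for O_OCCURRENCEID values that appear in 2 or more documents.
# "duplicated_ids" is an arbitrary aggregation name chosen for this sketch.
DUP_QUERY='{
  "aggs" : {
    "duplicated_ids" : {
      "terms" : { "field" : "O_OCCURRENCEID", "min_doc_count" : 2, "size" : 10 }
    }
  }
}'

# search_type=count skips the hits and returns only the aggregation buckets.
curl -XGET 'localhost:9200/donnees/specimens/_search?search_type=count&pretty' -d "$DUP_QUERY" \
  || echo 'Elasticsearch is not reachable on localhost:9200' >&2
```

If any buckets come back, their doc_count shows how many copies of that row are in the index.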

Shouldn't you have an '_id' column in your select for proper identification of already loaded rows? — Konstantin V. Salikhov
The column "O_OCCURRENCEID" is a unique ID in my database table. Are you talking about this line in the mapping: "_id" : {"index": "not_analyzed", "store" : false} ? — AlainIb
I mean something like select O_OCCURRENCEID as _id ....blahblah... — Konstantin V. Salikhov
No, I don't, but when I query Elasticsearch there is one added by Elasticsearch: pastebin.com/zXNay7sX — AlainIb
It generates a new _id field value each time you load your data, so the JDBC river can't determine which records it has already loaded — Konstantin V. Salikhov

AlainIb · Accepted Answer · 2014-10-06T06:58:42

Resolved, thanks to Konstantin V. Salikhov.

Each time the Elasticsearch service starts, it queries the database with the SQL given to the river and fetches the data (see the river definition in my question). If the rows have no "_id" column, the river can't determine which records it has already loaded, a new _id is generated for every row, and the data is duplicated on every start. To avoid the duplicates, I edited "all_specimens_data" in the database (it is in fact a view, which avoids modifying the underlying tables) and renamed the column "O_OCCURRENCEID" to "_id". "O_OCCURRENCEID" is my primary key, a UUID.
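
The alias can probably also be set in the river's SQL instead of in the view, as Konstantin suggested in the comments. The following is only a sketch under a few assumptions: the view still exposes the column as O_OCCURRENCEID, the alias is double-quoted because Oracle does not accept an unquoted identifier that starts with an underscore, and the table alias t, the variable RIVER_DEF and the fallback message are made up for illustration. The river name and connection settings are the ones from the question.

```shell
# River definition that exposes the primary key as "_id", so that reloading
# the same row overwrites the existing document instead of adding a new one.
# The inner quotes are escaped (\") because the SQL sits inside a JSON string.
RIVER_DEF='{
 "type" : "jdbc",
 "jdbc" : {
    "index" : "donnees",
    "type" : "specimens",
    "url" : "jdbc:oracle:thin:@localhost:1523:recolnat",
    "user" : "user",
    "password" : "password",
    "sql" : "select t.O_OCCURRENCEID as \"_id\", t.* from all_specimens_data t"
   }
}'

# Register the river under the same name as before.
curl -XPUT 'localhost:9200/_river/donnees_s/_meta' -d "$RIVER_DEF" \
  || echo 'Elasticsearch is not reachable on localhost:9200' >&2
```

Documents that were already indexed with generated ids are not removed by this change, so the index has to be emptied or recreated once before the data is loaded again.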

Hope this helps others.
