
I use the ELK stack to parse CSV files with Logstash and send them to Elasticsearch.

Unfortunately, I have a problem:

When I drop my files into the directory watched by the "input" of my Logstash pipeline, the records are duplicated, or even tripled, without my asking for anything ...

This is what my pipeline looks like:

input {
  file {
    path => "/home/XXX/report/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ";"
    columns => ["Name", "Status", "Category", "Type", "EndPoint", "Group", "Policy", "Scanned At", "Reported At", "Affected Application"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "malwarebytes-report"
  }
  stdout {}
}
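
For reference, sincedb_path => "/dev/null" throws away the file input's bookkeeping of how far each file has been read, so positions are forgotten whenever the pipeline restarts. A variant that persists read positions across restarts would look like the sketch below (the sincedb location is only an illustration):

input {
  file {
    path => "/home/XXX/report/*.csv"
    start_position => "beginning"
    # illustrative path: keeps read offsets so files are not re-read after a restart
    sincedb_path => "/var/lib/logstash/sincedb-report"
  }
}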

When I send my first file containing 28 records to "/home/XXX/report/", this is what Elasticsearch says:

[root@lrtstfpe1 conf.d]# curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open malwarebytes-report PO4g6rKRTb6yuMDb7i-6sg 5 1 28 0 25.3kb 25.3kb

So far so good, but when I send my second file containing 150 records ...:

[root@lrtstfpe1 conf.d]# curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open malwarebytes-report PO4g6rKRTb6yuMDb7i-6sg 5 1 328 0 263.3kb 263.3kb

The 150 records have been doubled and added to the first 28 (28 + 150 × 2 = 328) ...

What's going on??

I've been stuck on this problem for several days; I really need your help.

UPDATE:

It was suggested that I look in /etc/logstash/conf.d and see if there are any other config files there.

The thing is, I only have one pipeline in this folder ... So:

I just completely uninstalled the ELK stack (rpm -e elasticsearch kibana logstash filebeat) as well as every trace of it (rm -rf /var/lib/ELK/ /var/log/ELK/ /etc/default/ELK /usr/share/ELK ...). So, nothing is left anywhere.

I just reinstall everything:

rpm -ivh elasticsearch-6.2.3.rpm
rpm -ivh kibana-6.2.3-x86_64.rpm
rpm -ivh logstash-6.2.3.rpm

And start the services: service ELK restart
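
Spelled out per service (the RPMs install systemd units, so the exact invocation may vary), that would presumably be something like:

systemctl restart elasticsearch
systemctl restart kibana
systemctl restart logstash

Note that if Logstash is started as a service here, it loads every pipeline under /etc/logstash/conf.d on its own, in addition to any instance launched manually with bin/logstash.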

Then, regarding configuration: /etc/elasticsearch.yml, /etc/kibana.yml, and /etc/logstash.yml are all completely default.

Then, I put my one and ONLY pipeline, named "pip.conf", in /etc/logstash/conf.d/. Its configuration:

input {
  file {
    path => "/home/report/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ";"
    columns => ["Name","Status","Category","Type","EndPoint","Group","Policy","Scanned At","Reported At","Affected Application"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "malwarebytes-report"
  }
  stdout {}
}

And finally, I launch my pipeline: I go into /usr/share/logstash and execute:

bin/logstash -f /etc/logstash/conf.d/pip.conf

After a few seconds, my pipeline is listening, and now I put file1.csv and file2.csv into /home/report/.

file1.csv contains 28 records and file2.csv contains 150 records.

But now, when I check my index with curl -XGET 'localhost:9200/_cat/indices?v&pretty', my index "malwarebytes-report" contains 357 records ... (150x2 + 28x2 ...)

I don't understand ANYTHING ....

The problem is probably that your sincedb_path is set to /dev/null, so Logstash might re-read all files every time. Can you try setting sincedb_path to something meaningful so it can remember what it has already read? Are you running Logstash several times? - Val
Hello! I deleted the sincedb_path line to fall back to its default value, and the behaviour is the same... I don't run Logstash several times. I launch my pipeline and let it listen to /home/XXX/*.csv, and during this time I send my files into this path. - Wrest
Are you sure you don't have multiple Logstash processes running at the same time (one that didn't stop or get killed properly)? Can you verify that first? - Val
I can't have multiple Logstash processes at the same time, but yes, I'm sure: only one is running, see this picture: ibb.co/mmBx6c - Wrest
So now, when I send a single file with 150 records, one index is created, but with 272 records... absolutely insane... - Wrest
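
For reference, a quick way to check Val's hypothesis that more than one Logstash instance is reading the same files would be something like:

ps -ef | grep '[l]ogstash'   # the [l] keeps the grep itself out of the listing
systemctl status logstash    # is the Logstash service active alongside a manual bin/logstash run?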

1 Answer


If you are able to use tools other than Logstash to load files into Elasticsearch, you can use elasticsearch-loader.

I'm the author of moshe/elasticsearch_loader. I wrote ESL for this exact problem.

You can install it with pip:

pip install elasticsearch-loader

Then you will be able to load CSV files into Elasticsearch by issuing:

elasticsearch_loader --index incidents --type incident csv file1.csv
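
Applied to the files from the question, which are semicolon-separated, the invocation would presumably look like the line below; the delimiter option name is an assumption here, so verify it against elasticsearch_loader csv --help:

elasticsearch_loader --index malwarebytes-report --type incident csv --delimiter=';' file1.csv file2.csv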