
I am trying to import all of our orders data into Elasticsearch using the Logstash JDBC input. The SQL query joins across 7 tables, no index can be used after the join, and the tables have one-to-many relationships. The join produces around 3M rows.

I want to import the order data day by day, since each day has fewer records, instead of running the whole query against all orders to date and paginating the result. When I query the whole dataset, the database gets stuck just counting the number of records.

How can I run the import day after day, with the SQL date parameters configured dynamically, without stopping Logstash?

This is my current Logstash job configuration:

input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://dbhostname:3306/mydatabase?zeroDateTimeBehavior=CONVERT_TO_NULL"
    jdbc_user => "username"
    jdbc_password => "password"
    jdbc_driver_library => "/home/username/mysql-connector-java-8.0.11.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "select * from table1 t1, table2 t2, table3 t3, table4 t4, table5 t5, table6 t6 where some_condition and some_other_condition and t1.created_at between 'date1_start' and 'date1_end'"
  }
}
output {
  stdout { codec => json_lines }
  elasticsearch {
    hosts => "localhost:9200"
    index => "orderdata_prod"
    document_type => "data"
  }
}

I want to dynamically substitute new date values into the statement in the configuration. How can I achieve that? Or is there an alternate way to import such data?
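For reference, the Logstash JDBC input plugin does support scheduled, incremental runs via a persisted tracking value: on each scheduled run, `:sql_last_value` is replaced with the last tracked value (stored in `last_run_metadata_path`), so only newer rows are fetched. A minimal sketch using the plugin's documented options (the table and column names here are hypothetical placeholders, and the join condition would need to match the real schema):

    input {
      jdbc {
        jdbc_connection_string => "jdbc:mysql://dbhostname:3306/mydatabase"
        jdbc_user => "username"
        jdbc_password => "password"
        jdbc_driver_library => "/home/username/mysql-connector-java-8.0.11.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        # Run once a day at 01:00; cron-style schedule
        schedule => "0 1 * * *"
        # Track the created_at column instead of the run timestamp
        use_column_value => true
        tracking_column => "created_at"
        tracking_column_type => "timestamp"
        last_run_metadata_path => "/home/username/.logstash_jdbc_last_run"
        # :sql_last_value is the highest created_at seen on the previous run
        statement => "select * from table1 t1 where t1.created_at > :sql_last_value"
      }
    }

This keeps Logstash running continuously while each scheduled execution only pulls rows newer than the previous run, which is the usual way to avoid hand-editing date parameters in the statement.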

Is it so problematic to download the entire history at once? I had to download over 20M records and it took Logstash a whole day, but it eventually downloaded them all. Downloading the data day by day might take even more time. - Michael Dz
@MichaelDz I am worried that the server would crash under the data load. That's why I thought it would be good to run it day by day; it could be processed faster. We have now increased the memory of the instance and started importing the data with the single query. But I would still like to know whether it is possible at all. - BarathVutukuri

1 Answer


I guess you are worried about a potential crash of a production DB, but you can mitigate that risk with Logstash's paging, which breaks a single query into multiple smaller queries. You should still run the import in one go, because ordering between the paged queries is not guaranteed.

If you want to limit each query to 100,000 rows, add these two parameters to your Logstash conf file:

jdbc_paging_enabled => true
jdbc_page_size => 100000
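To see why this reduces load: with paging enabled, the plugin wraps your original statement in a subquery and pages through it with LIMIT/OFFSET, so the database never has to materialize the full 3M-row result at once. Roughly (the exact SQL the plugin generates may differ by version and dialect), the third page of the example above would look like:

    SELECT * FROM (
      -- your original 7-table join statement goes here, unchanged
      select * from table1 t1, table2 t2, ... where some_condition ...
    ) AS subquery
    LIMIT 100000 OFFSET 200000

Note that each page re-executes the (expensive) join, which is why large offsets get slower toward the end of the result set; it trades total runtime for lower peak memory on both sides.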

You can also try lowering the fetch size relative to the database's default, but be aware that this parameter is only a hint to the JDBC driver and might be ignored. Oracle, for example, has a default fetch size of 10, so you could lower it to, say, 5:

jdbc_fetch_size => 5