I am setting up a Spark cluster, with HDFS and Spark co-located on the same instances.
The current setup is:

- 1 master (Spark and HDFS)
- 6 workers, each running a Spark worker and an HDFS DataNode
All instances are identical: 16 GB RAM, dual core (unfortunately).
I have 3 more machines with the same specs, which gives me two options:

1. Deploy Elasticsearch on the 3 new machines only. The cluster would look like:
   - 1 master (Spark and HDFS)
   - 6 Spark worker / HDFS DataNode machines
   - 3 dedicated Elasticsearch nodes
2. Deploy an Elasticsearch master on 1 machine, and extend Spark, HDFS and Elasticsearch across all the rest:
   - 1 master (Spark and HDFS)
   - 1 Elasticsearch master
   - 8 machines each running a Spark worker, an HDFS DataNode and an Elasticsearch data node
My application makes heavy use of Spark for joins, ML, etc., but we are also looking for search capabilities. Search definitely does not need to be real-time; a refresh interval of up to 30 minutes is fine for us.
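Since near-real-time search is not needed, one thing I plan to set explicitly is the index refresh interval. A sketch of what that looks like (the index name `my_index` is a placeholder):

```shell
# Raise the refresh interval to 30 minutes (index name is hypothetical)
curl -X PUT "localhost:9200/my_index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30m"}}'

# During a bulk load, refresh can be disabled entirely with "-1",
# then restored to "30m" once the load finishes
curl -X PUT "localhost:9200/my_index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "-1"}}'
```

This is only a configuration fragment against a running cluster, not something tied to any of the deployment options above.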
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution does not have to be one of the above; I am open to experimenting if someone suggests something else. Once concluded, it would be handy for other devs too.
I am also trying the es-hadoop / es-spark project, but ingestion felt very slow with 3 dedicated ES nodes: around 0.6 million records/minute.
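For reference, this is roughly how I submit the indexing job and the es-hadoop write-side knobs I have been looking at. The node addresses, script name, and the specific values are placeholders; the defaults (1000 docs / 1 MB per bulk request per task) are fairly conservative, so I assumed raising them and disabling the per-write refresh might help throughput:

```shell
# Hypothetical spark-submit; es-hadoop settings are passed with the
# "spark." prefix so Spark accepts them on the command line
spark-submit \
  --conf spark.es.nodes=es-node-1,es-node-2,es-node-3 \
  --conf spark.es.batch.size.entries=5000 \
  --conf spark.es.batch.size.bytes=5mb \
  --conf spark.es.batch.write.refresh=false \
  my_indexing_job.py
```

I have not verified these values are optimal for 16 GB dual-core machines; they are just the first knobs I would experiment with.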