clojure: parallel processing using multiple computers

Question

i have 500 directories, and 1000 files (each about 3-4k lines) for each directory. i want to run the same clojure program (already written) on each of these files. i have 4 octa-core servers. what is a good way to distribute the processes across these cores? cascalog (hadoop + clojure)?

basically, the program reads a file, uses a 3rd party Java jar to do computations, and inserts the results into a DB

note that: 1. being able to use 3rd party libraries/jar is mandatory 2. there is no querying of any sorts

till now i had been processing a single directory at a time on 1 server using "pmap". but to process 500 directories, i need to scale horizontally — Pradnyesh Sawant
@shawn-zhang thanks for the suggestion, i'll look it up. however, i have never done "big-data" processing and don't know many options other than the most popular ones like hadoop, or found from basic search like cascalog, etc — Pradnyesh Sawant

Arthur Ulfeldt Arthur Ulfeldt · Accepted Answer · 2015-01-02T21:23:34

Because there is no "reduce" stage to your overall process as I understand it, it makes sense to put 125 of the directories on each server and then spend the rest you time trying to make this program process them faster. Up to the point where you saturate the DB of course.

Most of the "big-data" tools available (Hadoop, Storm) focus on processes that need both very powerful map and reduce operations, with perhaps multiple stages of each. Your case all you really need is a decent way to keep track of which jobs passed and which didn't. I'm as bad as anyone (and worse than many) at predicting development times, though in this case I'd say it would an even chance that rewriting your process on one of the map-reduce-esque tools will take longer than adding a monitoring process to keep track of which jobs finished and which failed so you can rerun the failed ones later (preferably automatically).

clojure: parallel processing using multiple computers

2 Answers