Preface
I have scripted and automated the creation of individual .ktr files that handle the extraction and syncing of data between the Source (MySQL) and Target (Infobright) databases. One .ktr file is created for each table.
I have a set of 2 Jobs and 2 Transformations that make up a "run": they find the data-sync .ktr files and queue them for execution.
Job 1 (entry point)
- Run a Transformation that searches the target directory for files matching a wildcard passed in from the command line (see the launch sketch after these job outlines)
- For every result row, run Job 2 (file looper)
- After the run is done, do some error checking, send notification mail, and close out
Job 2 (file looper)
- Run a Transformation that takes the result row and populates a variable with the filename
- Run the ${filename} Transformation to perform the sync between MySQL and Infobright
- Perform some error checking, populate an error log, etc.; standard graceful failures and error logging
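For context, launching a run looks roughly like this. The paths and the FILE_WILDCARD parameter name below are illustrative placeholders rather than my exact setup; the -file and -param:NAME=VALUE flags are kitchen's standard options for pointing at a job and passing a named parameter:

```python
import subprocess

# Illustrative launcher (paths and parameter name are placeholders).
# kitchen's -param:NAME=VALUE syntax passes the wildcard down to
# Job 1, which hands it to the directory-scanning Transformation.
result = subprocess.run(
    [
        "/opt/pentaho/data-integration/kitchen.sh",
        "-file=/etc/pdi/jobs/job1_entry_point.kjb",
        "-param:FILE_WILDCARD=sync_.*\\.ktr",
        "-level=Basic",
    ],
    check=False,
)
print("kitchen exit code:", result.returncode)
```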
This all works perfectly. I can queue up 250+ .ktr files in my target directory, and kitchen gets through them in about 9-15 minutes, depending on the volume of data to sync.
Problem
Pentaho doesn't appear to support parallelizing this kind of abstract, looped execution of transformations: Jobs don't support output distribution the way Transformations do. I've checked the Pentaho support forums and posted there, with no response.
I'm looking to get 4 or 5 parallel threads going, each executing one of the queued results (gathered filenames). I'm hoping somebody here can provide some insight into how I can achieve this, aside from manually globbing files with filename tags and running the kitchen job 5 times, passing the filename tags in as a parameter.
(This doesn't really address the output result distribution issue, as it just runs 5 separate sequential jobs and doesn't actually distribute the workload; a rough sketch of that workaround is below.)
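To make that workaround concrete, here's roughly what I mean. The paths, the tag-based file naming scheme, and the parameter name are all hypothetical, made up for illustration:

```python
import subprocess

# Hypothetical sketch of the workaround: statically split the .ktr
# files into 5 buckets by a filename tag (e.g. sync_tag0_customers.ktr)
# and launch one kitchen process per bucket. Paths, the tag scheme,
# and the parameter name are placeholders.
KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"
JOB = "/etc/pdi/jobs/job1_entry_point.kjb"

procs = [
    subprocess.Popen([
        KITCHEN,
        f"-file={JOB}",
        # Each invocation only picks up files pre-tagged for it.
        f"-param:FILE_WILDCARD=sync_tag{tag}_.*\\.ktr",
    ])
    for tag in range(5)
]

for p in procs:
    p.wait()

# Because the buckets are fixed up front, whichever bucket holds the
# heaviest tables becomes the long pole; the workload isn't actually
# being distributed across the 5 processes.
```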
EDIT: Here's the post on the Pentaho forums, with images that might help illustrate what I'm talking about: http://forums.pentaho.com/showthread.php?162115-Parallelizing-looped-job-step
Cheers