We have more than 50k files coming in every day that need to be processed. For that we have developed a POC app with the following design:

  1. A polling app continuously picks up files from the FTP zone.
  2. The file is validated and a metadata entry is created in a DB table.
  3. Another poller picks 10-20 files from the DB (only file id and status) and delivers them to the slave apps as messages (see the sketch after this list).
  4. The slave app takes the message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the DB/another file.
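
For step 3, the poller boils down to roughly the following sketch. It is simplified to plain JdbcTemplate/JmsTemplate calls rather than our actual Spring Integration wiring, and the table, queue, and state names are placeholders:

```java
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.scheduling.annotation.Scheduled;

public class FileDispatchPoller {

    private final JdbcTemplate jdbcTemplate;
    private final JmsTemplate jmsTemplate;

    public FileDispatchPoller(JdbcTemplate jdbcTemplate, JmsTemplate jmsTemplate) {
        this.jdbcTemplate = jdbcTemplate;
        this.jmsTemplate = jmsTemplate;
    }

    // Picks a small batch of validated files and sends one message per file.
    @Scheduled(fixedDelay = 10000)
    public void dispatchPendingFiles() {
        List<Long> fileIds = jdbcTemplate.queryForList(
                "SELECT id FROM file_metadata WHERE state = 'VALIDATED' " +
                "FETCH FIRST 20 ROWS ONLY", // row-limit syntax varies by database
                Long.class);
        for (Long fileId : fileIds) {
            jmsTemplate.convertAndSend("file.process.queue", fileId);
            jdbcTemplate.update(
                    "UPDATE file_metadata SET state = 'DISPATCHED' WHERE id = ?",
                    fileId);
        }
    }
}
```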

We used Spring Integration and Spring Batch for this POC.

Is it a good idea to launch a Spring Batch job in the slaves, or to implement the read, process, and write logic directly as plain Java or Spring bean objects?

I need some insight on launching this job, where each slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
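
A minimal sketch of one such MDP, assuming a job bean named fileProcessingJob (all names are placeholders):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class FileProcessingMdp {

    private final JobLauncher jobLauncher;
    private final Job fileProcessingJob; // placeholder job bean

    public FileProcessingMdp(JobLauncher jobLauncher, Job fileProcessingJob) {
        this.jobLauncher = jobLauncher;
        this.fileProcessingJob = fileProcessingJob;
    }

    // Invoked by the message listener container for each incoming message.
    public void handleMessage(long fileId) throws Exception {
        // The file id is the identifying job parameter, so a redelivered
        // message for a failed file restarts the same job instance
        // rather than creating a new one.
        jobLauncher.run(fileProcessingJob, new JobParametersBuilder()
                .addLong("fileId", fileId)
                .toJobParameters());
    }
}
```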

Note: each file will have at most 30-40 thousand records.

1 Answer

Generally, using Spring Integration and Spring Batch for such tasks is a good idea. This is what they are intended for.

With regard to Spring Batch, you get the whole retry, skip, and restart handling out of the box. Moreover, you get all the readers and writers that are optimised for bulk operations. This works very well, and you only have to concentrate on writing the appropriate mappers and similar glue code.
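
For instance, loading a delimited file into a table takes little more than the stock builders. A minimal sketch, assuming a simple Record bean with properties field1 and field2 and a records target table (all names are illustrative):

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class LoadStepComponents {

    @Bean
    public FlatFileItemReader<Record> recordReader() {
        return new FlatFileItemReaderBuilder<Record>()
                .name("recordReader")
                .resource(new FileSystemResource("/ftp/in/XY.txt"))
                .delimited()
                .names("field1", "field2")
                .targetType(Record.class) // map the tokens onto the Record bean
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<Record> recordWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Record>()
                .dataSource(dataSource)
                .sql("INSERT INTO records (field1, field2) VALUES (:field1, :field2)")
                .beanMapped() // bind the named parameters to Record properties
                .build();
    }
}
```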

If you use plain Java or Spring bean objects instead, you will probably end up developing such infrastructure code yourself, including all the effort needed for testing and so on.

Concerning your design: besides validating and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this (a wiring sketch follows the step outline):
step1:
tasklet to create an entry in the metadata table with columns like

  • FILE_TO_PROCESS: XY.txt
  • STATE: START_LOADING
  • DATE: ...
  • ATTEMPT: ... first attempt


step2:
chunk step: read and validate each line of the file and store it in a data table with columns like

  • DATA: ........
  • STATE: ...
  • FK_META_TABLE: foreign key to the metadata table


step3:
tasklet to update the metadata table entry

  • STATE: LOAD_COMPLETED
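
Wired together with the Spring Batch builders, the whole load job could look roughly like this (a sketch; the step, job, and collaborator names are made up):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FileLoadJobConfig {

    @Bean
    public Job fileLoadJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                           Tasklet createMetadataEntry, Tasklet markLoadCompleted,
                           ItemReader<Record> reader,
                           ItemProcessor<Record, Record> validator,
                           ItemWriter<Record> writer) {
        Step step1 = steps.get("createMetadataEntry")
                .tasklet(createMetadataEntry)  // insert the START_LOADING row
                .build();
        Step step2 = steps.get("loadAndValidate")
                .<Record, Record>chunk(1000)   // commit interval; tune for 30-40k records
                .reader(reader)                // one item per line of the file
                .processor(validator)          // per-record business validation
                .writer(writer)                // bulk insert into the data table
                .build();
        Step step3 = steps.get("markLoadCompleted")
                .tasklet(markLoadCompleted)    // set STATE = LOAD_COMPLETED
                .build();
        return jobs.get("fileLoadJob")
                .start(step1).next(step2).next(step3)
                .build();
    }
}
```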

So, as soon as your metadata table entry reaches the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing. If something fails, you can just fix the file and reload it.

Then, for the further processing, you could just have jobs which poll periodically and check whether there is new data in the database that should be processed. If more than one file has been loaded during the last period, simply process all files that are ready. You could even have several slave processes polling from time to time. Just do a SELECT ... FOR UPDATE on the state of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries.
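
The optimistic variant can be as simple as a guarded UPDATE: whichever slave's update affects exactly one row owns the file (a sketch; table and column names are illustrative):

```java
// Returns true only for the slave that won the race for this file.
boolean claim(org.springframework.jdbc.core.JdbcTemplate jdbcTemplate, long fileId) {
    int claimed = jdbcTemplate.update(
            "UPDATE file_metadata SET state = 'PROCESSING' " +
            "WHERE id = ? AND state = 'LOAD_COMPLETED'",
            fileId);
    return claimed == 1;
}
```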

With this solution, you don't need a messaging infrastructure, and you can still scale the whole application without any problems.