
I have 50k machines, and each machine has a unique id. Every 10 seconds each machine sends a file to the machine_feed directory on an FTP server. Not all files are received at the same time.

Each machine creates a file named after its id. I need to process all received files. If a file is not processed within a short time, the machine's next file overwrites the existing one and I lose that data.

My solution is:

I have created a Spring Boot application containing a scheduler that executes every 1 millisecond; it renames each received file by appending the current date and time and copies it to a processing directory.
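
For reference, a rough sketch of what that scheduler might look like (the directory paths and timestamp format are placeholders, and @EnableScheduling is assumed on the application class):

    import java.io.IOException;
    import java.nio.file.*;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    // Sketch of the renaming/copying scheduler; paths and the timestamp format are placeholders.
    @Component
    public class FeedMover {

        private static final Path FEED = Paths.get("/ftp/machine_feed");
        private static final Path PROCESSING = Paths.get("/ftp/processing");
        private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

        @Scheduled(fixedDelay = 1)
        public void moveReceivedFiles() throws IOException {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(FEED)) {
                for (Path file : files) {
                    String stamped = file.getFileName() + "_" + LocalDateTime.now().format(STAMP);
                    // A plain copy is not atomic, so a downstream poller can see a half-written file.
                    Files.copy(file, PROCESSING.resolve(stamped), StandardCopyOption.REPLACE_EXISTING);
                    Files.delete(file);
                }
            }
        }
    }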

I have one more job, written in Apache Camel, that polls the processing location every 500 milliseconds, processes each file, and inserts the data into the DB. If an error occurs, it moves the file to an error directory.
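
A minimal Camel route along those lines might look like this (the endpoint URIs and the feedPersister bean are assumptions, not your actual code):

    import org.apache.camel.builder.RouteBuilder;

    // Sketch of the polling route; URIs and the persistence bean are placeholders.
    public class FeedRoute extends RouteBuilder {
        @Override
        public void configure() {
            from("file:/ftp/processing?delay=500"      // poll every 500 ms
                    + "&moveFailed=/ftp/error")        // failed exchanges go to the error dir
                .routeId("machine-feed")
                .to("bean:feedPersister");             // hypothetical bean that inserts the line into the DB
        }
    }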

The files are not big; each contains only one line of information.

The issue is that with fewer files it does a great job, but as the number of files increases, files end up in the error folder even though they are valid.

When Camel polls a file it finds a zero-length file, yet after that file is copied to the error directory it contains valid data. Somehow Camel is polling files that have not been copied completely.

Does anyone know a good solution for this problem?

Thanks in advance.

1
Maybe your spring-boot application should copy the file to another temporary dir first and, after the copy is done, move it to your processing dir. With this approach it should not happen that the file is written only partially, because a move is an atomic operation on most platforms and a copy isn't. Make sure to have your temporary and processing dir on the same drive to guarantee atomicity of the move - Bedla
Or, on the Camel side, use readLock=changed with readLockMinLength or readLockMinAge - Bedla
FTP isn't a messaging protocol and might not handle more than tens of files per second. I suggest using a messaging solution. - Peter Lawrey
Thanks for the reply. Will try these options. - Yashwantrao Raut
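
For what it's worth, the two suggestions above amount to something like the following sketch (same placeholder directories and bean as before; the readLock values are just examples to tune):

    import java.io.IOException;
    import java.nio.file.*;
    import org.apache.camel.builder.RouteBuilder;

    public class AtomicHandOff {

        // 1) Scheduler side: stage the copy in a temp dir on the same drive,
        //    then move it atomically into the processing dir.
        static void stageAndPublish(Path source, Path tempDir, Path processingDir, String stampedName)
                throws IOException {
            Path staged = Files.copy(source, tempDir.resolve(stampedName),
                    StandardCopyOption.REPLACE_EXISTING);
            Files.move(staged, processingDir.resolve(stampedName), StandardCopyOption.ATOMIC_MOVE);
        }

        // 2) Camel side: only acquire files whose size/timestamp has stopped changing.
        public static class GuardedFeedRoute extends RouteBuilder {
            @Override
            public void configure() {
                from("file:/ftp/processing?delay=500"
                        + "&readLock=changed"            // re-check the file before picking it up
                        + "&readLockMinAge=1000"         // skip files modified less than 1 s ago
                        + "&readLockCheckInterval=500"
                        + "&readLockTimeout=5000"
                        + "&moveFailed=/ftp/error")
                    .to("bean:feedPersister");           // same hypothetical persistence bean as above
            }
        }
    }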

1 Answer


I've faced a similar problem before but I used a slightly different set of tools...

I would recommend taking a look at Apache Flume - it is a lightweight Java process. This is what I used in my situation. The documentation is pretty decent, so you should be able to find your way, but here is a brief introduction anyway just to get you started.

Flume has 3 main components and each of these can be configured in various ways:

  1. Source - The component responsible for sourcing the data
  2. Channel - Buffer component
  3. Sink - This would represent the destination where the data needs to land

There are other optional components as well, such as the Interceptor, which is primarily useful for intercepting the flow and carrying out basic filtering, transformations, etc.

There is a wide variety of options to choose from for each of these, but if none of the available ones suits your use case, you can write your own component.

Now, for your situation, here are a couple of options I can think of:

  1. Since your file location needs almost continuous monitoring, you might want to use Flume's Spooling Directory Source, which would continuously watch your machine_feed directory and pick up each file as soon as it arrives (you could choose to alter the name yourself before the file gets overwritten).

So the idea is to pick up the file, hand it over to the processing directory, and then carry on with the processing in Apache Camel as you are already doing.

  2. The other option (and this is the one I would recommend considering) is to do everything in one Flume agent.

Your Flume set-up could look like this (a rough configuration sketch follows the list):

  • Spooling Directory Source
  • One of the interceptors (optional: for your processing before inserting the data into the DB; if none of the available options is suitable, you could even write your own custom interceptor)
  • One of the channels (a Memory channel, maybe)
  • Lastly, one of the sinks (This might just need to be a custom sink in your case for landing the data in a DB)
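
To make that concrete, a flume.conf for such an agent might look roughly like this (the agent/component names, paths, and the custom sink class are all placeholders):

    # Sketch of a single-agent pipeline: spooling directory source -> memory channel -> custom DB sink.
    agent1.sources  = feedSource
    agent1.channels = memCh
    agent1.sinks    = dbSink

    agent1.sources.feedSource.type     = spooldir
    agent1.sources.feedSource.spoolDir = /ftp/machine_feed
    agent1.sources.feedSource.channels = memCh

    agent1.channels.memCh.type     = memory
    agent1.channels.memCh.capacity = 10000

    # Hypothetical custom sink class that writes each event to the DB
    agent1.sinks.dbSink.type    = com.example.flume.DbSink
    agent1.sinks.dbSink.channel = memCh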

If you do need to write a custom component (an interceptor or a sink), you could just look at the source code of one of the default components for reference. Here's the link to the source code repository.
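
If you go down the custom sink route, the skeleton is roughly this (the class name and the DB insert are placeholders; error handling is trimmed for brevity):

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Channel;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.Transaction;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.sink.AbstractSink;

    // Rough skeleton of a custom Flume sink that lands each event in the DB.
    public class DbSink extends AbstractSink implements Configurable {

        @Override
        public void configure(Context context) {
            // read JDBC url/credentials from the agent configuration here
        }

        @Override
        public Status process() throws EventDeliveryException {
            Channel channel = getChannel();
            Transaction tx = channel.getTransaction();
            try {
                tx.begin();
                Event event = channel.take();
                if (event == null) {          // nothing buffered right now
                    tx.commit();
                    return Status.BACKOFF;
                }
                String line = new String(event.getBody(), StandardCharsets.UTF_8);
                // TODO: insert 'line' into the DB (e.g. via JDBC)
                tx.commit();
                return Status.READY;
            } catch (Exception e) {
                tx.rollback();
                return Status.BACKOFF;
            } finally {
                tx.close();
            }
        }
    }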

I understand that I've gone off on a slightly different tangent by suggesting a new tool altogether, but this worked magically for me, as Flume is a very lightweight tool with a fairly straightforward set-up and configuration.

I hope this helps.