0 votes

I'm new to Camel, and the lack of similar questions online leads me to believe I'm doing something silly. I am using Camel 2.12.1 components to parse large CSV files, both from local directories and by downloading them over SFTP. I've found that

split(body().tokenize("\n")).streaming().unmarshal().csv()

works for local files (Windows 7); I get multiple exchanges, each carrying a

List<List<String>>

for one line of the CSV file. But when I use the same route syntax with the SFTP component (connecting to a Linux server to download the files), I get a single exchange containing a single line that reads like the output of "ls":

-rwxrwxrwx 1 userName userName 83400 Dec 16 14:11 fileName.csv

Through trial and error, I found that

split(body()).streaming().unmarshal().csv()

with the SFTP component will correctly load and parse the file, but it doesn't do so in streaming mode: it loads the entire file into memory before unmarshalling it into a single exchange.

I found a similar bug report (https://issues.apache.org/jira/browse/CAMEL-6231) from Camel 2.10, which Claus closed as Invalid, indicating the reporter was using threads and parallel processing with streaming incorrectly; but I'm not configuring either of those capabilities.

The SFTP stanza I'm using is:

sftp://192.168.1.1?fileName=fileName.csv&username=userName&password=secret!&idempotent=true&localWorkDirectory=tmp

The file stanza is:

"file:test/data?noop=true&amp;fileName=fileName.csv"

Anyone have an idea what I'm doing wrong?

I don't know how to properly fix this. We work around it by downloading (FTP, HTTP) all file resources into a buffer directory and then using the file component to stream, split, and parallel-process the contents. Starting from the file route also means it does not matter whether we poll for a particular file or it is pushed to us. - Ralf
Thanks Ralf, glad to know someone else is having a similar issue. It crossed my mind to go the route you've taken, but I'd like to better understand why it's happening first. - Sinsanator
Another reason it might be beneficial to decouple downloading and processing is that the FTP component is single-threaded. If you decouple them, you can start processing while your next file is being downloaded. - Ralf
That is a valid observation, and to be fair, my posted code uses localWorkDirectory; but for my real use case I'm dealing with many very large log files, so streaming them from a remote location instead of copying them to a local directory would lessen my disk storage requirements, and I could always scale up by kicking off more SFTP sessions. - Sinsanator
Yet another suggestion for a workaround... Could you mount the remote site via SSH File System and then use the file2 component to stream the log files? I have no experience with SSHFS, let alone in combination with the file component. So I am not sure what mileage you could get from that. - Ralf

2 Answers

0 votes

Make an intermediate route to solve the problem:

    <route id="StagingFtpFileCopy">
        <from uri="ftp://{{uriFtpPath}}"/>
        <to uri="file://data/staging"/>
    </route>
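
To complete the workaround, a second route (a sketch; the staging directory, delete option, and log endpoint are assumptions) can then consume from the staging directory and do the streaming split locally:

    <route id="StagingFileProcess">
        <!-- consume the file copied by the route above; delete it after processing -->
        <from uri="file://data/staging?delete=true"/>
        <!-- streaming split per line, then CSV unmarshalling, as in the question -->
        <split streaming="true">
            <tokenize token="\n"/>
            <unmarshal>
                <csv/>
            </unmarshal>
            <to uri="log:csvLine"/>
        </split>
    </route>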
0 votes

I faced the same issue with SFTP (Camel 2.25.0). However, before resorting to two different routes (as proposed by others), I used the URL below

sftp://:22/?username=random&password=random&delay=2000&move=archive&readLock=changed&bridgeErrorHandler=true&recursive=false&disconnect=true&stepwise=false&streamDownload=true&localWorkDirectory=C:/temp

with the route definition below:

from("sftp url").split().tokenize("\n", 10, true).streaming().to("log:out")

As this route also downloads the remote file locally (same as the two-route option) and then treats the local file with normal streaming (which, as Sinsanator mentioned, works perfectly with the file component), the memory footprint shows a true saw-tooth pattern while downloading (up to 100 MB) and then rises to about 150 MB during processing, again with a roughly saw-tooth shape.

One advantage (in my view) of this approach is that completion-related tasks (e.g. moving the remote file to another directory) can be driven by actual processing completion, which is not possible automatically if we break the route in two. Also, since the download is managed by Camel, the local file gets deleted automatically when processing completes.