0
votes

We are getting new files every day from apps as CSVs, which get stored on a Windows server at c:/program files(x86)/webapps/apachetomcat/.csv, each file containing different data. Is there a Hadoop component to transfer these files from the Windows server to Hadoop HDFS? I came across Flume and Kafka but could not find a proper example. Can anyone shed some light here?

Each file has a separate name and is up to 10-20 MB in size, and the daily file count is more than 200 files. Once the files land on the Windows server, Flume/Kafka should be able to put them into Hadoop. Later the files are read from HDFS, processed by Spark, and the processed files are moved to another folder in HDFS.

2
More details please, size of files? What are you hoping to do with this data? - AM_Hawk

2 Answers

1
vote

Flume is the best choice here. A Flume agent (a long-running process) needs to be configured, and it has three parts:

Flume source - where Flume looks for new files; c:/program files(x86)/webapps/apachetomcat/.csv in your case.

Flume sink - where Flume sends the files; your HDFS location in this case.

Flume channel - a temporary holding place for the data before it is delivered to the sink. A file channel is the right choice for your case.

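As a rough sketch (not a drop-in config), a spooling-directory agent could look like the one below. The agent, source, channel, and sink names (agent1, csvSource, fileChannel, hdfsSink), the checkpoint/data directories, and the namenode address are all placeholders; hdfs.path and the Windows folder would need to match your cluster and your Tomcat directory.

# agent1 watches the Tomcat folder and ships each file's contents to HDFS
agent1.sources = csvSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Spooling Directory source: picks up new files dropped into spoolDir
agent1.sources.csvSource.type = spooldir
agent1.sources.csvSource.spoolDir = c:/program files(x86)/webapps/apachetomcat
# only ingest .csv files (includePattern needs a recent Flume release)
agent1.sources.csvSource.includePattern = ^.*\.csv$
agent1.sources.csvSource.channels = fileChannel

# File channel: buffers events on local disk until the sink has written them
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = c:/flume/checkpoint
agent1.channels.fileChannel.dataDirs = c:/flume/data

# HDFS sink: writes the data into a dated landing directory in HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/incoming/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# roll a new HDFS file every 5 minutes (tune to taste)
agent1.sinks.hdfsSink.hdfs.rollInterval = 300
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0
agent1.sinks.hdfsSink.channel = fileChannel

The agent has to run on the Windows box so the source can see the local folder, started with something like flume-ng agent --conf conf --conf-file csv-to-hdfs.conf --name agent1 (file name assumed). Note that the spooling directory source expects files to be complete and immutable once they appear in the directory, and it renames each file after ingest (by default with a .COMPLETED suffix).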

0
votes

As per my comment, more details would help narrow down the possibilities. As a first thought: move the files to a server and just create a bash script scheduled with cron (a sketch follows the put reference below).

put

Usage: hdfs dfs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
Exit Code:

Returns 0 on success and -1 on error.
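To tie that together, a cron-driven script might look like the sketch below. It assumes the CSVs are first copied from the Windows server to a Linux edge node (for example with scp or rsync); the directory names, script path, and schedule are all placeholders.

#!/bin/bash
# Sketch: load new CSV files from a local staging directory into HDFS.
# LANDING_DIR, HDFS_DIR and ARCHIVE_DIR are assumed locations.
LANDING_DIR=/data/landing/csv
HDFS_DIR=/user/hadoop/incoming
ARCHIVE_DIR=/data/landing/archive

hdfs dfs -mkdir -p "$HDFS_DIR"

for f in "$LANDING_DIR"/*.csv; do
    [ -e "$f" ] || continue                # nothing to do if no files matched
    if hdfs dfs -put -f "$f" "$HDFS_DIR"/; then
        mv "$f" "$ARCHIVE_DIR"/            # archive locally only after a successful put
    else
        echo "put failed for $f" >&2
    fi
done

It could then be scheduled with a crontab entry such as the following (every 15 minutes; paths assumed):

*/15 * * * * /opt/scripts/load_csv_to_hdfs.sh >> /var/log/load_csv_to_hdfs.log 2>&1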