0
votes

We are getting new files every day from apps as CSVs, which get stored on a Windows server at c:/program files(x86)/webapps/apachetomcat/.csv, each file containing different data. Is there a Hadoop component to transfer these files from the Windows server to Hadoop HDFS? I came across Flume and Kafka but could not find a proper example. Can anyone shed some light here?

Each file has a separate name and is up to 10-20 MB in size, and the daily file count is more than 200 files. Once the files land on the Windows server, Flume/Kafka should be able to put them into Hadoop. Later the files are read from HDFS, processed by Spark, and the processed files are moved to another folder in HDFS.

2
More details please, size of files? What are you hoping to do with this data? - AM_Hawk

2 Answers

1
vote

Flume is the best choice here. A Flume agent (a long-running process) needs to be configured, and it has three parts:

Flume source - where Flume looks for new files; c:/program files(x86)/webapps/apachetomcat/.csv in your case.

Flume sink - where Flume sends the files; your HDFS location in this case.

Flume channel - a temporary holding place for the data before it is delivered to the sink. A file channel is the right choice for your case.

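As a rough sketch (not a drop-in config), a spooling-directory agent could look like the one below. The agent, source, channel, and sink names (agent1, csvSource, fileChannel, hdfsSink), the checkpoint/data directories, and the namenode address are all placeholders; hdfs.path and the Windows folder would need to match your cluster and your Tomcat directory.

# agent1 watches the Tomcat folder and ships each file's contents to HDFS
agent1.sources = csvSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Spooling Directory source: picks up new files dropped into spoolDir
agent1.sources.csvSource.type = spooldir
agent1.sources.csvSource.spoolDir = c:/program files(x86)/webapps/apachetomcat
# only ingest .csv files (includePattern needs a recent Flume release)
agent1.sources.csvSource.includePattern = ^.*\.csv$
agent1.sources.csvSource.channels = fileChannel

# File channel: buffers events on local disk until the sink has written them
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = c:/flume/checkpoint
agent1.channels.fileChannel.dataDirs = c:/flume/data

# HDFS sink: writes the data into a dated landing directory in HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/incoming/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# roll a new HDFS file every 5 minutes (tune to taste)
agent1.sinks.hdfsSink.hdfs.rollInterval = 300
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0
agent1.sinks.hdfsSink.channel = fileChannel

The agent has to run on the Windows box so the source can see the local folder, started with something like flume-ng agent --conf conf --conf-file csv-to-hdfs.conf --name agent1 (file name assumed). Note that the spooling directory source expects files to be complete and immutable once they appear in the directory, and it renames each file after ingest (by default with a .COMPLETED suffix).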

0
votes

As per my comment, more details would help narrow down the possibilities. As a first thought: move the files to a server and just create a bash script scheduled with cron (a sketch follows the put reference below).

put

Usage: hdfs dfs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
Exit Code:

Returns 0 on success and -1 on error.
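To tie that together, a cron-driven script might look like the sketch below. It assumes the CSVs are first copied from the Windows server to a Linux edge node (for example with scp or rsync); the directory names, script path, and schedule are all placeholders.

#!/bin/bash
# Sketch: load new CSV files from a local staging directory into HDFS.
# LANDING_DIR, HDFS_DIR and ARCHIVE_DIR are assumed locations.
LANDING_DIR=/data/landing/csv
HDFS_DIR=/user/hadoop/incoming
ARCHIVE_DIR=/data/landing/archive

hdfs dfs -mkdir -p "$HDFS_DIR"

for f in "$LANDING_DIR"/*.csv; do
    [ -e "$f" ] || continue                # nothing to do if no files matched
    if hdfs dfs -put -f "$f" "$HDFS_DIR"/; then
        mv "$f" "$ARCHIVE_DIR"/            # archive locally only after a successful put
    else
        echo "put failed for $f" >&2
    fi
done

It could then be scheduled with a crontab entry such as the following (every 15 minutes; paths assumed):

*/15 * * * * /opt/scripts/load_csv_to_hdfs.sh >> /var/log/load_csv_to_hdfs.log 2>&1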