10
votes

I am trying to read files from a directory which contains many sub directories. The data is in S3 and I am trying to do this:

val rdd = sc.newAPIHadoopFile(data_loc,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text])

this does not seem to work.

Appreciate the help

2
Have you tried just using textFile("s3n://<root_dir>/*") ? - Soumya Simanta
Yes, I tried that; it does not work. - venuktan
Please post an example of how the directories are nested. There is probably a solution involving simple wildcards, like: s3n://bucket/*/*/*. - Nick Chammas
Yes, that works, thank you: s3n://bucket/root_dir/*/*/* for year, month, date. But does something like this work: s3n://bucket/root_dir/*/data/*/*/*, i.e. a directory inside every sub directory? - venuktan

2 Answers

14
votes

Yes, it works (it took a while to get the individual blocks/splits, though). To match a specific directory in every sub directory, use: s3n://bucket/root_dir/*/data/*/*/*
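A minimal sketch of this approach (the bucket and path are the placeholders from the question; assumes a configured SparkContext sc with S3 credentials set up):

```scala
// Each '*' matches one path segment, so this reads every file under
// root_dir/<anything>/data/<anything>/<anything>/<anything>
val rdd = sc.textFile("s3n://bucket/root_dir/*/data/*/*/*")

// Hadoop's glob syntax also supports alternatives, e.g. restricting
// the first segment to two specific years:
// val subset = sc.textFile("s3n://bucket/root_dir/{2013,2014}/data/*/*/*")
```

The same glob pattern works as the path argument to newAPIHadoopFile, since both go through Hadoop's path globbing.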

-1
votes

OK, try this:

hadoop fs -lsr
drwxr-xr-x   - venuktangirala supergroup          0 2014-02-11 16:30 /user/venuktangirala/-p
drwxr-xr-x   - venuktangirala supergroup          0 2014-04-15 17:00 /user/venuktangirala/.Trash
drwx------   - venuktangirala supergroup          0 2015-02-11 16:16 /user/venuktangirala/.staging
-rw-rw-rw-   1 venuktangirala supergroup      19823 2013-10-24 14:34 /user/venuktangirala/data
drwxr-xr-x   - venuktangirala supergroup          0 2014-02-12 22:50 /user/venuktangirala/pandora

-lsr lists recursively; then parse out the entries whose permission string does not start with "d" (i.e. the files, not the directories).
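As a sketch, that filtering step can be done with awk. Here listing.txt is a stand-in for the real hadoop fs -lsr output shown above; in practice you would pipe the command straight into awk:

```shell
# Save a sample of the listing (in practice: hadoop fs -lsr /user/venuktangirala > listing.txt)
cat > listing.txt <<'EOF'
drwxr-xr-x   - venuktangirala supergroup          0 2014-02-11 16:30 /user/venuktangirala/-p
-rw-rw-rw-   1 venuktangirala supergroup      19823 2013-10-24 14:34 /user/venuktangirala/data
drwxr-xr-x   - venuktangirala supergroup          0 2014-02-12 22:50 /user/venuktangirala/pandora
EOF

# Keep only entries whose permission field does not start with "d"
# (regular files) and print the path, which is the last field
awk '$1 !~ /^d/ {print $NF}' listing.txt
# prints /user/venuktangirala/data
```

The resulting file paths can then be passed to sc.textFile, joined with commas.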