I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic MapReduce (EMR). I need to use some static files from within my UDFs.

I do something like this in my UDF:

import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    public DataBag exec(Tuple input) throws IOException {
        ...
        // read the cached copy symlinked into the task's working directory
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }

    // tell Pig to ship this S3 file via the distributed cache;
    // the part after '#' is the symlink name in the working directory
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}

I have stored the file in my S3 bucket at /path/to/myfile.txt.

However, on running my Pig job, I see an exception:

Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)

So, my question is: how do I use distributed cache files when running a Pig script on Amazon's EMR?

EDIT: I figured out that pig-0.6, unlike pig-0.9, does not have a function called getCacheFiles(). Amazon does not support pig-0.9, so I need to figure out a different way to get the distributed cache to work in 0.6.
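One workaround I am considering for 0.6 is to skip the distributed cache entirely and read the file straight from S3 inside the UDF via Hadoop's FileSystem API. This is only a minimal sketch (the class name and bucket path are placeholders, and it assumes the cluster's Hadoop configuration already carries the S3 credentials):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3SideFileReader {
    // hypothetical location: replace with your own bucket and key
    private static final String FILE_URI = "s3n://my-bucket/path/to/myfile.txt";

    // reads the whole side file from S3; assumes the Hadoop configuration
    // on the cluster already carries the S3 access credentials
    public static List<String> readLines() throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(FILE_URI), conf);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(FILE_URI))));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            reader.close();
        }
    }
}

Since exec() runs once per tuple, the result of a helper like this should be loaded once and cached in a field of the UDF rather than re-read on every call.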

Maybe you already know it, but for others: Amazon now supports Pig 0.6 and 0.9.1 (aws.amazon.com/elasticmapreduce/faqs/#pig-7). – Jorge González Lorenzo

1 Answer


I think adding this extra arg to the Pig command line call should work (with s3 or s3n, depending on where your file is stored):

-cacheFile s3n://bucket_name/file_name#cache_file_name

You should be able to add that in the "Extra Args" box when creating the Job flow.
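For the file from the question, that argument would presumably look like this (the name after the # has to match the local name the UDF opens, here ./myfile.txt):

-cacheFile s3n://path/to/myfile.txt#myfile.txt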