Pig - load Word documents (.doc & .docx) with pig

Question

I can't load Microsoft Word documents (.doc or .docx) with Pig. Whether I use TextLoader(), PigStorage(), or no loader at all, it doesn't work: the output is just garbled symbols.

I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to write one at the moment.

I would like to put all the .doc file content in a single chararray bag so I could later use a filter function to process it.

How can I do this?

Thanks

mr2ert · Accepted Answer · 2013-08-29T17:01:33

They are right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write a load UDF so that Pig can read the files directly, or do some preprocessing to convert all the .doc and .docx files into .txt files and have Pig load those instead. This link may help you get started in finding a way to convert the files.
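For the preprocessing route, a minimal sketch of a .docx-to-text converter using only the Java standard library might look like the following. It relies on the fact that a .docx file is a ZIP archive whose main body text is stored in word/document.xml. The class name DocxToText and its method names are made up for illustration, and the regex-based extraction is deliberately crude: it ignores headers, footnotes, and numeric character references. The older binary .doc format is not handled here and would need a dedicated parser library such as Apache POI.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: converts one .docx file to plain text before Pig loads it.
final class DocxToText {
    // Matches either a text run (<w:t ...>text</w:t>) or the end of a paragraph (</w:p>).
    private static final Pattern TOKEN =
        Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>|</w:p>");

    // Extracts the visible text from the contents of word/document.xml,
    // emitting one line per paragraph.
    static String textFromDocumentXml(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(unescape(m.group(1)));
            } else {
                out.append('\n');
            }
        }
        return out.toString();
    }

    // Decodes the five predefined XML entities; &amp; goes last so that
    // an escaped ampersand is not decoded twice.
    private static String unescape(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    // Opens the .docx as a ZIP archive and reads its main document part.
    static String readDocx(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("Not a .docx file (no word/document.xml): " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                return textFromDocumentXml(xml);
            }
        }
    }

    // Usage: java DocxToText input.docx output.txt
    public static void main(String[] args) throws IOException {
        String text = readDocx(args[0]);
        Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Run once per file (for example from a shell loop), this produces .txt files that TextLoader() or PigStorage() can then read as ordinary text.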

However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.

Update: Here are a couple of resources I've used for writing my Java load UDFs in the past. One, Two.
