The larger purpose of what I have to develop is as follows:
a) A dashboard where, apart from other features, users can upload documents (.pdf, .txt, .doc). All these documents go to a particular directory.
b) The users can also query all the documents tagged with a particular keyword.
Now, I wish to use Hadoop to perform the tagging of documents. I aim to implement this by using a dictionary of selected words. A .txt (or maybe a .doc file as well) would be easy to process. However, from my understanding, a .pdf file cannot be processed directly. I have learnt how to use Apache PDFBox. However, I am not able to integrate the two, i.e. Hadoop and PDFBox. What I want is for my MapReduce program to receive the corpus of .txt/.pdf/.doc files as input and, before the Map phase starts, perform the conversion from PDF to text.
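To make the tagging step concrete, here is a minimal sketch in plain Java of what I have in mind for a single document once its text is available. It uses neither Hadoop nor PDFBox; the class name DictionaryTagger, the method tagsFor, and the sample dictionary are hypothetical names used only for illustration. In the Map step, I assume each returned tag would be emitted as a key with the document name as its value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: finds which dictionary words occur in a document's text.
final class DictionaryTagger {
    private final Set<String> dictionary = new HashSet<>();

    DictionaryTagger(Set<String> words) {
        // Store the dictionary in lower case so matching is case-insensitive.
        for (String word : words) {
            dictionary.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the dictionary words found in the text, in sorted order.
    Set<String> tagsFor(String text) {
        Set<String> tags = new TreeSet<>();
        // Split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (dictionary.contains(token)) {
                tags.add(token);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Sample dictionary and text are made up for this example.
        DictionaryTagger tagger = new DictionaryTagger(
                new HashSet<>(Arrays.asList("hadoop", "pdf", "spark")));
        Set<String> tags = tagger.tagsFor("Hadoop runs MapReduce jobs; the PDF is parsed first.");
        System.out.println(tags); // prints [hadoop, pdf]
    }
}
```

The part I am missing is how to get the text of a .pdf into something like tagsFor from inside the MapReduce job.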
How do I go about it? Am I thinking in the right direction? Please help.