The larger purpose of what I have to develop is as follows:

a) A dashboard where, apart from other features, users can upload documents (.pdf, .txt, .doc). All these documents go to a particular directory.

b) The users can also query all the documents tagged with a particular keyword.

Now, I wish to use Hadoop to perform the tagging of the documents. I aim to implement this using a dictionary of selected words. A .txt file (or maybe a .doc file as well) would be easy to process. However, from my understanding, a .pdf file cannot be processed directly. I have learnt how to use Apache PDFBox, but I am not able to integrate the two, i.e. Hadoop and PDFBox. What I want is for my MapReduce program to receive the corpus of .txt/.pdf/.doc files as input and, before the Map phase starts, perform this conversion of PDF to text.
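For reference, the standalone conversion I have learnt looks roughly like this (a minimal sketch assuming PDFBox 2.x; the class name and the two command-line paths are just placeholders):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToText {

    // Extract all text from a PDF file using PDFBox.
    public static String extract(File pdf) throws IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input .pdf, args[1] = output .txt (placeholders)
        String text = extract(new File(args[0]));
        Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
    }
}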

How do I go about this? Am I thinking in the right direction? Please help.

I'm not sure where Hadoop would come into play here, but if you're aiming to index and query a corpus of documents, maybe you're looking for Apache Lucene/Solr? – jkovacs

1 Answer


I presume you have millions or billions of documents and want to classify them, whether they are .pdf, .txt, .doc, and so on. But your actual problem is how to use Apache PDFBox in a Mapper, right? Here is a link on how to load a user library in Hadoop: LINK
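To make this concrete, here is a rough sketch of such a Mapper, not a drop-in solution: it assumes each input line is an HDFS path to one document, uses the PDFBox 2.x API, and KEYWORDS stands in for your dictionary of selected words. It emits (keyword, fileName) pairs that a reducer can then aggregate into tags.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class TagMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Placeholder dictionary; in practice load it from a config file or the DistributedCache.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("hadoop", "finance", "legal"));

    @Override
    protected void map(LongWritable offset, Text pathLine, Context context)
            throws IOException, InterruptedException {
        Path file = new Path(pathLine.toString().trim());
        FileSystem fs = FileSystem.get(context.getConfiguration());

        String text;
        try (InputStream in = fs.open(file)) {
            if (file.getName().toLowerCase().endsWith(".pdf")) {
                // PDF: extract the text with PDFBox before tagging.
                try (PDDocument doc = PDDocument.load(in)) {
                    text = new PDFTextStripper().getText(doc);
                }
            } else {
                // .txt (and, as a crude approximation, .doc): read the raw bytes.
                text = readAll(in);
            }
        }

        // Emit (keyword, fileName) for every dictionary word found in the document.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (KEYWORDS.contains(token)) {
                context.write(new Text(token), new Text(file.getName()));
            }
        }
    }

    private static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toString("UTF-8");
    }
}

Note that .doc is also a binary format, so in practice you would extract its text with a library such as Apache POI rather than reading raw bytes. To make PDFBox (and any such library) available on the cluster, the usual options are bundling it into your job jar or passing it with the -libjars option when submitting the job, which is the problem the link above addresses.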