Good morning all, I am currently working on a machine learning project whose goal is supervised classification. My data is a large number of PDF files, each belonging to a specific class, and I want to use these files as a training dataset so that I can predict the class of new files. My problem is that I don't know how to build the training dataset: the classification algorithm must train on the content of each file, but my training DataFrame currently contains only the name of each file and its class. How do I include the content of each PDF file in my training DataFrame? Thank you in advance for your help.
1 Answer
PDF files usually contain a mix of text, images, charts and other elements, so they cannot be fed directly to a machine learning algorithm, which expects vectors of numbers. First you need to extract the information of interest from your files.
In this regard, you might want to first try some libraries that extract information from PDFs and see what you get. For Python, a good start can be PyPDF2. You can find a tutorial here. If this does not work as expected, my advice would be to try an OCR tool, which reads the PDF as an image in order to extract the text. In Python, pytesseract is one of the most used, but it is not the only one.
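To make this concrete, here is a minimal sketch of how the extracted text could end up as a column next to the file name and class. It is only an illustration under a few assumptions: it assumes a recent PyPDF2 (2.x or later, where `PdfReader` is available), it assumes the `pdf2image` package to turn pages into images for pytesseract (that package is my addition), and the helper names, the column names and the `min_chars` threshold are all made up for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def extract_text_pypdf2(path: str) -> str:
    """Read the embedded text layer of a PDF (assumes PyPDF2 >= 2.0)."""
    from PyPDF2 import PdfReader  # imported lazily; only needed when called

    reader = PdfReader(path)
    # extract_text() can return None or "" for pages without a text layer
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_text_ocr(path: str) -> str:
    """OCR fallback: render each page as an image, then run Tesseract on it.

    Assumes the pdf2image and pytesseract packages (and their system
    dependencies, Poppler and Tesseract) are installed.
    """
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)


def extract_text(
    path: str,
    primary: Callable[[str], str] = extract_text_pypdf2,
    fallback: Callable[[str], str] = extract_text_ocr,
    min_chars: int = 20,  # hypothetical threshold for "almost no text found"
) -> str:
    """Try the text layer first; fall back to OCR if it fails or is nearly empty."""
    try:
        text = primary(path)
    except Exception:
        text = ""
    if len(text.strip()) < min_chars:
        text = fallback(path)
    return text


def build_rows(
    labelled_files: Iterable[Tuple[str, str]],
    extractor: Callable[[str], str] = extract_text,
) -> List[Dict[str, str]]:
    """Turn (filename, label) pairs into rows that also carry the file content."""
    return [
        {"filename": name, "label": label, "text": extractor(name)}
        for name, label in labelled_files
    ]


def to_dataframe(rows: List[Dict[str, str]]):
    """Convert the rows to a pandas DataFrame (assumes pandas is installed)."""
    import pandas as pd

    return pd.DataFrame(rows, columns=["filename", "label", "text"])
```

With your existing frame you could call something like `to_dataframe(build_rows(zip(df["filename"], df["label"])))`, adjusting the column names to yours. The `text` column is still raw text, so before training you would turn it into numeric features, for example with a bag-of-words or TF-IDF vectorizer.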