We have a customer who uses a Google Search Appliance (GSA) to search thousands of PDF files. The PDFs are stored on a file share, organized in subfolders. The GSA regularly picks up new files and adds them to its database.
The GSA does not work well enough for them, so they are now looking for alternatives. For example, their GSA does not properly search vertical text in PDFs. We've looked at Apache Lucene and Solr, together with Tika and the ExtractingRequestHandler.
I've got the Solr example up and running and added a PDF file using curl. It can be searched, including the vertical text. Our customer wants the application to detect new files automatically; it would be fine if the index were refreshed every 15 minutes, or maybe every hour.
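For reference, this is roughly what the curl upload looks like as a small Python sketch. The base URL is an assumption taken from a default local Solr example (newer installations include a core name in the path), and the helper names build_extract_url and index_pdf are hypothetical. It uses the file path as the document id and sends the raw PDF bytes to the ExtractingRequestHandler.

```python
import urllib.parse
import urllib.request
from pathlib import Path

# Assumed endpoint of a local Solr example; adjust host, port, and core name.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, commit=True, base_url=SOLR_EXTRACT_URL):
    """Build the /update/extract URL, setting the id field via literal.id."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urllib.parse.urlencode(params)


def index_pdf(path, base_url=SOLR_EXTRACT_URL):
    """POST one PDF to Solr as a raw request body; returns the HTTP status."""
    path = Path(path)
    request = urllib.request.Request(
        build_extract_url(str(path), base_url=base_url),
        data=path.read_bytes(),
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Using the path as the id assumes that id is the uniqueKey field in the schema, as it is in the Solr example. In that case posting the same file again replaces the existing document instead of creating a duplicate.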
So I'm thinking of writing a shell script that finds new files and adds them, or something like that. Maybe it could query Solr before adding a file to see whether that file is already indexed. Would that make sense?
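To make the idea concrete, here is a minimal sketch of such a script, written in Python rather than shell. Instead of querying Solr for each file, it remembers when it last ran and only picks up PDFs whose modification time is newer. All names (find_changed_pdfs, run_once, the state file) are hypothetical, and index_file stands for whatever function actually sends one file to Solr.

```python
import os
import time
from pathlib import Path


def find_changed_pdfs(root, since):
    """Return sorted PDF paths under root whose mtime is later than `since`."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) > since:
                changed.append(full)
    return sorted(changed)


def read_last_run(state_file):
    """Read the timestamp of the previous run; 0.0 means 'index everything'."""
    try:
        return float(Path(state_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0.0


def run_once(root, state_file, index_file):
    """Index every PDF changed since the last run, then record this run."""
    started = time.time()  # taken before scanning so later changes are not lost
    paths = find_changed_pdfs(root, read_last_run(state_file))
    for path in paths:
        index_file(path)  # hypothetical callable that posts one file to Solr
    Path(state_file).write_text(repr(started))
    return paths
```

Something like this could be run from cron every 15 minutes. One caveat: a file copied onto the share with its old modification time preserved would be missed, so an occasional full re-index (deleting the state file) may still be needed. Deleted files are not handled here either.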
Also, is Solr even the right tool for what we want to do?