We have a customer that's using a Google Search Appliance (GSA) for searching thousands of PDF files. The PDF files are located on a file share organized in sub folders. It regularly finds new files and adds them to its database.

The GSA does not work well enough for them, so they need an alternative. For example, it does not properly search vertical text in PDFs. We've looked at Apache Lucene and Solr together with Tika and the ExtractingRequestHandler.

I've got the Solr example up and running and added a PDF file using curl; it can be searched, including its vertical text. Our customer wants the app to detect new files automatically; it would be nice if I could re-index the database every 15 minutes or maybe every hour.

So I'm thinking about writing a shell script to find new files and add them, or something like that. Maybe it could query Solr before adding the files to see if they're already indexed. Would that make sense?

Also, is Solr even the right tool for what we want to do?

I recently made something like what you are looking for: a simple Java application based on SolrJ. It could run, say, every night, which I suppose would be the most efficient. However, you could extend the application with caching-like capabilities backed by a small database (hash-based or something) to optimize it. - Alex van den Hoogen
Hi Alex, that looks neat. Thanks for that! I think it's very close. - Simon Fredsted
I just updated my app with support for delta updates, although I haven't tested it on a large file set yet. - Alex van den Hoogen
Really nice app, @AlexvandenHoogen. I think a first speed-up would be to make the SolrClient (or SolrServer, if you are using a Solr version older than 5.x) a class field, so you don't need to re-create it every time you push a new file. - Mistre83
@Mistre83 Thank you. Sure enough, and not committing every time I insert a new document into the index would greatly increase performance too. However, at the moment I'm not actively developing the app because of other commitments. I could probably do an updated version in the next few months. - Alex van den Hoogen

1 Answer

What you are describing is called "delta indexing": only newly added or changed documents are indexed, rather than the whole collection. See the Solr documentation for more information.
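
As a rough illustration of the idea (not a tested production setup), here is a minimal Python sketch of a delta-indexing pass that could be run from cron every 15 minutes. It assumes the ExtractingRequestHandler is reachable at `/update/extract` as in the Solr example, that `id` is the unique key so re-posting a file replaces its old entry, and that file modification times are reliable on the share. The URL, the function names, and the state file are all made up for the example.

```python
import os
import time
import urllib.parse
import urllib.request

# Assumed endpoint; adjust host, port, and core name to your installation.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def find_changed_pdfs(root, since):
    """Return sorted paths of PDFs under root modified after `since` (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.lower().endswith(".pdf") and os.path.getmtime(path) > since:
                changed.append(path)
    return sorted(changed)


def build_extract_url(base_url, doc_id, commit_within_ms=60000):
    """Build the request URL; commitWithin lets Solr batch commits itself."""
    params = {"literal.id": doc_id, "commitWithin": str(commit_within_ms)}
    return base_url + "?" + urllib.parse.urlencode(params)


def index_pdf(path, base_url=SOLR_EXTRACT_URL):
    """POST one PDF's raw bytes to Solr, using the file path as the document id."""
    with open(path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_extract_url(base_url, path),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def run_once(root, state_file):
    """Index everything changed since the timestamp stored in state_file."""
    started = time.time()
    try:
        with open(state_file) as f:
            since = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        since = 0.0  # first run, or unreadable state: index everything
    for path in find_changed_pdfs(root, since):
        index_pdf(path)
    # Store the start time so files changed during this run are picked up next time.
    with open(state_file, "w") as f:
        f.write(str(started))
```

Because the file path is used as the id, there is no need to query Solr first: a changed file simply overwrites its previous version. This sketch does not handle deleted files, and files copied onto the share with an old modification time would be missed, so a hash- or database-based check like the one mentioned in the comments may be more robust.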