I have a very basic index with 2 fields - a numeric ID field and a ~60 to 100 character long string field. The string field contains DNA sequences, with no spaces.
So for example, a given field value is: AATCTAGATACGAGATCGATCGATCGATCGATCGATCGATGCTAGC
and a searchstring would be something like: GATCGATCGA
There are over 7 million rows, and the index comes in at about 1GB.
I am storing the index in Azure Blob Storage, and running a simple web app that queries the index on a B1 web app instance.
No matter what I do, regardless of the size of the search string, I cannot get the operation to run faster than 20-21 seconds.
I've tried scaling up to a B3 instance, but it still comes in at around 20 seconds.
I've isolated the bottleneck to when the query is run against IndexSearcher.
To search, I attach a wildcard to the beginning and end of my search string.

My code is as follows:
// One analyzer instance is enough; reuse it in the parser.
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "nucleotide", analyzer);
parser.AllowLeadingWildcard = true;

var storageAccount = CloudStorageAccount.Parse("connection info");
// Cache the blob-backed index through a local RAMDirectory.
var azureDir = new AzureDirectory(storageAccount, "myindex", new RAMDirectory());
var searcher = new IndexSearcher(azureDir, true); // true = open read-only

var query = parser.Parse("*" + mystring + "*");
TopDocs hits = searcher.Search(query, 50);
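For context on why this is slow: a query with a leading wildcard cannot seek into the sorted term dictionary by prefix, so the searcher must enumerate every indexed term and test each one against the pattern. A rough sketch of the equivalent work (illustrative Python, not Lucene's actual implementation; the example terms are made up):

```python
# Illustrative only: the brute-force scan that a *term* query
# effectively forces, since the sorted term dictionary cannot
# be used to seek to a known prefix.
def wildcard_scan(terms, needle):
    """Return every term containing `needle` (i.e. matching *needle*)."""
    return [t for t in terms if needle in t]  # visits ALL terms

terms = ["AATCTAGATACGAG", "GATCGATCGATCGA", "TTTTGGGGCCCCAA"]
print(wildcard_scan(terms, "GATCGATCGA"))
```

With ~7 million documents each contributing a unique sequence term, that scan touches millions of terms per query, which is consistent with a flat ~20 seconds regardless of instance size.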
*term* requires a complete iteration of all terms, which will take forever. I am guessing you have a lot of terms. Could you perhaps split it up during indexing to GA+TC+... and then search those using positional information? (Like a PhraseQuery) – sisve
That will always be a slow search. I'm pretty impressed it comes back that quickly, considering. – Greg D