1 vote

I have a very basic index with two fields: a numeric ID field and a string field roughly 60 to 100 characters long. The string field contains DNA sequences, with no spaces.

So for example, a given field value is: AATCTAGATACGAGATCGATCGATCGATCGATCGATCGATGCTAGC

and a search string would be something like: GATCGATCGA

There are over 7 million rows, and the index comes in at about 1GB.

I am storing the index in azure blob storage, and running a simple web app that queries the index on a B1 web app instance.

No matter what I do, regardless of the size of the search string, I cannot get the operation to run faster than 20-21 seconds.

I've tried scaling up to a B3 instance, but it still comes in at 20 seconds.

I've isolated the bottleneck to the point where the query is run against the IndexSearcher.

To search, I attach a wildcard to the beginning and end of my search string.

My code is as follows:

    // Reuse one analyzer for the parser; leading wildcards must be enabled explicitly
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
    var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "nucleotide", analyzer);
    parser.AllowLeadingWildcard = true;

    var storageAccount = CloudStorageAccount.Parse("connection info");
    var azureDir = new AzureDirectory(storageAccount, "myindex", new RAMDirectory());
    var searcher = new IndexSearcher(azureDir, true);

    // Double wildcard: *GATCGATCGA*
    var query = parser.Parse("*" + mystring + "*");
    TopDocs hits = searcher.Search(query, 50);
Your search for *term* requires a complete iteration of all terms, which will take forever. I am guessing you have a lot of terms. Could you perhaps split it up during indexing into GA+TC+... and then search those using positional information? (Like a PhraseQuery; see the sketch after these comments.) – sisve
I know I'm stating the obvious, but starting a search with a wildcard is going to be slow. That is still 350 rows per millisecond, and given it has to touch every row, that is not bad. Maybe try a map to byte and regex, but I'm still not sure that would beat 350 rows per millisecond. – paparazzo
*x* will always be a slow search. I'm pretty impressed it comes back that quickly, considering. – Greg D
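
A minimal sketch of the approach sisve suggests, assuming Lucene.Net 3.0.3: index each base as its own single-character term so positions are preserved, then search with a PhraseQuery instead of a double wildcard. The RAMDirectory, field names, and document values here are stand-ins, not the asker's actual setup.

    // Index time: one term per base, separated by spaces, so WhitespaceAnalyzer
    // assigns consecutive positions ("A A T C ..." -> positions 0, 1, 2, 3 ...).
    var dir = new RAMDirectory(); // stand-in for the real AzureDirectory
    var analyzer = new WhitespaceAnalyzer();
    using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        string sequence = "AATCTAGATACGAGATCGATCGATCGATCGATCGATCGATGCTAGC";
        var doc = new Document();
        doc.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("nucleotide", string.Join(" ", sequence.ToCharArray()),
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }

    // Query time: a PhraseQuery over single-base terms matches only when the
    // bases occur at consecutive positions -- an exact substring match with
    // no wildcard and no full term enumeration.
    var phrase = new PhraseQuery();
    foreach (char c in "GATCGATCGA")
        phrase.Add(new Term("nucleotide", c.ToString()));

    using (var searcher = new IndexSearcher(dir, true))
    {
        TopDocs hits = searcher.Search(phrase, 50);
    }

The trade-off is a larger index (a position for every base), but the query no longer has to enumerate every term in the index the way *term* does.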

1 Answer

0 votes

This is not a Lucene answer at all; once you have seen it, just comment and I will delete it. But here is a SQL solution.

Table
int32    seqID
tinyint  pos
char(1)  val

with the first two as a composite PK

then you just build up the query:

select distinct t1.seqID 
  from table t1 
  join table t2 
          on t2.seqID = t1.seqID 
         and t2.pos   = t1.pos + 1  
         and t1.val   = 'val1'
         and t2.val   = 'val2' 
  join table t3 
          on t3.seqID = t1.seqID 
         and t3.pos   = t1.pos + 2  
         and t3.val   = 'val3' 
  join table t4 
          on t4.seqID = t1.seqID 
         and t4.pos   = t1.pos + 3  
         and t4.val   = 'val4'  
   ...

I know that may look crazy, but SQL has the composite index to drive all those joins and should filter early. It still touches every row, but via the index, character by character, and it gives up on a candidate sequence as soon as one character fails to match.
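
If it helps, here is a minimal sketch of building that statement from a search string. The table and column names follow the schema above; the alphabet check makes inlining the characters safe, since the input can only ever be A, C, G, or T.

    // Hypothetical helper: expands a search string like "GATC..." into the
    // chained self-joins shown above (t1.val moved to a WHERE clause, which
    // is equivalent). Assumes using System.Text and System.Text.RegularExpressions.
    static string BuildSequenceQuery(string search)
    {
        if (!Regex.IsMatch(search, "^[ACGT]+$"))
            throw new ArgumentException("search must contain only A, C, G, T");

        var sb = new StringBuilder();
        sb.AppendLine("select distinct t1.seqID");
        sb.AppendLine("  from table t1");
        for (int i = 1; i < search.Length; i++)
        {
            string t = "t" + (i + 1);
            sb.AppendLine("  join table " + t);
            sb.AppendLine("          on " + t + ".seqID = t1.seqID");
            sb.AppendLine("         and " + t + ".pos   = t1.pos + " + i);
            sb.AppendLine("         and " + t + ".val   = '" + search[i] + "'");
        }
        sb.AppendLine(" where t1.val = '" + search[0] + "'");
        return sb.ToString();
    }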

Like I said in a comment, I would also try a brute-force regex, but I doubt it will beat 350 rows/ms since it has to touch every row. And you could not map to byte like I said in the comment, since regex is a text search.
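
For reference, the brute-force scan would look something like this; sequences is a hypothetical List&lt;string&gt; holding the 7 million rows in memory, and for a literal pattern a plain Contains does the same work as the regex with less overhead.

    // Brute force over every row; both forms bail out of a row on first failure.
    var pattern = "GATCGATCGA";
    var regex = new Regex(pattern); // literal pattern, no metacharacters
    var viaRegex    = sequences.Where(s => regex.IsMatch(s)).ToList();
    var viaContains = sequences.Where(s => s.Contains(pattern)).ToList(); // same result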

The other option is a DNA class that uses a byte array internally, with a Like method that does a byte compare, but I also doubt that will beat 350 rows/ms.
byte array pattern match
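
A rough sketch of what that class could look like; the Dna name and the Like method are this answer's hypothetical design, not an existing library.

    // Stores the sequence as raw bytes and does a naive byte-by-byte scan.
    class Dna
    {
        private readonly byte[] _bases;

        public Dna(string sequence)
        {
            _bases = System.Text.Encoding.ASCII.GetBytes(sequence);
        }

        // True if 'pattern' occurs anywhere in the sequence. The inner loop
        // gives up on a starting position as soon as one byte fails to match.
        public bool Like(byte[] pattern)
        {
            for (int i = 0; i <= _bases.Length - pattern.Length; i++)
            {
                int j = 0;
                while (j < pattern.Length && _bases[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return true;
            }
            return false;
        }
    }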