I recently upgraded my search code from Lucene.Net 2.9.4 to 3.0.3. The spatial packages changed between these versions, and I updated my code accordingly. One drawback of the upgrade is much slower indexing. Through a process of elimination, I have narrowed the slowness down to the new spatial code that indexes the lat/long coordinates:

    public void AddLocation(double lat, double lng)
    {
        try
        {
            string latLongKey = lat.ToString() + "," + lng.ToString();
            AbstractField[] shapeFields = null;
            Shape shape = null;
            if (HasSpatialShapes(latLongKey))
            {
                shape = SpatialShapes[latLongKey];
            }
            else
            {
                if (this.Strategy is BBoxStrategy)
                {
                    shape = Context.MakeRectangle(DistanceUtils.NormLonDEG(lng), DistanceUtils.NormLonDEG(lng), DistanceUtils.NormLatDEG(lat), DistanceUtils.NormLatDEG(lat));
                }
                else
                {
                    shape = Context.MakePoint(DistanceUtils.NormLonDEG(lng), DistanceUtils.NormLatDEG(lat));
                }

                AddSpatialShapes(latLongKey, shape);
            }

            shapeFields = Strategy.CreateIndexableFields(shape);
            //Potentially more than one shape in this field is supported by some
            // strategies; see the javadocs of the SpatialStrategy impl to see.
            foreach (AbstractField f in shapeFields)
            {
                _document.Add(f);
            }
            //add lat long values to index too
            _document.Add(GetField("latitude", NumericUtils.DoubleToPrefixCoded(lat), Field.Index.NOT_ANALYZED, Field.Store.YES, 0f, false));
            _document.Add(GetField("longitude", NumericUtils.DoubleToPrefixCoded(lng), Field.Index.NOT_ANALYZED, Field.Store.YES, 0f, false));
        }
        catch (Exception e)
        {
            RollingFileLogger.Instance.LogException(ServiceConstants.SERVICE_INDEXER_CONST, "Document",string.Format("AddLocation({0},{1})", lat.ToString(), lng.ToString()), e, null);
            throw; // rethrow without resetting the stack trace
        }
    }

With 2.9.4, I was able to index about 300,000 rows of data with lat/lng points in about 11 minutes. With the new spatial package it takes upwards of 5 hours (I killed the test before it finished, so I don't have an exact timing). Here is the spatial context/strategy I am using:

   public static SpatialContext SpatialContext
   {
       get
       {
           if (null == _spatialContext)
           {
               lock (_lockObject)
               {
                   if(null==_spatialContext) _spatialContext = SpatialContext.GEO;
               }
           }
           return _spatialContext;
       }
   }

   public static SpatialStrategy SpatialStrategy
   {
       get
       {
           if (null == _spatialStrategy)
           {
               lock (_lockObject)
               {
                   if (null == _spatialStrategy)
                   {
                       int maxLength = 9;
                       GeohashPrefixTree geohashPrefixTree = new GeohashPrefixTree(SpatialContext, maxLength);
                       _spatialStrategy = new RecursivePrefixTreeStrategy(geohashPrefixTree, "geoField");                           
                   }
               }
           }
           return _spatialStrategy;
       }
   }

Is there something wrong with my indexing approach? I cache the shapes created from the lat/lng points, since I don't need a new shape for the same coordinates. The CreateIndexableFields() method appears to take the most time during indexing. I've tried caching the fields this method generates so they can be reused, but I can't create a new instance of the TokenStream from the cached field to use in a new Document (in Lucene.Net 3.0.3 the constructor for TokenStream is protected). I've also lowered maxLevels (the maxLength value above) to 4 in the spatial strategy, but I haven't seen an improvement in indexing times. Any feedback would be greatly appreciated.

Why don't you ask the Lucene.Net community directly at user AT lucenenet.apache.org (lucenenet.apache.org)? - I4V
Thanks, I did look at that community; unfortunately there is not much information there about the latest spatial package. I was hoping some spatial4j devs might take a look at this. - a.rod
Have you debugged this? Lucene uses exceptions for control flow, and they slow down indexing and search. What I did was avoid them where possible (i.e. pass in parameters in such a way that exceptions are not thrown, e.g. no empty directory, which makes Lucene throw an exception to create the folder for the directory). I actually changed the control flow not to use exceptions. That is advanced, but it improved the performance of my application by 10x. - Bart Czernicki
@BartCzernicki - I have a try/catch clause in the AddLocation() method and it doesn't log any errors. Do you mean debugging into the spatial and/or Lucene.Net libraries? Do you think those libraries are handling exceptions internally, and that this is causing the slowdown? Out of curiosity, are you using RecursivePrefixTreeStrategy during indexing, and if so, what are your timings like? - a.rod
@a.rod, add your update as an answer so this question can be marked as answered. - sisve

1 Answer

UPDATE: I changed the SpatialStrategy to PointVectorStrategy, and my indexing times are back down to 11 minutes for about 300,000 documents. The key was caching the indexable fields created from the shapes and reusing them when adding to the Documents. PointVectorStrategy allows this because it creates NumericFields for indexing. This is not possible with RecursivePrefixTreeStrategy because it creates Fields with TokenStreams for indexing, and in Lucene.Net 3.0.3 TokenStreams are not reusable for indexing. Thanks to everyone for helping with this.
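
A minimal sketch of the caching idea described above, for illustration only and not the code actually used. CoordinateFieldCache and its factory delegate are hypothetical names, not part of Lucene.Net. In a real indexer the factory would presumably wrap Strategy.CreateIndexableFields(shape) for a PointVectorStrategy; here it returns plain strings so the example runs without Lucene.Net. The key is built with the invariant culture so it does not depend on the machine's decimal separator, which is an added assumption rather than something taken from the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not a Lucene.Net type): caches whatever the factory
// builds for a coordinate pair, so repeated coordinates skip the expensive call.
public sealed class CoordinateFieldCache<TField>
{
    private readonly Dictionary<string, TField[]> _cache =
        new Dictionary<string, TField[]>();
    private readonly Func<double, double, TField[]> _createFields;

    public CoordinateFieldCache(Func<double, double, TField[]> createFields)
    {
        if (createFields == null) throw new ArgumentNullException("createFields");
        _createFields = createFields;
    }

    public TField[] GetOrCreate(double lat, double lng)
    {
        // "R" round-trips the double; the invariant culture keeps the key
        // independent of the current culture's decimal separator.
        string key = lat.ToString("R", CultureInfo.InvariantCulture) + "," +
                     lng.ToString("R", CultureInfo.InvariantCulture);

        TField[] fields;
        if (!_cache.TryGetValue(key, out fields))
        {
            fields = _createFields(lat, lng); // expensive call runs once per key
            _cache[key] = fields;
        }
        return fields;
    }
}

public static class Program
{
    public static void Main()
    {
        int factoryCalls = 0;

        // Stand-in factory: strings instead of Lucene fields (assumption).
        var cache = new CoordinateFieldCache<string>((lat, lng) =>
        {
            factoryCalls++;
            return new[] { "x", "y" };
        });

        cache.GetOrCreate(40.7, -74.0);
        cache.GetOrCreate(40.7, -74.0); // same coordinates: served from the cache
        cache.GetOrCreate(51.5, -0.1);

        // The factory ran once per distinct coordinate pair.
        Console.WriteLine(factoryCalls);
    }
}
```

The sketch is not thread-safe; an indexer that adds documents from several threads would need a lock around GetOrCreate or a concurrent dictionary. Whether a cached field instance can safely be added to more than one Document depends on the field type, as noted above for NumericFields versus TokenStream-backed Fields.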