0 votes

What is the best way to index Wikipedia articles (which contain geo locations, lon/lat) in a Solr server?

E.g., I have a given lon/lat position and want to index all Wikipedia articles within a distance of 60 kilometers of it.

I could download the whole Wikipedia dump and write an application that fetches all data in the XML within the given distance of the point. But the dump is about 40 GB, and this could take a long time. I also have the following condition: I want to keep the data up to date (it should be updated every 48 hours). Is there a partial wiki dump available (e.g. per country), or an API/application I could use for this case?
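For the "everything within 60 km" part, Solr's spatial support can express the radius filter directly, assuming the articles are indexed with a spatial field. The sketch below only builds the query parameters; the field name `location` and the coordinates are assumptions, not something from the question.

```python
# Sketch: build Solr query parameters for a "within 60 km" search.
# Assumes articles were indexed with a spatial field (e.g. a
# LatLonPointSpatialField) named "location" -- that name is hypothetical.

def build_geofilt_params(lat, lon, km, sfield="location"):
    """Return Solr query params using the {!geofilt} filter parser."""
    return {
        "q": "*:*",
        # geofilt keeps only documents within km kilometers of pt
        "fq": f"{{!geofilt sfield={sfield} pt={lat},{lon} d={km}}}",
        # sort nearest-first by great-circle distance
        "sort": f"geodist({sfield},{lat},{lon}) asc",
        "wt": "json",
    }

params = build_geofilt_params(52.52, 13.405, 60)
print(params["fq"])  # {!geofilt sfield=location pt=52.52,13.405 d=60}
```

The same filter string can be sent from Java via SolrJ's `SolrQuery.addFilterQuery`, so this does not force a particular client language.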

So, what's your problem? What have you already done? - Mysterion
@Mysterion I want to implement a service like this: en.wikipedia.org/wiki/Special:Nearby with Wikipedia data on my local Solr server. But I only need data from a small region, e.g. 100 kilometers from a given point (lon/lat). So I don't want to download the whole dump from Wikipedia and dig through the 40 GB XML file to index a few articles around a location (which would mean understanding the whole syntax/structure of the file and where to find an article's location). Is there no better way? A partial dump to download (as with OSM for countries), or an API to call from Java? - Marian Lux

1 Answer

1 vote

Special:Nearby, which you mentioned in the comments, used to be powered by Solr, but it now uses Elasticsearch. The extension that provides geospatial search, GeoData, also supports MySQL-based searches, which are more practical for small datasets. If you're interested specifically in Solr, you can look at how it was done before I killed it, because Elasticsearch is ohhh so much nicer.
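GeoData also exposes its search through the public MediaWiki API as `list=geosearch`, which avoids downloading the dump at all: nearby titles and coordinates can be fetched on demand and re-indexed locally. A minimal sketch of building such a request (note that the radius parameter is given in meters and is capped server-side at 10 km on Wikipedia, so a 60 km area would need to be tiled with several overlapping queries; the example coordinates are arbitrary):

```python
# Sketch: build a Wikipedia GeoData API (list=geosearch) request URL
# for articles around a point. Parameter names are the MediaWiki API's;
# gsradius is in meters and capped server-side (10 km on Wikipedia).
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def geosearch_url(lat, lon, radius_m=10000, limit=50):
    """Return a geosearch request URL for articles near (lat, lon)."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",
        "gsradius": radius_m,
        "gslimit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

url = geosearch_url(52.52, 13.405)
# Uncomment to fetch live results:
# with urllib.request.urlopen(url) as resp:
#     for page in json.load(resp)["query"]["geosearch"]:
#         print(page["title"], page["lat"], page["lon"])
```

Polling this API on a schedule would also satisfy the 48-hour freshness requirement without re-processing a dump.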