2 votes

I have a web application where users (mainly English learners and children) can search existing licensed articles in my database. The articles can be filtered by category, tag, and difficulty.

So I am thinking of adding articles from Wikipedia to the database and updating them once in a while, but I am not sure what the best way to do that is. My understanding is that I would need to download the compressed dump files each time and decompress them to get the articles in XML format; then I could add them to the database according to their tags. Is there a way to have this update automatically? I read the article on database downloads (linked below) but I'm not sure how to get started.

http://en.wikipedia.org/wiki/Wikipedia:Database_download#SQL_schema
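For example, here is a rough sketch of the workflow I have in mind in Python; the dump URL below is my assumption of the standard "latest pages-articles" file, and the database insert is left as a placeholder:

    # Sketch: download a Wikipedia dump, stream-decompress it, and walk the pages.
    # Assumes Python 3.8+ (for the {*} namespace wildcard in ElementTree paths).
    import bz2
    import urllib.request
    import xml.etree.ElementTree as ET

    DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
    DUMP_FILE = "enwiki-latest-pages-articles.xml.bz2"

    # 1. Download the compressed dump (tens of GB for full English Wikipedia).
    urllib.request.urlretrieve(DUMP_URL, DUMP_FILE)

    # 2. Stream-decompress and parse the XML without loading it all into memory.
    with bz2.open(DUMP_FILE, "rb") as f:
        for _, elem in ET.iterparse(f):
            # Dump elements are namespaced; match on the local tag name.
            if elem.tag.endswith("page"):
                title = elem.findtext("{*}title")
                text = elem.findtext("{*}revision/{*}text") or ""
                # ... insert (title, text) into your own database here ...
                elem.clear()  # release this page's subtree to keep memory low

Automating the "once in a while" update would then just be a matter of running this on a schedule (e.g. a cron job) against the latest dump.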

You will need a user account first, just in case you didn't know ;). Secondly, you will need access to their API web service, at which point you will need to handle the XML result the call returns. - GoldBishop
@GoldBishop You can use Wikipedia's API even without an account, just in case you didn't know. And the dumps Ruby mentioned are not related to the API in any way. - svick
@svick Without an account, don't you have to get an authorization cookie? With an account you just pass in your unique account ID with another authentication string and you can do it all from your desktop. - GoldBishop
@GoldBishop I'm not completely sure what you're talking about, but no, you don't have to do anything special if you don't have an account. And I have no idea how it is related to your desktop or what “another authentication string” is (it certainly doesn't have anything to do with the Wikipedia API). - svick
@svick Just wondering, because I had to post authentication strings on other MediaWiki installations and assumed the same was true for Wikipedia; my misinformation. - GoldBishop
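To illustrate the point made in the comments above: a plain, unauthenticated read request to the MediaWiki API works without an account or cookie. A minimal sketch (the `extracts` property and parameters below are standard on Wikipedia, but treat the exact fields as an assumption for your use case):

    # Sketch: fetch a plain-text extract of one article via the MediaWiki API,
    # with no authentication at all.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": "Apache Solr",
        "format": "json",
    })
    url = "https://en.wikipedia.org/w/api.php?" + params

    # Wikipedia asks clients to send a descriptive User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "my-reader-app/0.1 (contact@example.com)"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    for page in data["query"]["pages"].values():
        print(page["title"])
        print(page.get("extract", "")[:500])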

1 Answer

-2 votes

Perhaps it would be better to simply crawl and index Wikipedia. You can then store a search index of the pages you care about in a system such as Apache Solr. If you do that, be sure to be polite about the rate of your requests.
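For instance, pushing one crawled page into Solr's JSON update endpoint might look roughly like this; the core name "wikipedia" and the field names are placeholders for whatever schema you define, and a local Solr instance on the default port is assumed:

    # Sketch: index a single crawled page into a local Solr core via the
    # JSON update handler.
    import json
    import urllib.request

    doc = {
        "id": "https://en.wikipedia.org/wiki/Apache_Solr",
        "title": "Apache Solr",
        "body": "Solr is an open-source search platform ...",
        "tags": ["software", "search"],
    }

    req = urllib.request.Request(
        "http://localhost:8983/solr/wikipedia/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)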

That avoids storing the articles yourself and requires no effort to keep the content updated. Only the links need to be updated (probably much less frequently).

If you don't wish to filter what people find, you could probably just sign up for Google's search API and save yourself the crawler time and effort.
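A minimal sketch of that route, using Google's Custom Search JSON API; the API key and search-engine ID are placeholders you get when you register a Programmable Search Engine:

    # Sketch: query Google's Custom Search JSON API instead of running a crawler.
    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"          # placeholder
    ENGINE_ID = "YOUR_ENGINE_ID"      # placeholder

    params = urllib.parse.urlencode({
        "key": API_KEY,
        "cx": ENGINE_ID,
        "q": "photosynthesis site:en.wikipedia.org",
    })
    url = "https://www.googleapis.com/customsearch/v1?" + params

    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)

    for item in results.get("items", []):
        print(item["title"], "-", item["link"])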