4
votes

I'm reading about 7 million lines of data and it takes close to two minutes to load everything up when I restart my application. I'm trying to figure out the best way to speed things up so that it only takes a few seconds at most to restart the application. Here is the code I'm looking to speed up and the amount of time it currently takes:

// Creating data model - this takes about 1.77 minutes
DataModel datamodel = new FileDataModel(new File("datafile"));

// ItemSimilarity object - this takes about 1 millisecond
ItemSimilarity similarity = new TanimotoCoefficientSimilarity(datamodel);

// Recommender - this takes about 3 milliseconds
ItemBasedRecommender irecommender = new GenericBooleanPrefItemBasedRecommender(datamodel, similarity);

// List of Recommendations - this takes about 365 milliseconds
List<RecommendedItem> irecommendations = irecommender.mostSimilarItems(item, amount);

I'm wondering if:

  • there is a way to output the DataModel to another file so that I could just read it in instead of having to create it every time?
  • If that is possible then is it possible to output the data from the ItemSimilarity to another file and just read that in instead of creating the DataModel and calculating the ItemSimilarity every time?
1 Answer

2
votes

Your first question

there is a way to output the DataModel to another file so that I could just read it in instead of having to create it every time?

Yes, you could serialize it. But be aware of the potential issues with serialization (see http://www.ibm.com/developerworks/library/j-5things1/). You might find that there are some speed improvements, but they might not be as dramatic as you think.
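As a rough illustration of the serialization route, here is a minimal sketch using only standard Java serialization. SerializedCache and its save/load methods are hypothetical names, not part of Mahout, and the sketch assumes the object you cache implements java.io.Serializable (check this for the DataModel class in your Mahout version).

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical helper (not part of Mahout): caches any Serializable object on disk.
final class SerializedCache {
    private SerializedCache() {}

    // Write the whole object graph to cacheFile using standard Java serialization.
    static void save(Serializable value, File cacheFile) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(cacheFile.toPath())))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the object back; the caller supplies the type it expects to find.
    static <T> T load(File cacheFile, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(cacheFile.toPath())))) {
            return type.cast(in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Cached class is not on the classpath", e);
        }
    }
}
```

On startup you would check whether the cache file exists, call load if it does, and otherwise build the model from the original data file and call save. Measure both paths: deserializing 7 million entries still reads from disk, so the gain may be modest.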

Another option is to create a database to store the data you want to load in. Once you have stored the data you can simply load it into memory when your application starts. With this approach, startup is still the slow part (your code reads the data from the database and stores it in memory), but every operation against the data afterwards is quick because it is in memory.

You could use an in-memory db such as HSQLDB, a relational db or an object db, whichever you are happy with. I'd probably look at an object db: you could load the objects straight into memory, which might be faster than using a relational db, where you would have to map each individual field into the object on creation.

Your second question

If that is possible then is it possible to output the data from the ItemSimilarity to another file and just read that in instead of creating the DataModel and calculating the ItemSimilarity every time?

You could serialize this too. Again, you need to be careful when using serialization to persist data. It is also worth considering persisting this to a database and then loading it into memory when your programme first starts.
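If you would rather avoid Java serialization for the similarities, one alternative is to write the precomputed item-item values to a plain text file and read them back at startup. The sketch below is an assumption-laden illustration: SimilarityFile is a hypothetical class, and the "itemID1,itemID2,similarity" line layout is my own choice. Mahout may offer its own file-backed similarity classes (for example FileItemSimilarity), so check the Javadoc for your version before rolling your own.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Mahout): stores item-item similarities as CSV.
final class SimilarityFile {
    private SimilarityFile() {}

    // Write one "itemID1,itemID2,similarity" line per stored pair.
    static void write(Map<Long, Map<Long, Double>> sims, File file) {
        try (BufferedWriter w = Files.newBufferedWriter(file.toPath(), StandardCharsets.UTF_8)) {
            for (Map.Entry<Long, Map<Long, Double>> row : sims.entrySet()) {
                for (Map.Entry<Long, Double> cell : row.getValue().entrySet()) {
                    w.write(row.getKey() + "," + cell.getKey() + "," + cell.getValue());
                    w.newLine();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild the nested map from the file, skipping empty lines.
    static Map<Long, Map<Long, Double>> read(File file) {
        Map<Long, Map<Long, Double>> sims = new HashMap<>();
        try (BufferedReader r = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] parts = line.split(",");
                sims.computeIfAbsent(Long.parseLong(parts[0]), k -> new HashMap<>())
                    .put(Long.parseLong(parts[1]), Double.parseDouble(parts[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sims;
    }
}
```

A text file like this is easy to inspect and survives class changes that would break a serialized object, at the cost of parsing time on every start.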

I can't give the best option without knowing more about your programme. Is it for production, research, or just messing about? Try serialization first, but note that any interaction with disk, whether through a db or serialization, is slow.

Hope that helps.