4
votes

I want to represent a set of labelled instances (data) in a file to be fed into LibSVM as training data, for the problem mentioned in this question. Each instance will include:

  1. Login date
  2. Login time
  3. Location (country code?)
  4. Day of the week
  5. Authenticity (0 - Non Authentic, 1 - Authentic) - The Label

How can I format this data to be input to the SVM?

2
Location and IP address overlap, so you might want to pick only one of them (experiment to find out which is best). - Fred Foo
You don't need to remove it from the question :) - Fred Foo

2 Answers

4
votes

Are you asking about the data format or how to convert the data? For the latter you're going to have to experiment to find the right way to do this. The general idea is to convert your data into a nominal or ordinal value attribute. Some of these are simple - #4, #6 - some of these are going to be tough - #1-#3.

For example, you could represent #1 as three attributes (day, month and year), or as a single attribute by converting it to a UNIX-style timestamp.
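As a rough sketch of both options, the snippet below turns a date string into day/month/year attributes plus a UNIX timestamp. The `date_features` name, the `YYYY-MM-DD` input format and the choice of UTC are assumptions made for illustration; adapt them to however your login dates are actually stored.

```python
from datetime import datetime, timezone

def date_features(date_str):
    """Return (day, month, year, unix_timestamp) for a 'YYYY-MM-DD' string.

    The input format and the use of UTC are assumptions for this sketch.
    """
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return dt.day, dt.month, dt.year, int(dt.timestamp())

day, month, year, timestamp = date_features("2012-03-15")
print(day, month, year, timestamp)
```

Which representation works better is something you would have to find out by experiment, as said above.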

The IP is even harder - there's no straightforward way to convert that into a meaningful ordinal value. Using every IP as a nominal attribute might not be useful depending on your problem.

Once you figure this out, convert your data and check the LibSVM docs. The general format is <label> followed by <index>:<value> pairs, i.e., +1 1:0 2:0 .. etc
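To make the file format concrete, here is a minimal sketch that writes one instance as a LibSVM line. The helper name `to_libsvm_line` and the feature order in the example are hypothetical; the only things taken from the LibSVM format itself are the leading label, the 1-based ascending indices, and the fact that zero-valued features may be left out.

```python
def to_libsvm_line(label, features):
    """Format one instance as '<label> <index>:<value> ...'.

    Indices are 1-based and ascending. Zero-valued features are
    skipped, which the sparse format allows.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)

# Hypothetical feature order: day, month, year, an unused slot, day of week.
line = to_libsvm_line(1, [15, 3, 2012, 0, 4])
print(line)
```

Writing one such line per instance to a text file gives you something you can pass to the training tool.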

1
votes

I believe there is an unstated assumption in the previous answers. The unstated assumption is that users of libSVM know that they should avoid putting categorical data into the classifier.

For example, libSVM will not know what to do with country codes: it treats every feature as a number, so codes that are numerically close look similar to it. If you are trying to predict which visitors are most likely to buy something on your site then you could have problems if USA is between Chad and Niger in your country code list. The bulge from USA will likely skew predictions for the countries located near it.

To fix this I would create one category for each country under consideration (and perhaps an 'other' category). Then for each instance you want to classify, I would set all the country categories to zero except the one to which the instance belongs. (With the libSVM sparse file format this isn't really a big deal, since zero-valued features can simply be left out.)
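A small sketch of that idea, assuming a made-up list of three countries plus an 'other' slot. The names `COUNTRIES`, `one_hot_country` and `libsvm_line`, and the feature layout, are all hypothetical.

```python
# Hypothetical country list; any code not listed falls into the 'other' slot.
COUNTRIES = ["US", "TD", "NE"]

def one_hot_country(code, first_index):
    """Return {feature_index: 1} for the single indicator that is set."""
    try:
        offset = COUNTRIES.index(code)
    except ValueError:
        offset = len(COUNTRIES)  # the 'other' category
    return {first_index + offset: 1}

def libsvm_line(label, sparse):
    """Format a {index: value} dict as a LibSVM line with ascending indices."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(sparse.items()))
    return f"{label} {pairs}"

# Feature 1 is day of the week (assumed); features 2-5 are country indicators.
example = libsvm_line(1, {1: 4, **one_hot_country("TD", first_index=2)})
print(example)
```

Only the one indicator that is set appears in the line; the other country features are zero and never get written out.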