Scikit SVM: create training dataset

Question

I'm using this site http://scikit-learn.org/stable/datasets/ (section 5.5) to create a custom dataset for running an SVM with scikit-learn. Summary of my day: I basically have no idea what I'm doing.

For my thesis I want to predict stock return direction, i.e. the output of SVM should be 1 (UP) or -1 (DOWN). At the moment I'm trying to figure out SVM with a random sample (because I do get how the tutorials work).

Since the website says that each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value>, I assumed that the training set I provide should use the same formatting. So I created the following training sample in Notepad++:

<1> <1>:<0>, <1>:<19260800>, <1>:<77.83>
<1> <2>:<-1>, <2>:<20110000>, <2>:<75.78>
<-1> <3>:<1>, <3>:<53306400>, <3>:<76.24>
<1> <4>:<0>, <4>:<61293500>, <4>:<78.00>
<-1> <5>:<-1>, <5>:<42649500>, <5>:<75.91>

For example, the second line:

<1> means that the stock went up since the day before, <2> is the ID of the data on the second line, <-1> is a negative Twitter sentiment for that day for the specific firm, <20110000> is the stock volume for that day, and <75.78> is the adjusted closing price for that day.

I hope you understand what I'm trying to say. And I hope even more somebody can help me out.

Thanks in advance!

unutbu · Accepted Answer · 2015-03-03T18:59:38

Take a look at the related links referenced in the docs:

Public datasets in svmlight / libsvm format: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Faster API-compatible implementation: https://github.com/mblondel/svmlight-loader

If you click on the first link you'll find example data sets such as this one:

-1 3:1 11:1 14:1 19:1 39:1 42:1 55:1 64:1 67:1 73:1 75:1 76:1 80:1 83:1 
-1 3:1 6:1 17:1 27:1 35:1 40:1 57:1 63:1 69:1 73:1 74:1 76:1 81:1 103:1 
-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1 
-1 5:1 6:1 15:1 22:1 36:1 41:1 47:1 66:1 67:1 72:1 74:1 76:1 80:1 83:1 
-1 2:1 6:1 16:1 22:1 36:1 40:1 54:1 63:1 67:1 73:1 75:1 76:1 80:1 83:1 
-1 2:1 6:1 14:1 20:1 37:1 41:1 47:1 64:1 67:1 73:1 74:1 76:1 82:1 83:1

So you don't need the brackets, <>. Each line is a numeric label followed by pairs of numbers, where the two numbers in a pair are joined by a colon and the pairs are separated by spaces. There are no commas between the pairs.
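As a rough sketch of what that could look like for the sample in the question, the snippet below builds the five lines in plain Python. It rests on an assumption that is not stated in the question: the number before each colon is taken to be the feature (column) index, so 1 = sentiment, 2 = volume, 3 = adjusted close, rather than the row number. The helper name `to_svmlight_line` is made up for illustration. Zero-valued features are left out, which the sparse format allows.

```python
# Rows from the question as (label, sentiment, volume, adjusted close).
# Assumption: feature indices are 1 = sentiment, 2 = volume, 3 = adjusted close.
rows = [
    (1, 0, 19260800, 77.83),
    (1, -1, 20110000, 75.78),
    (-1, 1, 53306400, 76.24),
    (1, 0, 61293500, 78.00),
    (-1, -1, 42649500, 75.91),
]


def to_svmlight_line(label, features):
    """Hypothetical helper: format one sample as '<label> <index>:<value> ...'."""
    # Indices start at 1; features equal to zero are skipped (sparse format).
    pairs = [
        f"{index}:{value}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([str(label), *pairs])


lines = [to_svmlight_line(label, features) for label, *features in rows]
text = "\n".join(lines) + "\n"
print(text, end="")
```

Writing `text` to a file gives something `load_svmlight_file` should be able to read. scikit-learn also ships `dump_svmlight_file`, which does the same job from arrays.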

Per the docs, you can then load the data set with

>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
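For a small end-to-end check, here is a sketch that loads the question's five samples and fits a classifier, assuming scikit-learn is installed. The in-memory bytes stand in for the file on disk. The feature numbering (1 = sentiment, 2 = volume, 3 = adjusted close), the scaling step and the default `SVC` settings are assumptions made for illustration; five samples are far too few to say anything about stock direction.

```python
import io

from sklearn.datasets import load_svmlight_file
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The question's samples in svmlight format (assumed feature numbering:
# 1 = sentiment, 2 = volume, 3 = adjusted close; zero values are omitted).
data = b"""1 2:19260800 3:77.83
1 1:-1 2:20110000 3:75.78
-1 1:1 2:53306400 3:76.24
1 2:61293500 3:78.0
-1 1:-1 2:42649500 3:75.91
"""

# A file-like object must be binary; a path string works the same way.
# zero_based=False says the indices in the data start at 1.
X_train, y_train = load_svmlight_file(io.BytesIO(data), zero_based=False)

# X_train is a SciPy sparse matrix; this tiny sample is densified so the
# features can be standardized (volume is on a much larger scale than price).
X_dense = X_train.toarray()

clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_dense, y_train)

print(X_train.shape)              # 5 samples, 3 features
print(y_train)                    # labels read from the first column
print(clf.predict(X_dense[:1]))   # predicted label for the first sample
```

The labels come back as floats, so the predictions are 1.0 or -1.0 rather than integers.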
