
I'm trying to use the direct Java port of libsvm for text classification.

I'm currently getting poor accuracy and am following this guide to tune the model. At this stage I have two questions:

  1. I do scaling by taking termcount / totalterms_in_doc (i.e., the relative term frequency). Is this adequate?
  2. When running the svm.svm_cross_validation / svm.svm_train methods I get the output below. What does it mean, and how do I infer gamma and C from it?

    optimization finished, #iter = 1504
    nu = 0.5800464037122964
    obj = -299.9624358558652, rho = -0.9799716681242028
    nSV = 3000, nBSV = 3000
    Total nSV = 3000

Versions: libsvm 3.2, Java 1.7.


1 Answer


I assume you are approaching text classification as described in [1].

To obtain your feature values (simple scaling, no scaling, whatever you want), you have various possibilities [2].

The TFC method works quite well [1]; for details see [2] (a Java sketch follows after the definitions):

    w = TermFrequency * log(N/n) / sqrt(sum((TermFrequency_i * log(N/n_i))^2))

with

    sum => taken over every term i of the document vector (terms that do not occur in the document contribute 0);
    N   => total number of documents in the collection;
    n   => number of documents in which the term occurs
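
A minimal Java sketch of this weighting, assuming you already have per-document term counts and per-term document frequencies; the names Tfc, tfcWeights, docTf and docFreq are mine, not part of any library:

    import java.util.HashMap;
    import java.util.Map;

    public class Tfc {
        // TFC weights for one document.
        //   docTf:   term -> raw count in this document
        //   docFreq: term -> number of documents containing the term
        //   N:       total number of documents in the collection
        static Map<String, Double> tfcWeights(Map<String, Integer> docTf,
                                              Map<String, Integer> docFreq,
                                              int N) {
            Map<String, Double> w = new HashMap<>();
            double norm = 0.0;
            for (Map.Entry<String, Integer> e : docTf.entrySet()) {
                double tfidf = e.getValue()
                        * Math.log((double) N / docFreq.get(e.getKey()));
                w.put(e.getKey(), tfidf);
                norm += tfidf * tfidf;
            }
            norm = Math.sqrt(norm);     // cosine-normalization denominator
            if (norm == 0.0) return w;  // all idf factors were zero
            for (Map.Entry<String, Double> e : w.entrySet())
                e.setValue(e.getValue() / norm);  // weights now in [0, 1]
            return w;
        }
    }

Because of the cosine normalization the resulting weights already lie in [0, 1], so no separate scaling step is needed before building the svm_node vectors.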

The output originates from the training process: #iter is the number of optimizer iterations, obj the optimal objective value of the dual SVM problem, rho the bias term of the decision function, and nSV/nBSV the number of support vectors and bounded support vectors. None of these values tells you which gamma and C to use. For training (assuming you are using the RBF kernel), gamma and C stay at their default values (C = 1, gamma = 1/num_features) if you do not provide them, either as command-line arguments or by setting them on the svm_parameter object in the Java API; the same holds for cross-validation.
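
For illustration, a minimal sketch of setting both parameters explicitly through the Java API; svm_parameter, svm_problem, svm_model and svm.svm_train are the real LIBSVM classes, while the wrapper class and method names are mine:

    import libsvm.*;

    public class TrainRbf {
        // Train a C-SVC with an RBF kernel, with C and gamma set explicitly
        // instead of relying on the defaults (C = 1, gamma = 1/num_features).
        static svm_model train(svm_problem prob, double C, double gamma) {
            svm_parameter param = new svm_parameter();
            param.svm_type    = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.C           = C;
            param.gamma       = gamma;
            param.cache_size  = 100;   // kernel cache size in MB
            param.eps         = 1e-3;  // stopping tolerance
            return svm.svm_train(prob, param);
        }
    }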

To obtain the "best" gamma and C for your classification problem, you have to run a grid search (described in the linked guide). Afaik there is no built-in grid-search functionality in the LIBSVM Java API, so you have to loop over the parameter grid yourself; a sketch follows below.
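
A minimal sketch of such a grid search, assuming prob is an svm_problem already filled with your scaled training data; the exponentially growing ranges for C and gamma follow the guide's suggestion (C = 2^-5 ... 2^15, gamma = 2^-15 ... 2^3):

    import libsvm.*;

    public class GridSearch {

        // 5-fold cross-validation accuracy (in %) for one (C, gamma) pair.
        static double cvAccuracy(svm_problem prob, svm_parameter param) {
            double[] predicted = new double[prob.l];
            svm.svm_cross_validation(prob, param, 5, predicted);
            int correct = 0;
            for (int i = 0; i < prob.l; i++)
                if (predicted[i] == prob.y[i]) correct++;
            return 100.0 * correct / prob.l;
        }

        // Exhaustive search over C = 2^-5 .. 2^15, gamma = 2^-15 .. 2^3.
        static void search(svm_problem prob) {
            svm_parameter param = new svm_parameter();
            param.svm_type    = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.cache_size  = 100;   // kernel cache in MB
            param.eps         = 1e-3;  // stopping tolerance

            double bestAcc = -1, bestC = 0, bestGamma = 0;
            for (int logC = -5; logC <= 15; logC += 2) {
                for (int logGamma = -15; logGamma <= 3; logGamma += 2) {
                    param.C     = Math.pow(2, logC);
                    param.gamma = Math.pow(2, logGamma);
                    double acc = cvAccuracy(prob, param);
                    if (acc > bestAcc) {
                        bestAcc   = acc;
                        bestC     = param.C;
                        bestGamma = param.gamma;
                    }
                }
            }
            System.out.printf("best C = %g, gamma = %g, CV accuracy = %.2f%%%n",
                              bestC, bestGamma, bestAcc);
        }
    }

Afterwards, retrain on the full training set with the best (C, gamma) pair.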

[1] T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Springer, Heidelberg, 1998. doi:10.1007/BFb0026683

[2] G. Salton and C. Buckley: Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.