
I'm trying to use the direct Java port of libsvm for text classification.

I'm currently getting poor accuracy and am following this guide to tune the model. At this stage I have two questions:

  1. I do scaling by taking termcount / totalterms_in_doc (i.e., the relative term frequency). Is this adequate?
  2. When running the svm.svm_cross_validation / svm.svm_train methods I get the output below. What does it mean, and how do I infer gamma and C from it?

    optimization finished, #iter = 1504
    nu = 0.5800464037122964
    obj = -299.9624358558652, rho = -0.9799716681242028
    nSV = 3000, nBSV = 3000
    Total nSV = 3000

Versions: libsvm 3.2, Java 1.7.


1 Answer


I assume you are approaching text classification as described in [1].

To obtain your feature values (simple scaling, no scaling, whatever you want), you have various possibilities [2].

The TFC method works quite well [1]; for details see [2] (a Java sketch follows after the definitions):

    w = TermFrequency * log(N/n) / sqrt(sum((TermFrequency_i * log(N/n_i))^2))

with

    sum => taken over every term i of the document vector (terms that do not occur in the document contribute 0);
    N   => total number of documents in the collection;
    n   => number of documents in which the term occurs
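
A minimal Java sketch of this weighting, assuming you already have per-document term counts and per-term document frequencies; the names Tfc, tfcWeights, docTf and docFreq are mine, not part of any library:

    import java.util.HashMap;
    import java.util.Map;

    public class Tfc {
        // TFC weights for one document.
        //   docTf:   term -> raw count in this document
        //   docFreq: term -> number of documents containing the term
        //   N:       total number of documents in the collection
        static Map<String, Double> tfcWeights(Map<String, Integer> docTf,
                                              Map<String, Integer> docFreq,
                                              int N) {
            Map<String, Double> w = new HashMap<>();
            double norm = 0.0;
            for (Map.Entry<String, Integer> e : docTf.entrySet()) {
                double tfidf = e.getValue()
                        * Math.log((double) N / docFreq.get(e.getKey()));
                w.put(e.getKey(), tfidf);
                norm += tfidf * tfidf;
            }
            norm = Math.sqrt(norm);     // cosine-normalization denominator
            if (norm == 0.0) return w;  // all idf factors were zero
            for (Map.Entry<String, Double> e : w.entrySet())
                e.setValue(e.getValue() / norm);  // weights now in [0, 1]
            return w;
        }
    }

Because of the cosine normalization the resulting weights already lie in [0, 1], so no separate scaling step is needed before building the svm_node vectors.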

The output originates from the training process: #iter is the number of optimizer iterations, obj the optimal objective value of the dual SVM problem, rho the bias term of the decision function, and nSV/nBSV the number of support vectors and bounded support vectors. None of these values tells you which gamma and C to use. For training (assuming you are using the RBF kernel), gamma and C stay at their default values (C = 1, gamma = 1/num_features) if you do not provide them, either as command-line arguments or by setting them on the svm_parameter object in the Java API; the same holds for cross-validation.
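
For illustration, a minimal sketch of setting both parameters explicitly through the Java API; svm_parameter, svm_problem, svm_model and svm.svm_train are the real LIBSVM classes, while the wrapper class and method names are mine:

    import libsvm.*;

    public class TrainRbf {
        // Train a C-SVC with an RBF kernel, with C and gamma set explicitly
        // instead of relying on the defaults (C = 1, gamma = 1/num_features).
        static svm_model train(svm_problem prob, double C, double gamma) {
            svm_parameter param = new svm_parameter();
            param.svm_type    = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.C           = C;
            param.gamma       = gamma;
            param.cache_size  = 100;   // kernel cache size in MB
            param.eps         = 1e-3;  // stopping tolerance
            return svm.svm_train(prob, param);
        }
    }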

To obtain the "best" gamma and C for your classification problem, you have to run a grid search (described in the linked guide). Afaik there is no built-in grid-search functionality in the LIBSVM Java API, so you have to loop over the parameter grid yourself; a sketch follows below.
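
A minimal sketch of such a grid search, assuming prob is an svm_problem already filled with your scaled training data; the exponentially growing ranges for C and gamma follow the guide's suggestion (C = 2^-5 ... 2^15, gamma = 2^-15 ... 2^3):

    import libsvm.*;

    public class GridSearch {

        // 5-fold cross-validation accuracy (in %) for one (C, gamma) pair.
        static double cvAccuracy(svm_problem prob, svm_parameter param) {
            double[] predicted = new double[prob.l];
            svm.svm_cross_validation(prob, param, 5, predicted);
            int correct = 0;
            for (int i = 0; i < prob.l; i++)
                if (predicted[i] == prob.y[i]) correct++;
            return 100.0 * correct / prob.l;
        }

        // Exhaustive search over C = 2^-5 .. 2^15, gamma = 2^-15 .. 2^3.
        static void search(svm_problem prob) {
            svm_parameter param = new svm_parameter();
            param.svm_type    = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.cache_size  = 100;   // kernel cache in MB
            param.eps         = 1e-3;  // stopping tolerance

            double bestAcc = -1, bestC = 0, bestGamma = 0;
            for (int logC = -5; logC <= 15; logC += 2) {
                for (int logGamma = -15; logGamma <= 3; logGamma += 2) {
                    param.C     = Math.pow(2, logC);
                    param.gamma = Math.pow(2, logGamma);
                    double acc = cvAccuracy(prob, param);
                    if (acc > bestAcc) {
                        bestAcc   = acc;
                        bestC     = param.C;
                        bestGamma = param.gamma;
                    }
                }
            }
            System.out.printf("best C = %g, gamma = %g, CV accuracy = %.2f%%%n",
                              bestC, bestGamma, bestAcc);
        }
    }

Afterwards, retrain on the full training set with the best (C, gamma) pair.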

[1] T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Springer, Heidelberg, 1998. doi:10.1007/BFb0026683

[2] G. Salton and C. Buckley: Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.