weka - normalize nominal values

Question

I have this data set:

Instance num 0 : 300,24,'Social worker','Computer sciences',Music,10,5,5,1,5,''
Instance num 1 : 1000,20,Student,'Computer engineering',Education,10,5,5,5,5,Sony
Instance num 2 : 450,28,'Computer support specialist',Business,Programming,10,4,1,0,4,Lenovo
Instance num 3 : 1000,20,Student,'Computer engineering','3d Design',1,1,2,1,3,Toshiba
Instance num 4 : 1000,20,Student,'Computer engineering',Programming,2,5,1,5,4,Dell
Instance num 5 : 800,16,Student,'Computer sciences',Education,8,4,3,4,4,Toshiba

I want to classify it using SMO and other multi-class classifiers, so I converted all the nominal values to numeric ones using this code:

int[] indices = {2, 3, 4, 10};  // indices of nominal columns
for (int i = 0; i < indices.length; i++) {
  int attInd = indices[i];
  Attribute att = data.attribute(attInd);
  for (int n = 0; n < att.numValues(); n++) {
    data.renameAttributeValue(att, att.value(n), "" + n);
  }
}

and the result is:

Instance num 0 : 300,24,0,0,0,10,5,5,1,5,0
Instance num 1 : 1000,20,1,1,1,10,5,5,5,5,1
Instance num 2 : 450,28,2,2,2,10,4,1,0,4,2
Instance num 3 : 1000,20,1,1,3,1,1,2,1,3,3
Instance num 4 : 1000,20,1,1,2,2,5,1,5,4,4
Instance num 5 : 800,16,1,0,1,8,4,3,4,4,3

After applying the "Normalize" filter, the result looks like this:

Instance num 0 : 0,0.666667,0,0,0,1,1,1,0.2,1,0
Instance num 1 : 1,0.333333,1,1,1,1,1,1,1,1,1
Instance num 2 : 0.214286,1,2,2,2,1,0.75,0,0,0.5,2
Instance num 3 : 1,0.333333,1,1,3,0,0,0.25,0.2,0,3
Instance num 4 : 1,0.333333,1,1,2,0.111111,1,0,1,0.5,4
Instance num 5 : 0.714286,0,1,0,1,0.777778,0.75,0.5,0.8,0.5,3

The problem is that the converted columns are still strings, so the "Normalize" filter does not normalize them.

Any ideas?

And my second question: what should I use as a multi-class classifier besides SMO?

What is the desired result? What do you want to happen to the strings? As for an alternative multiclass classifier, knn or a J48 decision tree would be a good baseline. — Junuxx
The desired result is the brand of laptop (the last column in the data set), but the problem is the SMO evaluation result: Correctly Classified Instances 30.6452 %. I tried several classifiers and the proportion of correctly classified instances is still under 35%. What should I do? — Farok Ojil
How many instances do you have? Getting more data could improve the results. And how many classes/brands do you have? If you have 200 classes, 35% isn't bad at all. — Junuxx
I have 200 instances and 12 classes. How many instances do I need? — Farok Ojil
Hard to say, depends on the 'difficulty' of the dataset. But you have roughly 16 instances per class, and 10 attributes per instance. That is clearly not enough to cover the entire instance space. You should also ask yourself what the goal is, what would be a reasonable result. Common sense indicates that things like occupation, age and budget aren't enough to reliably determine someone's brand preference, but if you reach 40-50% that would be a pretty good educated guess. Probably better than what a human expert would achieve on this task. — Junuxx

smci · Accepted Answer · 2015-09-25T00:07:15

Don't convert nominals/categoricals into floats (or integers) and then normalize them. It's meaningless. Garbage In, Garbage Out. Treating them as continuous numbers or numeric vectors gives nonsense results like "the average of 'Engineering' + 'Nursing' = 'Architecture'".

The right way to treat nominals/categoricals is to convert each one into dummy variables (also known as 'dummy coding' or 'dichotomizing'). Say the Occupation column (or Major, or Elective, or whatever) has K levels: you then create either K or (K-1) binary variables, which are 0 everywhere except for a 1 in the column corresponding to that row's level. Look up the Weka documentation to find the right function call.
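To make the transformation concrete, here is a minimal sketch of dummy coding in plain Java, using the K-column variant. The class and method names (DummyCoding, levels, encode) are made up for illustration and are not Weka API. In Weka itself, the NominalToBinary filter (weka.filters.unsupervised.attribute.NominalToBinary) should do the equivalent on an Instances object, but check the documentation for the exact options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka: dummy-codes one nominal column.
class DummyCoding {

    // Distinct levels of the column, in order of first appearance.
    static List<String> levels(List<String> column) {
        List<String> levels = new ArrayList<>();
        for (String value : column) {
            if (!levels.contains(value)) {
                levels.add(value);
            }
        }
        return levels;
    }

    // One row per instance, one 0/1 column per level (K columns in total).
    static int[][] encode(List<String> column) {
        List<String> levels = levels(column);
        int[][] coded = new int[column.size()][levels.size()];
        for (int row = 0; row < column.size(); row++) {
            // Set a single 1 in the column that matches this row's level.
            coded[row][levels.indexOf(column.get(row))] = 1;
        }
        return coded;
    }

    public static void main(String[] args) {
        // Occupation column from the six instances in the question.
        List<String> occupation = Arrays.asList(
            "Social worker", "Student", "Computer support specialist",
            "Student", "Student", "Student");
        System.out.println(levels(occupation));
        for (int[] row : encode(occupation)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each of the resulting columns is already 0 or 1, so no further normalization is needed for them, and no artificial ordering between occupations is introduced. Dropping any one of the K columns gives the (K-1) variant.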

cf. e.g. SO: Dummy Coding of Nominal Attributes (for Logistic Regression)
