Suppose I have a four-class problem whose continuous output values are all in the same range (say, 10-50). If I put the training data for all classes in a single .csv file (for WEKA), with a continuous output for each training sample, how will my regression model tell me which class a given continuous output belongs to? I want to do simultaneous quantification of multi-class input data.
1 Answer
I'm still not completely sure what your problem is, but I assume that you have four datasets, each of which poses a regression problem, and that you want to perform a regression on the combined dataset and expect the model to recover not only the regressed value but also the "source".
- In a naive implementation (just concatenating the data) this is not possible, as you lose all information about the source of the data
- You can augment the output value, so instead of just "Y" (regression value) you also add another value indicating which set it is from, for example:
Set1:
x11 -> y11
Set2:
x21 -> y21
Set3:
x31 -> y31
Set4:
x41 -> y41
you convert it to
Big set:
x11 -> (y11,1)
x21 -> (y21,2)
x31 -> (y31,3)
x41 -> (y41,4)
Or in unary coding
Big set:
x11 -> (y11,1,0,0,0)
x21 -> (y21,0,1,0,0)
x31 -> (y31,0,0,1,0)
x41 -> (y41,0,0,0,1)
This way, once the model is trained, you can retrieve not only the regression value but also the source dataset (which you refer to as the "class").
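The conversion to the unary-coded big set can be sketched in a few lines of Python. This is only an illustration: the function name `merge_with_source` and the toy feature vectors and values are made up here, and each set is assumed to be a list of (features, regression value) pairs.

```python
def merge_with_source(datasets):
    """Concatenate datasets, appending a unary (one-hot) source code to each target."""
    n_sets = len(datasets)
    big_set = []
    for index, samples in enumerate(datasets):
        # One position per source set; only the position of this set is 1.
        one_hot = [1 if j == index else 0 for j in range(n_sets)]
        for x, y in samples:
            big_set.append((x, (y, *one_hot)))
    return big_set


# Hypothetical toy data: one sample per set, outputs inside the 10-50 range.
set1 = [([0.1, 0.2], 12.0)]
set2 = [([0.3, 0.4], 25.0)]
set3 = [([0.5, 0.6], 31.0)]
set4 = [([0.7, 0.8], 47.0)]

big_set = merge_with_source([set1, set2, set3, set4])
for x, target in big_set:
    print(x, "->", target)
```

A model trained on such targets has to predict a vector, so it needs to support multiple outputs; the predicted source is then the position with the largest value among the last four entries.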
And of course the simplest solution would be to build two separate models, one for regression and one for classification. This way you can use models that are designed for each task, whereas using a single model forces a suboptimal choice for at least one of the two parts.
Big set regression:
x11 -> y11
x21 -> y21
x31 -> y31
x41 -> y41
Big set classification:
x11 -> 1
x21 -> 2
x31 -> 3
x41 -> 4
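As a rough sketch of the two-model approach, the snippet below trains a least-squares line for the regression part and a nearest-centroid rule for the classification part, both on the same inputs. The two models are stand-ins chosen only to keep the example dependency-free, the single-feature toy data is invented, and in practice you would pick whatever regressor and classifier suit your data (for example in WEKA).

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def fit_centroids(xs, labels):
    """Nearest-centroid classifier: stores the mean input of every class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(values) for label, values in groups.items()}


def predict_class(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Hypothetical big set: one input feature, a regression value, and a source label.
xs = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2, 7.0, 7.1]
ys = [12.0, 12.4, 16.0, 16.2, 20.0, 20.4, 24.0, 24.2]
labels = [1, 1, 2, 2, 3, 3, 4, 4]

slope, intercept = fit_line(xs, ys)        # big set regression
centroids = fit_centroids(xs, labels)      # big set classification

new_x = 5.3
predicted_value = slope * new_x + intercept
predicted_class = predict_class(centroids, new_x)
print(predicted_value, predicted_class)
```

Each model sees only the target it was built for, so either one can be replaced or tuned without touching the other.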
The question of whether it is beneficial to create such a "big set" instead of working on the small ones is equivalent to answering the question:
Whether the value being regressed has very similar predictors in each of the subproblems
So, if you have a reasonable justification that in each subproblem this value can be derived in a similar way (because it corresponds to some physical phenomenon that is independent of how you split your data), then yes, it is a good approach. Such a step should, however, be taken only when you have a very limited training set. Given enough data, it is better to assume that each subproblem has different characteristics, in order to avoid biasing the model. Finally, the "ultimate" answer is "check both methods and evaluate which behaves better" (using, for example, cross-validation), which will also give you some insight into whether this value is truly independent of the subsets.
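One possible way to run that check is leave-one-out cross-validation, comparing a single model trained on the pooled data against one model per subset. The sketch below uses a plain least-squares line as the model and two invented subsets that deliberately follow different trends; the function names are hypothetical as well.

```python
from statistics import mean


def fit_line(points):
    """Least-squares line through (x, y) points: returns (slope, intercept)."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x, _ in points))
    return slope, y_bar - slope * x_bar


def squared_error(train, held_out):
    """Fit on train, return the squared prediction error on the held-out point."""
    slope, intercept = fit_line(train)
    x, y = held_out
    return (slope * x + intercept - y) ** 2


def compare_pooled_and_separate(subsets):
    """Leave-one-out mean squared error of the pooled and per-subset models."""
    pooled_errors, separate_errors = [], []
    for s, subset in enumerate(subsets):
        others = [p for t, other in enumerate(subsets) if t != s for p in other]
        for i, point in enumerate(subset):
            rest_of_subset = subset[:i] + subset[i + 1:]
            # Pooled model: trained on every other sample from all subsets.
            pooled_errors.append(squared_error(rest_of_subset + others, point))
            # Separate model: trained only on the held-out sample's own subset.
            separate_errors.append(squared_error(rest_of_subset, point))
    return mean(pooled_errors), mean(separate_errors)


# Hypothetical subsets whose outputs share a range but follow different trends.
subset_a = [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0), (4.0, 18.0)]
subset_b = [(1.0, 29.0), (2.0, 28.0), (3.0, 27.0), (4.0, 26.0)]

pooled_mse, separate_mse = compare_pooled_and_separate([subset_a, subset_b])
print("pooled:", pooled_mse, "separate:", separate_mse)
```

On this toy data the per-subset models come out ahead because the subsets really do behave differently; if the pooled error were similar or lower, that would support merging them.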