
Suppose I have a four-class problem whose continuous output values all lie in the same range (say, 10-50). If I put all the class data (the training data) in a single .csv file (for WEKA), with a continuous output for each training sample, how will my regression model tell me which class a predicted continuous output belongs to? I want to quantify multi-class input data simultaneously.

You seem to be using the wrong words to describe your problem. If you have regression, then you do not have classes; classes are elements of classification (a finite number of output values, the class ids). You are just dealing with multi-dimensional regression. - lejlot
I know the difference between classification and regression (I actually work on these problems often), but I have four types of data. I want to perform regression for four classes (types) of data: mix the data sets, train the regression model on the mix, and then predict the continuous output for any of the four classes for a given data set, just as we do in multi-class classification problems. (I know the outputs will be continuous.) I actually want to perform regression on different types of data simultaneously. - sunny
It still seems wrong - how is it different from multidimensional regression, where for each input you have a real vector? - lejlot
@lejlot You are right; that is my question, not the answer. What is the benefit if I cannot train on multidimensional (multivariable), multi-class data simultaneously for regression (say, for quantification of gases)? That would mean I have to build a separate regression model for each type of data, which is less beneficial. Or should I use a classification model and a regression model together, first classifying and then quantifying the test data set? Sorry if I still have not conveyed this properly. Thanks in advance. - sunny

1 Answer


I'm still not completely sure what your problem is, but I assume that you have four datasets, each of which is a regression problem, and that you want to perform regression on the combined dataset and expect the model to recover not only the regressed value but also the "source".

  1. In a naive implementation (just concatenating the data) this is not possible, as you lose all information about the source of the data.
  2. You can augment the output value, so that instead of just "Y" (the regression value) you also add another value indicating which set the sample comes from, for example:


Set1:
x11 -> y11

Set2:
x21 -> y21

Set3:
x31 -> y31

Set4:
x41 -> y41

you convert it to

Big set: 
x11 -> (y11,1)
x21 -> (y21,2)
x31 -> (y31,3)
x41 -> (y41,4)

Or in unary coding

Big set: 
x11 -> (y11,1,0,0,0)
x21 -> (y21,0,1,0,0)
x31 -> (y31,0,0,1,0)
x41 -> (y41,0,0,0,1)

This way, once trained, the model lets you retrieve not only the regression value but also the source dataset (which you refer to as the "class").
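To make the unary coding concrete, here is a minimal sketch of how the "big set" could be built and how a predicted target vector could be decoded back into a value and a source. The function names `build_big_set` and `decode_target` and the toy numbers are made up for illustration; the regressor that would be trained on these targets is not shown.

```python
def build_big_set(datasets):
    """Merge per-source datasets into one list of (x, target) pairs.

    `datasets` is a list of sources; each source is a list of (x, y) pairs.
    Each target becomes (y, s_1, ..., s_k), where exactly one s_i is 1.
    """
    n_sources = len(datasets)
    big_set = []
    for source_index, samples in enumerate(datasets):
        indicator = [0] * n_sources
        indicator[source_index] = 1
        for x, y in samples:
            big_set.append((x, (y, *indicator)))
    return big_set


def decode_target(target):
    """Split a predicted target into (regression value, 1-based source id).

    The source is taken to be the indicator slot with the largest score.
    """
    y, *scores = target
    source_id = scores.index(max(scores)) + 1
    return y, source_id


# Hypothetical toy data: four sources, one sample each.
datasets = [
    [([0.1, 0.2], 12.0)],
    [([0.3, 0.1], 25.0)],
    [([0.7, 0.9], 31.0)],
    [([0.5, 0.4], 48.0)],
]
big_set = build_big_set(datasets)
```

A multi-output regressor will return real-valued scores rather than exact 0/1 indicators, which is why `decode_target` picks the largest score instead of looking for a 1.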

And of course the simplest solution would be to build two separate models, one for regression and one for classification. This way you can use models designed for each task, whereas a single model forces a suboptimal choice for at least one of them.

Big set regression: 
x11 -> y11
x21 -> y21
x31 -> y31
x41 -> y41

Big set classification: 
x11 -> 1
x21 -> 2
x31 -> 3
x41 -> 4
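As a rough illustration of the two-model setup, the sketch below trains nothing fancy: it uses k-nearest neighbours for both tasks, once averaging the y values (regression) and once taking a majority vote over the source labels (classification). The choice of k-NN, the function names, and the toy data are my own assumptions; any regressor and any classifier would do.

```python
import math
from collections import Counter


def nearest(train_x, query, k):
    """Return the indices of the k training points closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    return order[:k]


def knn_regress(train_x, train_y, query, k=3):
    """Regression model: mean y of the k nearest neighbours."""
    idx = nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


def knn_classify(train_x, train_source, query, k=3):
    """Classification model: majority source id among the k nearest neighbours."""
    idx = nearest(train_x, query, k)
    return Counter(train_source[i] for i in idx).most_common(1)[0][0]


# Hypothetical big set: four sources, three samples each.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [1.0, 0.0], [1.1, 0.0], [1.0, 0.1],
     [0.0, 1.0], [0.1, 1.0], [0.0, 1.1],
     [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
y = [10.0, 11.0, 12.0, 20.0, 21.0, 22.0,
     30.0, 31.0, 32.0, 40.0, 41.0, 42.0]
source = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]

query = [1.05, 0.05]
predicted_value = knn_regress(X, y, query)
predicted_source = knn_classify(X, source, query)
```

Both models see the same inputs; only the target column differs, exactly as in the two "big sets" above.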

The question of whether it is beneficial to create such a "big set" instead of working on the small ones is equivalent to answering the question:

Whether the value being regressed has very similar predictors in each of the subproblems

So, if you have reasonable justification that in each subproblem this value can be derived in a similar way (because it corresponds to some physical phenomenon that is independent of how your data is split), then yes, it is a good approach. However, such a step should be taken only when you have a very limited training set; given enough data, it is better to assume that each subproblem has different characteristics, in order to avoid biasing the model. Finally, the "ultimate" answer is "check both methods and evaluate which behaves better" (using, for example, cross-validation), which will also give you some insight into whether this value is truly independent of the subsets.
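The "check both methods" step could look roughly like the sketch below. It compares one pooled model against one model per source using leave-one-out cross-validation. Everything here is an assumption made for illustration: a single input feature, a plain least-squares line as the model, the names `fit_line` and `loo_mse`, and toy data in which the two sources deliberately follow different laws. It also assumes the source of a test sample is known, which the per-source variant needs.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x


def loo_mse(samples, pooled):
    """Leave-one-out mean squared error.

    `samples` is a list of (source, x, y) triples. With pooled=True one line
    is fit to all remaining samples; otherwise the line is fit only to the
    remaining samples from the held-out sample's own source.
    """
    total = 0.0
    for i, (src, x, y) in enumerate(samples):
        train = [s for j, s in enumerate(samples)
                 if j != i and (pooled or s[0] == src)]
        a, b = fit_line([s[1] for s in train], [s[2] for s in train])
        total += (a * x + b - y) ** 2
    return total / len(samples)


# Hypothetical data: source 1 follows y = 2x + 10, source 2 follows y = -x + 40.
samples = [(1, float(x), 2.0 * x + 10.0) for x in range(5)]
samples += [(2, float(x), -1.0 * x + 40.0) for x in range(5)]

pooled_error = loo_mse(samples, pooled=True)
per_source_error = loo_mse(samples, pooled=False)
```

With this toy data the per-source models win by a wide margin, because the sources do not share predictors. If the pooled error came out comparable or lower on your real data, that would support merging the sets.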