2 votes

After learning about the many transformations that can be applied to the target values (the y column) of a data set, such as the Box-Cox transformation, I learned that linear regression models need to be trained on normally distributed target values in order to be efficient (https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va).

I'd like to know whether the same applies to non-linear regression algorithms. So far I have seen people on Kaggle who use xgboost apply a log transformation to mitigate heteroskedasticity, but they never mention whether it is also done to obtain normally distributed target values.

I did some research and found, on page 11 of Andrew Ng's lecture notes (http://cs229.stanford.edu/notes/cs229-notes1.pdf), that the least squares cost function, used by many algorithms both linear and non-linear, is derived by assuming that the errors are normally distributed. I believe that if the errors should be normally distributed, then the target values should be as well. If this is true, then every regression algorithm that uses a least squares cost function should work better with normally distributed target values.
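If I am reading the notes correctly, the derivation goes like this: assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$; the log-likelihood of the data is then

$$\ell(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2,$$

so maximizing the likelihood over $\theta$ is exactly the same as minimizing the least squares cost $\frac{1}{2}\sum_{i}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.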

Since xgboost uses a least squares cost function for node splitting (http://cilvr.cs.nyu.edu/diglib/lsml/lecture03-trees-boosting.pdf, slide 13), maybe this algorithm would work better if I transformed the target values with a Box-Cox transformation for training and then applied the inverse Box-Cox transformation to the model's output to get the predicted values. Will this, theoretically speaking, give better results?
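To make the workflow I have in mind concrete, here is a rough sketch (the data and hyperparameters are made up, and I am assuming scipy's boxcox/inv_boxcox together with the xgboost scikit-learn wrapper, not anything from the linked slides):

    import numpy as np
    from scipy.stats import boxcox
    from scipy.special import inv_boxcox
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)

    # made-up, strictly positive, right-skewed target (Box-Cox requires y > 0)
    X = rng.uniform(0, 10, size=(500, 3))
    y = np.exp(0.5 * X[:, 0] + rng.normal(scale=0.3, size=500))

    # 1) transform the target, estimating lambda by maximum likelihood
    y_bc, lam = boxcox(y)

    # 2) train on the transformed target
    model = XGBRegressor(n_estimators=200, max_depth=3)
    model.fit(X, y_bc)

    # 3) apply the inverse transformation to the predictions
    y_pred = inv_boxcox(model.predict(X), lam)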

If you have data generated from a linear function with non-normal errors and you apply linear regression to it, the fit will not be optimally efficient, but because OLS is a consistent estimator you will converge to the right answer given enough data (search for "consistent" in en.wikipedia.org/wiki/Ordinary_least_squares). If you transform the data in a way that makes the underlying curve no longer linear, there is no way for linear regression to return the correct answer. – mcdowella
Thank you for the answer, but the main question involves non-linear regression. – Phill Donn

2 Answers

1 vote

Your conjecture, "I believe if the error should be normally distributed then the target values should be as well", is wrong, so the question as posed does not really have an answer. The errors are deviations around the regression function; because that function varies with the features, the marginal distribution of the target can look like almost anything even when the errors are exactly normal.

There is no assumption that the target variable itself is normal at all.

Transforming the target variable does not make the errors normally distributed. In fact, it may ruin whatever normality the errors had.
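A quick simulation makes the point (the numbers are arbitrary and the normality test is just one convenient check):

    import numpy as np
    from scipy.stats import normaltest

    rng = np.random.default_rng(1)

    # a regressor with a bimodal distribution and exactly normal errors
    x = rng.choice([0.0, 10.0], size=5000)
    y = 3.0 * x + rng.normal(scale=1.0, size=5000)

    # the errors are normal by construction, yet the target is clearly not:
    # the test rejects normality of y with a p-value of essentially zero
    print(normaltest(y).pvalue)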

1 vote

I have no idea what this is supposed to mean: "linear regression models need to be trained with normally distributed target values in order to be efficient." Efficient in what way?

Linear regression models are global models: they simply fit a surface to the overall data. The operations are matrix operations, so the time to "train" the model depends only on the size of the data. The distribution of the target has nothing to do with model building performance, and it has nothing to do with model scoring performance either.
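If it helps, here is a minimal sketch (with made-up data) of the point that an ordinary least squares fit is nothing but matrix algebra and never consults the distribution of the target:

    import numpy as np

    rng = np.random.default_rng(2)

    # an intercept column plus four random features, and a decidedly
    # non-normal (exponential) target
    X = np.column_stack([np.ones(1000), rng.uniform(size=(1000, 4))])
    y = rng.exponential(size=1000)

    # solve the least squares problem; this is equivalent to the normal
    # equations X^T X beta = X^T y, just computed in a numerically safer way
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)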

Because targets are generally not normally distributed, I would certainly hope that such a distribution is not required for a machine learning algorithm to work effectively.