I have a dataset with a total of 58 samples. The dataset has two columns, "measured signals" and "people_in_area". I am trying to train a Linear Regression model on it using Scikit-learn. For the moment, I split the dataset into 75% for training and 25% for testing. However, depending on the order the data was in before the split, I obtain different R-squared values.

I think that, because the dataset is small, the order of the data before the split determines which values end up in x_test and y_test. Because of this, I am thinking of using cross-validation on my Linear Regression model to divide the data into train and test sets randomly several times, so that the model is trained and tested on more splits and I obtain more reliable results. Is this a correct approach?

1
I suggest this question would be better suited to "Cross Validated" since it focuses on techniques rather than programming - DontDivideByZero

1 Answer

Yes, using cross-validation will give you a better estimate of your model's performance.
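
As a rough illustration of what that could look like in scikit-learn, here is a minimal sketch using `cross_val_score` with a shuffled `KFold`. The data is synthetic (58 rows, one feature), since the real dataset is not shown; the names `X`, `y`, the slope and the noise level are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 58-sample dataset (assumption: the real data is not shown).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(58, 1))             # plays the role of "measured signals"
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=58)  # plays the role of "people_in_area"

# 5 folds, shuffled with a fixed seed so the split (and the scores) are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("mean R^2: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds shows how much a single 75/25 split could have moved the R-squared value.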

However, splitting randomly (as shuffled cross-validation does) is not appropriate for time-series data, and it does not suit every distribution of data.
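
If the rows happen to be ordered in time, one option is scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. The sketch below assumes the rows are already in chronological order; the data and the names `X_ts`, `y_ts` are synthetic placeholders, not the asker's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic data; assumption: row i was recorded before row i + 1.
rng = np.random.default_rng(0)
X_ts = rng.uniform(0, 10, size=(58, 1))
y_ts = 3.0 * X_ts[:, 0] + rng.normal(0, 2.0, size=58)

# Each split trains on a prefix of the data and tests on the block that follows it.
tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X_ts))
for train_idx, test_idx in splits:
    print("train rows 0..%d, test rows %d..%d"
          % (train_idx.max(), test_idx.min(), test_idx.max()))

ts_scores = cross_val_score(LinearRegression(), X_ts, y_ts, cv=tscv, scoring="r2")
print("R^2 per split:", np.round(ts_scores, 3))
```

No shuffling happens here, so the model is never evaluated on samples that precede its training data.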

Note that the "final model" itself will not be better; only your estimate of its performance will be.
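
One common way to act on this, sketched here under the assumption that shuffling is acceptable for the data, is to use cross-validation only for the estimate and then refit a single model on all 58 samples. `RepeatedKFold` is used to average over several random partitions; the data and the names `X_all`, `y_all`, `final_model` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the dataset (assumption: the real data is not shown).
rng = np.random.default_rng(1)
X_all = rng.uniform(0, 10, size=(58, 1))
y_all = 3.0 * X_all[:, 0] + rng.normal(0, 2.0, size=58)

# Step 1: estimate performance. The 50 models fitted here are thrown away.
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
cv_r2 = cross_val_score(LinearRegression(), X_all, y_all, cv=rkf, scoring="r2")
print("estimated R^2: %.3f (std %.3f)" % (cv_r2.mean(), cv_r2.std()))

# Step 2: the final model is a plain fit on every available sample.
final_model = LinearRegression().fit(X_all, y_all)
print("slope:", final_model.coef_[0], "intercept:", final_model.intercept_)
```

The cross-validated score describes how a model trained this way is expected to perform; it does not change the coefficients of `final_model`.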