Is a large number of features bad in machine learning (regression)?

Question

I am running a linear regression on apartment characteristics in order to predict apartment prices. So far I have collected the characteristics of 13,000 apartments in my city. I have 23-25 features, and I am not sure whether that is a normal number of features for apartment price prediction.

I have these features:
District, microdistrict, residential community, year of building, house building material, number of rooms, storey, total area, living area, condition, floor material, bathroom type, balcony, door type, landline, internet connection type, parking availability, furniture availability, ceiling height, security.

Is it normal to have this many features for regression? Are these features suitable for a linear regression analysis of apartments? Would it be better to reduce the number of features and drop some of them because of redundancy? Could the large number of features in my case (apartment price prediction) lead to overfitting?

stellasia · Accepted Answer · 2015-12-03T20:44:23

How did you choose these features? Have you already run a feature selection algorithm on your dataset? I doubt it. I don't know which steps you have followed so far, but when starting a machine learning problem, you first have to get some intuition about your data:

Look at the data by making histograms, correlation plots, and so on. For example, total area and number of rooms are likely to be highly correlated.
If you want to perform linear regression, you have to make sure the relationship with your target variable (i.e. price) really is linear; you may need to apply functions to the original features to obtain a linear relationship.
Once you have a better idea of which features seem to contribute, you can use a feature selection algorithm (e.g. one of those packaged in sklearn, if you are using Python).
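
As a rough illustration of these three steps, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the data is synthetic (it is not the asker's dataset), the column names are made up, the log transform is just one possible way to linearise the relationship, and `SelectKBest` with `f_regression` is only one of several selectors available in sklearn.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the apartment data (hypothetical values, illustration only).
rng = np.random.default_rng(0)
n = 1000
total_area = rng.uniform(30, 150, n)            # square metres
# Number of rooms loosely follows area, so the two are correlated by construction.
rooms = np.clip(np.round(total_area / 30 + rng.normal(0, 0.5, n)), 1, None)
ceiling_height = rng.uniform(2.4, 3.2, n)       # metres
noise_feature = rng.normal(0, 1, n)             # unrelated to price by construction

# Price is generated multiplicatively, so log(price) is linear in log(total_area).
price = 1000 * total_area ** 1.1 * np.exp(
    0.2 * ceiling_height + rng.normal(0, 0.1, n)
)

# Step 1: look at the correlation between two candidate features.
corr_area_rooms = np.corrcoef(total_area, rooms)[0, 1]
print(f"corr(total_area, rooms) = {corr_area_rooms:.2f}")

# Step 2: inspect the relationship with the target before and after a transform.
corr_raw = np.corrcoef(total_area, price)[0, 1]
corr_log = np.corrcoef(np.log(total_area), np.log(price))[0, 1]
print(f"corr(area, price) = {corr_raw:.2f}, corr(log area, log price) = {corr_log:.2f}")

# Step 3: univariate feature selection on the transformed data.
feature_names = ["log_total_area", "rooms", "ceiling_height", "noise_feature"]
X = np.column_stack([np.log(total_area), rooms, ceiling_height, noise_feature])
y = np.log(price)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("selected features:", selected)
```

Note that a univariate selector like this scores each feature on its own, so it can keep two features that carry almost the same information (here, area and rooms). That is why the correlation check in the first step is worth doing before trusting the selection.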
