0
votes

I am running a linear regression analysis on apartment characteristics in order to predict apartment prices. So far I have collected characteristics for 13,000 apartments in my city. I have 23-25 features, and I am not sure whether that is a normal number of features for apartment price prediction.

I have these features:
District, microdistrict, residential community, year of building, house building material, number of rooms, storey, total area, living area, condition, floor material, bathroom type, balcony, door type, landline, internet connection type, parking availability, furniture availability, ceiling height, security.

Is it normal to have this many features for regression? Are these features suitable for a linear regression analysis of apartments? Would it be better to drop some features because of redundancy? Could such a large number of features lead to overfitting in my case (apartment price prediction)?

2 Answers

2
votes

How did you find these features? Have you already run a feature selection algorithm on your dataset? I doubt it. I don't know which steps you have already followed, but when starting a machine learning problem, you first have to build some intuition about your data:

  1. Look at the data by making histograms, correlation plots, and so on. For example, area and number of rooms might be highly correlated.

  2. If you want to perform linear regression, you have to make sure the relationship with your target variable (i.e. price) really is linear. It may be necessary to use some functions of the original features to get a linear relationship.

  3. Once you have a better idea of which features seem to contribute, you can use a feature selection algorithm (e.g. the ones packaged in sklearn if you are using Python).
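
As a rough sketch of steps 1 and 2, assuming NumPy and purely synthetic data (the variable names, the generating formula and every number below are invented for illustration, not taken from the asker's dataset):

```python
import numpy as np

# Synthetic stand-in for real listings; every number here is invented.
rng = np.random.default_rng(0)
n = 1000
total_area = rng.uniform(25, 150, size=n)  # square metres
# Room count loosely follows area, rounded and kept at 1 or more.
rooms = np.clip(np.round(total_area / 30 + rng.normal(0, 0.5, n)), 1, None)
# Deliberately exaggerated nonlinearity: price grows with area squared,
# with multiplicative noise, so it is not linear in the raw area.
price = 50 * total_area ** 2 * np.exp(rng.normal(0, 0.05, n))


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Step 1: check whether two features carry overlapping information.
area_rooms_corr = pearson(total_area, rooms)

# Step 2: compare how linear the relationship is before and after
# taking logarithms of both the feature and the target.
raw_corr = pearson(total_area, price)
log_corr = pearson(np.log(total_area), np.log(price))

print(f"area vs rooms:           {area_rooms_corr:.3f}")
print(f"area vs price:           {raw_corr:.3f}")
print(f"log(area) vs log(price): {log_corr:.3f}")
```

On real data the right transform (if any) has to be found by looking at the plots; the logarithm is only one common candidate. For step 3, `SelectKBest` with `f_regression` from `sklearn.feature_selection` is one possible starting point, though the answer does not say which algorithm was meant.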

1
votes

@stellasia, good start!

Yes, it's common to have this many features: grab everything you think you might need, and then let your analysis tools (or personal grinding) suggest what isn't needed. It's very hard to add something that you don't have.

You might start by running this through a linear regression modeller. If you don't have one, compute the correlation coefficient of each feature against the price; this lets you eliminate those near 0 (no apparent linear effect).
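
A minimal sketch of that per-feature screening, assuming NumPy and made-up numeric data (the feature names, coefficients and the 0.1 cut-off are all assumptions chosen for this illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical numeric features; 'landline' is generated independently
# of the price, so it should show a correlation close to zero.
features = {
    "total_area": rng.uniform(25, 150, n),
    "ceiling_height": rng.normal(2.7, 0.15, n),
    "landline": rng.integers(0, 2, n).astype(float),
}
price = (
    1000 * features["total_area"]
    + 100000 * features["ceiling_height"]
    + rng.normal(0, 5000, n)
)

threshold = 0.1  # arbitrary cut-off for "near 0" in this sketch

# Pearson correlation of every feature with the target.
corr_with_price = {
    name: float(np.corrcoef(column, price)[0, 1])
    for name, column in features.items()
}

# Keep only the features whose absolute correlation reaches the cut-off.
kept = [name for name, r in corr_with_price.items() if abs(r) >= threshold]

for name, r in corr_with_price.items():
    print(f"{name:15s} r = {r:+.3f}")
print("kept:", kept)
```

Keep in mind that a near-zero Pearson coefficient only rules out a linear relationship, and that categorical features such as district would need a numeric encoding before this check applies.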

After that, compute a full correlation matrix on all the remaining features; a pair with a correlation coefficient near +1.00 or -1.00 indicates that you can eliminate one of the two features: they predict each other so well that you don't need both.
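
The redundancy check could look roughly like this, again assuming NumPy and synthetic data (the feature names, the 0.95 threshold and the rule of dropping the second member of each pair are assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

total_area = rng.uniform(25, 150, n)
# Living area is built as almost a rescaled copy of total area.
living_area = 0.6 * total_area + rng.normal(0, 1.0, n)
# Storey is generated independently of both areas.
storey = rng.integers(1, 17, n).astype(float)

names = ["total_area", "living_area", "storey"]
X = np.column_stack([total_area, living_area, storey])

# rowvar=False tells NumPy that each column is one variable.
corr = np.corrcoef(X, rowvar=False)

threshold = 0.95  # arbitrary cut-off for "near +1 or -1"

# Collect every pair of distinct features whose |r| exceeds the cut-off.
redundant_pairs = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

# Drop the second feature of each redundant pair, keep the rest.
to_drop = {second for _, second, _ in redundant_pairs}
kept = [name for name in names if name not in to_drop]

print(redundant_pairs)
print("kept:", kept)
```

Which member of a pair to drop is a judgment call; keeping the one that is easier to collect or interpret is a reasonable default.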

scikit-learn (sklearn) is good. Octave and MATLAB are excellent if you know how to write the underlying matrix equations.
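
For reference, the "underlying matrix equations" for ordinary least squares are the normal equations, (XᵀX)β = Xᵀy. A small sketch in NumPy, with invented coefficients and synthetic data, compared against NumPy's own least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

total_area = rng.uniform(25, 150, n)
rooms = rng.integers(1, 5, n).astype(float)

# Invented "true" coefficients: intercept, price per m², price per room.
true_beta = np.array([10000.0, 900.0, 5000.0])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), total_area, rooms])
y = X @ true_beta + rng.normal(0, 2000, n)

# Normal equations: solve (X^T X) beta = X^T y for beta.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit through NumPy's least-squares routine, as a cross-check.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

In practice a library solver such as `np.linalg.lstsq` is usually preferable to forming XᵀX by hand, because it behaves better numerically when features are strongly correlated.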

I can also recommend the open-source package TrustedAnalytics (I'm one of the software heads on the project). The Python API is very good for data science, but it is a big-data package: it sits on top of other tools you may not have.