Understanding multiple Linear regression

Question

I am working on a multiple regression problem. I have the data set below.

rank--discipline--yrs.since.phd--yrs.service--sex--salary
[  1           1             19           18    1  139750],......

I am taking salary as the dependent variable and the other variables as independent variables. After preprocessing the data, I ran a gradient descent regression model and estimated the bias (intercept) and a coefficient for each independent feature. I want to draw a scatter plot of the actual values and a regression line for the hypothesis I fitted. Since there is more than one feature here, I have the questions below.

1. While plotting the actual values (scatter plot), how do I decide the x-axis values? I have a list of values for each row; for example, the first row is [1,1,19,18,1] => 139750. How do I transform or map [1,1,19,18,1] to the x-axis? I need to somehow reduce [1,1,19,18,1] to one value so that I can mark a point (x, y) in the plot.
2. While plotting the regression line, what feature values should I use to calculate the hypothesis value? I have the intercept and the weights of all features, but I don't have feature values to plug in. How do I decide on them?

I want to calculate the points myself and use matplotlib to do the plotting. I am aware that there are a lot of tools available, including matplotlib, to do the job, but I want to get the basic understanding.

Thanks.

If you have multiple targets in your dataset, the best approach is to plot each target separately, i.e. have only one target in each plot. As for your question 1, is the data stored as NumPy arrays or a pandas DataFrame? Coming to question 2, can you explain where you want to plot it? Are there only two variables? Do you already have the coefficients and intercept? Most importantly, can you post the result of df.describe() on your input data? Also, if possible, try to reframe your second question, as I am still a bit confused by it. — anand_v.singh
Hi, I have exactly one target (salary) in my data set and 5 features (rank, discipline, etc.). I am using pandas for the df. My query on point 1 is: let's say the first data instance is [1,1,19,18,1] and the target value is 139750. How do I plot these values on the x and y axes? Since I have more than one value for the x-axis, how do I convert it? My query on the 2nd question is: let's say [1,2,3,4,5,6] are the intercept and coefficients I have arrived at for this dataset, with intercept=1 and the rest being the weights for the features. The formula would be h(x)=1+2X1+3X2+4X3+5X4+6X5. What values can I take for X1, X2...X5? — Selva Ganesh S
Not a programming question, better suited for Cross Validated. — desertnaut

anand_v.singh · Accepted Answer · 2019-02-11T08:54:54

I am still not sure I completely understand your question, so if something is not what you expected, comment below and we will work it out.

Now,

Query 1: Almost every dataset has multiple inputs, and there is no way to view the target variable (salary in your case) against all of them in a single graph. What is usually done is to apply dimensionality reduction, for example t-SNE or principal component analysis (PCA), so that the output becomes a function of two or three variables, and then plot that. The other technique, which I prefer, is to plot the target against each variable separately, as subplots. The reason is that we have no way to visually comprehend data in more than three dimensions.
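The per-variable subplot idea can be sketched roughly as follows. This is only an illustration: the helper name plot_target_vs_each_feature is made up, and apart from the first row (taken from the question) the sample rows are invented so that the snippet runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: only the first row comes from the question,
# the remaining rows are invented purely for illustration.
df = pd.DataFrame(
    {
        "rank": [1, 2, 3, 1, 2, 3],
        "discipline": [1, 2, 1, 2, 1, 2],
        "yrs.since.phd": [19, 12, 4, 30, 15, 6],
        "yrs.service": [18, 10, 3, 25, 11, 2],
        "sex": [1, 0, 1, 1, 0, 1],
        "salary": [139750, 105000, 80000, 150000, 98000, 85000],
    }
)


def plot_target_vs_each_feature(df, target="salary"):
    """Draw one scatter subplot per feature, all sharing the target on the y-axis."""
    features = [col for col in df.columns if col != target]
    # squeeze=False always returns a 2-D array of axes, even for a single feature.
    fig, axes = plt.subplots(
        1, len(features), figsize=(3.5 * len(features), 3.5),
        sharey=True, squeeze=False,
    )
    axes = axes[0]  # the single row of axes
    for ax, name in zip(axes, features):
        ax.scatter(df[name], df[target])
        ax.set_xlabel(name)
    axes[0].set_ylabel(target)
    fig.tight_layout()
    return fig, axes


fig, axes = plot_target_vs_each_feature(df)
```

Calling plt.show() afterwards displays the figure. Each panel then answers the "which x value?" question for one feature at a time, instead of trying to squeeze the whole row [1,1,19,18,1] into a single number.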

Query 2: If you are not set on using matplotlib, I would suggest seaborn.regplot(), but let's also do it in matplotlib. Suppose the first pair you want to plot is 'discipline' vs 'salary'.

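from sklearn.linear_model import LinearRegression

# df is the pandas DataFrame holding the dataset from the question
lm = LinearRegression()
X = df[['discipline']]  # 2-D feature matrix with a single column
Y = df['salary']        # 1-D target
lm.fit(X, Y)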

After running this, lm.coef_ gives you the coefficient and lm.intercept_ gives you the intercept of the fitted line salary = intercept + coefficient * discipline. With these you can easily plot the data points for the two variables, together with the line, using matplotlib.
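A minimal sketch of that last plotting step, under a few assumptions: the helper plot_fit is a made-up name, and the sample rows (other than the first, which is from the question) are invented so the snippet is self-contained. The x values for the line are simply the smallest and largest observed values of the feature, since two points are enough to draw a straight line.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in data: only the first row comes from the question,
# the remaining rows are invented purely for illustration.
df = pd.DataFrame(
    {
        "discipline": [1, 2, 1, 2, 1, 2],
        "yrs.service": [18, 10, 3, 25, 11, 2],
        "salary": [139750, 105000, 80000, 150000, 98000, 85000],
    }
)


def plot_fit(df, feature, target="salary"):
    """Fit target ~ feature and draw the data points plus the fitted line."""
    X = df[[feature]]
    y = df[target]
    lm = LinearRegression().fit(X, y)

    # Evaluate h(x) = intercept + coef * x at the two ends of the observed range.
    x_line = np.array([X[feature].min(), X[feature].max()])
    y_line = lm.intercept_ + lm.coef_[0] * x_line

    fig, ax = plt.subplots()
    ax.scatter(X[feature], y, label="actual")
    ax.plot(x_line, y_line, color="red", label="fitted line")
    ax.set_xlabel(feature)
    ax.set_ylabel(target)
    ax.legend()
    return lm, x_line, y_line, ax


lm, x_line, y_line, ax = plot_fit(df, "discipline")
```

Calling plt.show() afterwards displays the plot. Passing another column name, for example plot_fit(df, "yrs.service"), gives the same kind of picture for that feature.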
