Finding features that influence net revenue

Question

Using machine learning, I would like to identify the features that influence net revenue and draw conclusions from the data. The data set comes from a car-sharing company (like Turo) and contains ~80,000 rows and 14 columns.

I am having difficulty building an EDA, and especially choosing an ML algorithm to find the features that influence net_revenue.

What I did so far

I ran a correlation matrix analysis on this data and found that 'youth driver fee' is the feature most correlated with 'net_revenue'. (I left the make and model columns out of the analysis because there are so many makes and models that it is hard to predict their effect on net_revenue.)
I wanted to check whether this correlation holds up with ML algorithms such as logistic regression and random forest. To apply a random forest, I converted the categorical variables (payment_type, returning_guest and returning_host) to dummy variables (0's and 1's).

So I tried to apply these two models by following this post

LogisticRegression

cols = ['driver_age', 'completed_trips', 'vehicle_price', 'lead_time', 'trip_length',
        'trip_revenue', 'youth_driver_fee', 'insurance_fee', 'delivery_fee',
        'returning_quest_First_time', 'returning_quest_Repeat',
        'returning_host_First_time', 'returning_host_repeat']

X = data[cols]
y = data['net_revenue']

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

*default settings of LogisticRegression

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

**The IPython notebook freezes after executing the code above, and it looks like it will never output anything, so I have to restart the kernel.

RandomForest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

**Same problem!

My Questions

How can I build an ML model to find the features that influence net revenue? Is there any resource that addresses the same problem? A Kaggle competition would definitely be fine, or maybe a Medium post.

I found one dataset that predicts a target value from features, but its target value looks categorical, while mine is continuous: https://www.kaggle.com/prasadkevin/prediction-of-quality-of-wine

To use LogisticRegression and RandomForest, does net_revenue have to be a categorical variable?
Do you happen to know any similar dataset on Kaggle? :) I could not find any comparable ML workflow like this one!

Any advice would be appreciated!

Thx

Kyle Kyle · Accepted Answer · 2019-09-08T09:21:34

A few things.

scikit-learn estimators generally expect numeric input, so you have to convert every categorical variable to dummy variables for most models, not just for random forests.
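As a minimal sketch of that encoding step, assuming the data lives in a pandas DataFrame: the column names below mirror the question, but the rows and the category values ("card", "cash") are made up for illustration.

```python
import pandas as pd

# Toy frame: column names follow the question, the values are invented.
data = pd.DataFrame({
    "payment_type": ["card", "cash", "card"],
    "returning_guest": ["First_time", "Repeat", "Repeat"],
    "trip_length": [2, 5, 3],
})

# One 0/1 column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(
    data, columns=["payment_type", "returning_guest"], dtype=int
)
print(encoded.columns.tolist())
```

Each categorical column is replaced by one indicator column per level, named `<column>_<level>`.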

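You are using RandomForestClassifier for a regression problem, which is not what you want: net_revenue is continuous, so use sklearn.ensemble.RandomForestRegressor instead. The same applies to LogisticRegression, which despite its name is a classifier; a regressor such as sklearn.linear_model.LinearRegression is the linear counterpart for a continuous target.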
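One way this could look, as a rough sketch rather than a definitive recipe: fit a RandomForestRegressor and read its `feature_importances_` attribute to rank features. The data here is synthetic (three columns named after the question, with a target built so that lead_time has no effect), so the numbers say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: values are invented).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "trip_revenue": rng.uniform(50, 500, n),
    "youth_driver_fee": rng.uniform(0, 30, n),
    "lead_time": rng.uniform(0, 60, n),
})
# Target depends on trip_revenue and youth_driver_fee, but not on lead_time.
y = 0.7 * X["trip_revenue"] + 5.0 * X["youth_driver_fee"] + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)

# R^2 on held-out rows, then impurity-based importances (they sum to 1).
r2 = rf.score(X_test, y_test)
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print("R^2 on held-out data:", r2)
print(importances)
```

Impurity-based importances are a quick first look and not a causal statement; features that are highly correlated with each other can share or swap importance.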

Your machine learning models are probably running if no errors are being thrown. Since you have 80,000 rows it may just take a while. When you define your models, define them as

logreg = LogisticRegression(verbose=1)

and

rf = RandomForestRegressor(verbose=1)

If the models are running they will print out their progress so you can see what is going on.
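If you want a quick check before committing to the full 80,000 rows, one option is to time a fit on a random subsample with verbose output switched on. This is only a sketch on random data of the same shape as in the question (80,000 rows, 13 features); the subsample size and n_estimators are arbitrary choices.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random data with the question's shape (assumption: values are invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(80_000, 13))
y = 3.0 * X[:, 0] + rng.normal(size=80_000)

# Fit on a 5,000-row subsample first to confirm the model trains at all.
idx = rng.choice(len(X), size=5_000, replace=False)
rf = RandomForestRegressor(n_estimators=20, n_jobs=-1, verbose=1, random_state=0)

start = time.perf_counter()
rf.fit(X[idx], y[idx])
elapsed = time.perf_counter() - start
print(f"fit on {len(idx)} rows took {elapsed:.2f}s")
```

If the subsample fit finishes quickly, scale the row count up step by step to see how the training time grows.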
