Using machine learning I would like to identify features that influence net revenue and make conclusions from data based on that. The data set is a car sharing company data (like Turo). Data set contains ~80000 rows 14 columns.
I have difficulty to build a EDA especially with ML algorithm to use to find out features that influence on net_revenue.
What I did so far
I did correlation matrix analysis on this data and find out
'youth driver fee'has the mostcorrelatedfeature to'net_revenue'( I keptmakeandmodelcolumns out of the analysis because there are so many makes and models and its hard to predict their effect on thenet_revenue)I wanted to see this correlation is relevant with some ML algorithms such as
Logistic regressionandRandomforest. To further applying RandomForest ML to verify this correlation I converted categorical variables (payment_type, returning_guest and returning_host) to the dummy variables (0's and 1's)
So I tried to apply these two models by following this post
LogisticRegression
cols=['driver_age', 'completed_trips', 'vehicle_price', 'lead_time', 'trip_length',
'trip_revenue', 'youth_driver_fee', 'insurance_fee', 'delivery_fee', 'returning_quest_First_time','returning_quest_Repeat','returning_host_First_time','returning_host_repeat']
X=data[cols]
y=data['net_revenue']
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
*default settings of LogisticRegression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class=’ovr’, n_jobs=1, penalty=’l2', random_state=None, solver=’liblinear’, tol=0.0001, verbose=0, warm_start=False)
**The IPython notebook freezes after executing the code above and it looks like it would never output something.So I have to restart the kernel.
RandomForest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
**Same problem!
My Questions
- How can I a ML model for finding features that influence net revenue? Is there any resource that the same problem addressed ? Kaggle competitions definaltely fine or maybe a medium post.
I found one dataset to predict features on target value but target value look like categorical mine is continuous. from https://www.kaggle.com/prasadkevin/prediction-of-quality-of-wine
to use
LogisticRegressionandRandomForest, Isnet_revenuehas to be categorical variable ?Do you happen to know and similar dataset on Kaggle:) ? because I could not find any correlated ML flow like this one!
Any advice would be appreciated!
Thx
