9 votes

I wrote simple linear regression and decision tree classifier code with Python's scikit-learn library for predicting an outcome. It works well.

My question is: is there a way to do this backwards, i.e. to predict the best combination of parameter values based on an inputted outcome (the parameter values for which the outcome will be best)?

Or, to put it another way: is there a classification, regression, or other type of algorithm (decision tree, SVM, KNN, logistic regression, linear regression, polynomial regression, ...) that can predict multiple outcomes based on one (or more) parameter(s)?

I have tried to do this by passing a multivariate outcome, but it raises this error:

ValueError: Expected 2D array, got 1D array instead: array=[101 905 182 268 646 624 465]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This is the code that I wrote for regression:

import pandas as pd
from sklearn import linear_model
from sklearn import tree

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

regression = linear_model.LinearRegression()
regression.fit(variables, results)

input_values = [14, 2]

prediction = regression.predict([input_values])
prediction = round(prediction[0], 2)
print(prediction)

This is the code that I wrote for decision tree:

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(variables, results)

input_values = [18, 2]

prediction = decision_tree.predict([input_values])[0]
print(prediction)
I'm not sure I understand what you mean by "the best combination of parameter [...] (parameters where accuracy will be the best)". Do you want the input where your outcome is the biggest possible for the linear regression? Do you want the input where you are most likely to get yes for your decision tree? - vlemaistre
@vlemaistre I want to input an 'outcome' value (yes or no) and I want to get the values of the parameters that are most likely to produce yes for my decision tree - taga
We usually call "parameters" the variables that define a model and are estimated from data, for instance the weights in a linear regression. If I understood you correctly, you want to predict your input variables or features. Updating the terminology might make the question easier to understand. - AlCorreia
This line decision_tree.fit(variables, results) returns the following error ValueError: could not convert string to float: 'yes' - SuperKogito

6 Answers

2 votes

You could frame the problem as an optimization problem.

Let your (trained) regression model's input values be the parameters to be searched.

Define the cost function as the distance between the model's predicted outcome (at a given input combination) and the desired outcome (the value you want).

Then use one of the global optimization algorithms (e.g. genetic optimization) to find the input combination that minimizes the cost (i.e. whose predicted outcome is closest to your desired outcome).
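A minimal sketch of that idea using scipy's differential evolution (a genetic-style global optimizer); the desired outcome of 400 and the search bounds are illustrative assumptions, not from the question:

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.linear_model import LinearRegression

# Training data from the question
X = np.array([[10, 1], [30, 3], [13, 1], [19, 2], [25, 3], [33, 3], [23, 2]])
y = np.array([101, 905, 182, 268, 646, 624, 465])

model = LinearRegression().fit(X, y)

desired_outcome = 400.0  # illustrative target value

# Cost: distance between the model's prediction and the desired outcome
def cost(params):
    return abs(model.predict(params.reshape(1, -1))[0] - desired_outcome)

# Search within the range of the training data (assumed bounds)
bounds = [(10, 33), (1, 3)]
result = differential_evolution(cost, bounds, seed=0)
print(result.x)  # input combination whose prediction is closest to 400
```

The same pattern works with any fitted model in place of `LinearRegression`, since the optimizer only ever calls `predict`.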

2 votes

As mentioned by @Justas, if you want to find the best combination of input values for which the output variable is max/min, then it is an optimization problem.

There is quite a good range of non-linear optimizers available in scipy, or you can go for meta-heuristics such as genetic algorithms, memetic algorithms, etc.

On the other hand, if your aim is to learn the inverse function, which maps the output variable to a set of input variables, then go for MultiOutputRegressor or MultiOutputClassifier. Both can be used as a wrapper on top of any base estimator, such as LinearRegression, LogisticRegression, KNN, DecisionTree, SVM, etc.

Example:

import pandas as pd
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.linear_model import LinearRegression


dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_reg = MultiOutputRegressor(LinearRegression())
multi_output_reg.fit(results.values.reshape(-1, 1),variables)

multi_output_reg.predict([[100]])

# array([[12.43124217,  1.12571947]])
# sounds sensible according to the training data

#if input variables needs to be treated as categories,
# go for multiOutputClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(results.values.reshape(-1, 1),variables)

multi_output_clf.predict([[100]])

# array([[10,  1]])

In most situations, knowing the value of one input variable can help in predicting the others. This can be achieved with ClassifierChain or RegressorChain.

To understand the advantage of ClassifierChain, please refer to this example.
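As a small sketch on the question's data, RegressorChain can be used the same way as MultiOutputRegressor above; the chain order shown is an illustrative choice (predict par_1 first, then use it as an extra feature for par_2):

```python
import pandas as pd
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LinearRegression

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)

variables = df.iloc[:, :-1]
results = df.iloc[:, -1]

# order=[0, 1]: predict par_1 first, then feed its prediction in as an
# extra feature when predicting par_2
chain = RegressorChain(LinearRegression(), order=[0, 1])
chain.fit(results.values.reshape(-1, 1), variables)

print(chain.predict([[100]]))  # one row with predicted (par_1, par_2)
```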

Update:


dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [0, 1, 1, 1, 1, 1 , 0]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs',
                                                            multi_class='ovr'))
multi_output_clf.fit(results.values.reshape(-1, 1),variables)

multi_output_clf.predict([[1]])
# array([[13,  3]])

1 vote

Considering the real-world example you mentioned, I would suggest that you see the input as a range of prices rather than a single price; in that case, features can be grouped together to correspond to a particular price range.

So you can start by clustering the dataset, forming clusters based on the house price; the Mean Shift clustering algorithm will also suggest the number of clusters that can be formed from the data.

You can then identify the min and max house price for each cluster, take the average of the numerical features and the majority value of the categorical features (the features you use to predict the house price), and say that these values correspond to that price range.

Once the mapping is complete, you can see which price-range cluster an input falls into and return the aggregate parameters described above for that cluster.

Dataset source: https://github.com/ageron/handson-ml/tree/master/datasets/housing

Code :

import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

df = pd.read_csv('housing.csv')
df.drop(['longitude', 'latitude'], axis=1, inplace=True)

# Cluster on the house price alone
X_train = np.array(df['median_house_value']).reshape(-1, 1)

ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X_train)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

n_clusters_ = len(np.unique(labels))
print("number of estimated clusters : %d" % n_clusters_)
print(labels)

df['cluster'] = labels

df1 = df[df['cluster'] == 1]
df2 = df[df['cluster'] == 0]

# Price range (min, max) for each cluster
ranges = []
ranges.append([min(df1['median_house_value']), max(df1['median_house_value'])])
ranges.append([min(df2['median_house_value']), max(df2['median_house_value'])])

# Set aside the categorical column, then average the numerical features
categorical_col = 'ocean_proximity'
df1_categorical_set = df1[categorical_col]
df1 = df1.drop(categorical_col, axis=1)
df2_categorical_set = df2[categorical_col]
df2 = df2.drop(categorical_col, axis=1)

df1_feature = [np.mean(df1[col]) for col in df1.columns]
df2_feature = [np.mean(df2[col]) for col in df2.columns]

print("Range : ", ranges[0], "\nFeatures : ", df1_feature, '\n',
      "Range : ", ranges[1], "\nFeatures : ", df2_feature)

If you now print df1_feature and df2_feature, you get the average feature values for both clusters (the corresponding price ranges are appended to the list ranges, so you can print that too). Any house in the first price range would have df1_feature as its ideal set of features, and the same goes for df2_feature.

In case you want more price ranges, you can use k-means clustering and specify the number of clusters.
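A minimal k-means sketch of that last point, with made-up prices standing in for df['median_house_value'] (the values and the choice of three clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative house prices; with real data use df['median_house_value']
prices = np.array([[80000], [85000], [150000], [160000], [300000], [310000]])

# Unlike Mean Shift, k-means lets you ask for three price ranges explicitly
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(prices)
labels = kmeans.labels_

# Min/max price per cluster gives the price ranges
for k in range(3):
    members = prices[labels == k].ravel()
    print(k, members.min(), members.max())
```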

1 vote

@taga, I think you are referring to multivariate regression. I have worked with Partial Least Squares (PLS) for that purpose: having a set of N features, you can create a model for estimating M outputs, which in the end is an N×M coefficient matrix. Does this sound like the thing you are looking for? I could go into more detail on this.

EDIT:

Using the same code you provided would be something like:

import pandas as pd
from sklearn import linear_model

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome1': [101, 905, 182, 268, 646, 624, 465],
       'outcome2': [105, 320, 135, 208, 262, 324, 246]
}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-2]
results = df.iloc[:,-2:]

regression = linear_model.LinearRegression()
regression.fit(variables, results)

input_values = [14, 2]

prediction = regression.predict([input_values])
prediction = [round(x,2) for x in prediction[0]]
print(prediction)

You need to pass your outcomes to the model's fit function as an L×M array, where L is the number of samples and M is the number of outcomes.

Hope that helps.

0 votes

I think a basic neural network would do the job, if I understand the question. When you say "that can predict multiple outcomes based on one (or more) parameter/s", note that you can feed as many parameters as you have (or would like) into a neural network, with as many different outcomes as you need. If you decide that you want a binary decision (i.e. yes or no) for your problem, a basic perceptron would work as well. Both of these methods allow input vectors of any length.
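A minimal sketch of that idea on the question's data, using scikit-learn's MLPRegressor (one choice of network; the hidden-layer size and iteration count are illustrative assumptions). It maps the single outcome to both parameters at once, since MLPRegressor supports multi-output regression natively:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# One input (the outcome), two outputs (par_1, par_2)
X = np.array([[101], [905], [182], [268], [646], [624], [465]])
y = np.array([[10, 1], [30, 3], [13, 1], [19, 2], [25, 3], [33, 3], [23, 2]])

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
net.fit(X, y)

print(net.predict([[400]]))  # estimated (par_1, par_2) for outcome 400
```

With this few samples the fit is rough; scaling the inputs and adding data would be needed for anything beyond a toy demonstration.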

Hope I understood your question correctly and provided a helpful method to resolving it!

0 votes

For your regression, you can extract the coefficients and determine which of your inputs yields the maximum output. Here is what it could look like:

import numpy as np

# We extract the linear regression's coefficients
coeff = regression.coef_
input_values = list(zip(dic['par_1'], dic['par_2']))

# We choose the best input thanks to those coefficients
index_best_input = np.argmax([x[0]*coeff[0] + x[1]*coeff[1] for x in input_values])
best_input = input_values[index_best_input]

print(best_input)
# (33, 3)

For your decision tree, the best way is to look at each leaf and check its precision while taking into account the number of training samples in each leaf. One thing you can do is print the tree:

import pandas as pd
from sklearn import tree
import graphviz
dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(variables, results)

dot_data = tree.export_graphviz(decision_tree, out_file=None) 
graph = graphviz.Source(dot_data)  
print(graph)

(decision tree visualization rendered by graphviz)

You can see that there are four good candidates with 100% precision but only one sample each:

  1. The inputs with par_1>31.5
  2. The inputs with 11.5
  3. The inputs with 16
  4. The inputs with 16