0
votes

I have a dataset with three columns, Type/Name/Price, and want to predict the price based on Type and Name. Here Type and Name are categorical string values, and Price is the numeric target variable.

My dataset looks like:

Type Name Price
A    ec1  1.5
B    ec2  2
A    ec2  3
C    ec1  1
B    ec3  1

I have to create a model for this dataset and want to predict the price for a given Type/Name pair. For example, what is the predicted price for Type A and Name ec2? Could you please provide some sample code?

Also, the dataset won't have a fixed number of columns. Only the target variable is fixed as Price. The independent variables might include fields such as Type, Name, Date, etc.

2
Have you considered combining those variables into a single homogeneous type? For example: A - ec1 --> Type_I, A - ec2 --> Type_II, B - ec1 --> Type_III - Raúl Reguillo Carmona

2 Answers

1
votes

Use a dictionary vectorizer on your input data. It transforms each categorical feature into a set of binary (one-hot) columns of a feature vector.

Read more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer

If I take your dataset as an example, it will look something like this:

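# Category values must be quoted strings for DictVectorizer to one-hot encode them.
data = [{"type": "A", "name": "ec1"},
        {"type": "B", "name": "ec2"},
        {"type": "A", "name": "ec2"},
        {"type": "C", "name": "ec1"},
        {"type": "B", "name": "ec3"}]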

from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
vector_data = vectorizer.fit_transform(data)

Now your vector_data is ready to be used in a Machine Learning model.
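To connect this to the original question (a price for Type A and Name ec2), here is a minimal end-to-end sketch. It assumes a plain `LinearRegression` as the model, which this answer does not prescribe, and it uses a hypothetical helper `split_features` that treats every key except `Price` as a feature, so the number of input columns does not have to be fixed. `sparse=False` is passed only to keep the output a dense array that is easy to inspect.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Rows from the question; "Price" is the target, everything else is a feature.
rows = [
    {"Type": "A", "Name": "ec1", "Price": 1.5},
    {"Type": "B", "Name": "ec2", "Price": 2},
    {"Type": "A", "Name": "ec2", "Price": 3},
    {"Type": "C", "Name": "ec1", "Price": 1},
    {"Type": "B", "Name": "ec3", "Price": 1},
]


def split_features(row, target="Price"):
    """Hypothetical helper: keep every key except the target column."""
    return {key: value for key, value in row.items() if key != target}


features = [split_features(row) for row in rows]
prices = [row["Price"] for row in rows]

# One-hot encode the string-valued features into a dense array.
vectorizer = DictVectorizer(sparse=False)
vector_data = vectorizer.fit_transform(features)

# Assumption: a linear model; any scikit-learn regressor could be used instead.
model = LinearRegression()
model.fit(vector_data, prices)

# Encode the query with the already fitted vectorizer, then predict.
query = vectorizer.transform([{"Type": "A", "Name": "ec2"}])
predicted = float(model.predict(query)[0])
print(predicted)
```

Because the pair (A, ec2) already appears in the training rows and the one-hot model has enough free parameters to fit these five rows exactly, the printed value should match the training price of 3 up to floating-point error. With more data the prediction would generally not reproduce individual rows like this.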

1
votes

I converted the string values to integer codes so that a linear model can be fitted:

from io import StringIO

import pandas as pd
from sklearn.linear_model import LinearRegression

data = '''Type,Name,Price
A,ec1,1.5
B,ec2,2
A,ec2,3
C,ec1,1
B,ec3,1'''
df = pd.read_csv(StringIO(data))

# Map each distinct string in every feature column to an integer code.
mapping = {}
cols = df.drop('Price', axis=1).columns
for col in cols:
    mapping[col] = {name: i for i, name in enumerate(df[col].unique())}

def mapping_func(row):
    # Replace the strings in one row with their integer codes.
    return pd.Series([mapping[col][row[col]] for col in cols])

X = df.apply(mapping_func, axis=1)
y = df['Price']
model = LinearRegression()

model.fit(X, y)
# predict() expects a 2D input: one inner list per sample (here Type B, Name ec2).
print(model.predict([[mapping['Type']['B'], mapping['Name']['ec2']]]))

output:

[ 1.57692308]