Prediction for string in linear regression in python

Question

I have a dataset with three columns, Type/Name/Price, and want to predict the price based on Type and Name. Type and Name are categorical string values, and Price is the numeric target variable.

My dataset looks like:

Type Name Price
A    ec1  1.5
B    ec2  2
A    ec2  3
C    ec1  1
B    ec3  1

I have to create a model for this dataset and use it to predict the price for a given Type/Name pair. For example, what is the predicted price for Type A and Name ec2? Could you please provide some sample code?

Also, the dataset won't have a fixed number of columns. Only the target variable is fixed as Price. The independent variables might include fields such as Type, Name, Date, etc.

Have you considered combining those variables into a single categorical type? For example: A - ec1 --> Type_I, A - ec2 --> Type_II, B - ec1 --> Type_III. (Comment by Raúl Reguillo Carmona)

Jundiaius Jundiaius · Accepted Answer · 2017-10-05T15:07:23

Use a dictionary vectorizer (scikit-learn's DictVectorizer) on your input data. It one-hot encodes your categorical features: every feature/value pair becomes its own binary column of the feature vector.

Read more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer

If I take your dataset as an example, it will look something like this:

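from sklearn.feature_extraction import DictVectorizer

# One dict per row; the categorical values must be quoted strings
data = [{"type": "A", "name": "ec1"},
        {"type": "B", "name": "ec2"},
        {"type": "A", "name": "ec2"},
        {"type": "C", "name": "ec1"},
        {"type": "B", "name": "ec3"}]

# Each (feature, value) pair becomes one binary column
vectorizer = DictVectorizer()
vector_data = vectorizer.fit_transform(data)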
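
To see what the encoding produces, here is a small sketch that prints the generated columns and the encoded rows. It passes sparse=False so the result is a plain NumPy array that is easy to read; that setting is a choice made here for illustration, not something the question requires.

```python
from sklearn.feature_extraction import DictVectorizer

data = [{"type": "A", "name": "ec1"},
        {"type": "B", "name": "ec2"},
        {"type": "A", "name": "ec2"},
        {"type": "C", "name": "ec1"},
        {"type": "B", "name": "ec3"}]

# sparse=False returns a dense array instead of a SciPy sparse matrix
vectorizer = DictVectorizer(sparse=False)
vector_data = vectorizer.fit_transform(data)

# One column per (feature, value) pair, named "feature=value"
feature_names = list(vectorizer.feature_names_)
print(feature_names)

# One row per input dict, with a 1.0 in the columns that apply
print(vector_data)
```

Each row has exactly two non-zero entries, one for its name and one for its type.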

Now your vector_data (a sparse matrix by default) is ready to be used as the input of a machine learning model, such as a linear regression.
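
As a minimal sketch of the remaining step, the vectorizer can be chained with scikit-learn's LinearRegression to predict the price for Type A and Name ec2. The prices list is copied from the question's table; the variable names (prices, model, query, predicted_price) and the use of make_pipeline are choices made here for illustration, and other regressors could be swapped in.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Features as one dict per row, target prices in the same order
data = [{"type": "A", "name": "ec1"},
        {"type": "B", "name": "ec2"},
        {"type": "A", "name": "ec2"},
        {"type": "C", "name": "ec1"},
        {"type": "B", "name": "ec3"}]
prices = [1.5, 2, 3, 1, 1]

# The pipeline encodes the dicts, then fits the regression on the result
model = make_pipeline(DictVectorizer(sparse=False), LinearRegression())
model.fit(data, prices)

# New rows are passed as dicts with the same keys
query = [{"type": "A", "name": "ec2"}]
predicted_price = model.predict(query)[0]
print(predicted_price)
```

With only five rows the model can fit the training data exactly, so for a combination that already appears in the table the prediction should come out at roughly its training price (3 for A/ec2); that says little about how well it generalizes. Because each row is a dict, extra fields can be added as further keys without changing the code: string values are one-hot encoded and numeric values are passed through as-is. A raw date string would be treated as just another category, so it probably needs to be converted to numeric features first.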
