1
votes

Let's consider the dataset of house prices from this example.

I have the entire dataset stored in the housing variable:

housing.shape

(20640, 10)

I have also one-hot encoded one of the columns with OneHotEncoder to get housing_cat_1hot, so

housing_cat_1hot.toarray().shape

(20640, 5)

My goal is to join the two variables and store everything in a single dataset.

I have tried the Join with index tutorial, but the problem is that the second matrix has no index. How can I do a JOIN between housing and housing_cat_1hot?

>>> left=housing
>>> right=housing_cat_1hot.toarray()
>>> result = left.join(right)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    result = left.join(right)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5293, in join
    rsuffix=rsuffix, sort=sort)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5323, in _join_compat
    can_concat = all(df.index.is_unique for df in frames)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5323, in <genexpr>
    can_concat = all(df.index.is_unique for df in frames)
AttributeError: 'numpy.ndarray' object has no attribute 'index'

3
Well, if you call toarray() it becomes a NumPy array. join takes either a DataFrame or a Series, not an array. Maybe left.join(housing_cat_1hot) is all you need. - Bharath

3 Answers

1
votes

Well, it depends on how you created the one-hot vector. But if it is sorted the same way as your original DataFrame, and is itself a DataFrame, you can give it the same index before joining:

housing_cat_1hot.index = range(len(housing_cat_1hot))

And if it's not a DataFrame, convert it to one. This is simple, as long as the rows of both objects are in the same order.

Edit: If it's not a DataFrame, then:

housing_cat_1hot = pd.DataFrame(housing_cat_1hot)

This already creates the proper index for you. (If housing_cat_1hot is a SciPy sparse matrix, call .toarray() on it first.)
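As a rough illustration of the idea above, here is a minimal sketch on toy data. The column names, values, and the variable one_hot_df are made up for the example, and the dense array stands in for housing_cat_1hot.toarray(). It reuses housing.index rather than a fresh range, which is an assumption on my part that should also cover the case where housing does not have a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (hypothetical values).
housing = pd.DataFrame(
    {
        "median_income": [8.3, 7.2, 5.6],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
    },
    index=[10, 20, 30],  # deliberately not 0..n-1
)

# Stand-in for housing_cat_1hot.toarray(): one row per housing row, same order.
one_hot = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

# Wrap the array in a DataFrame that shares housing's index, then join on it.
one_hot_df = pd.DataFrame(one_hot, index=housing.index)
result = housing.join(one_hot_df)

print(result.shape)
```

Because both frames carry the same index, the join lines up row by row and introduces no missing values.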

1
votes

If you wish to join the two arrays (assuming both housing_cat_1hot and housing are dense NumPy arrays), you can use:

housing = np.hstack((housing, housing_cat_1hot))
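A small sketch of what that call does, using made-up dense arrays (housing_values and one_hot are hypothetical names and values): the columns of the second array are appended to the right of the first, so both must have the same number of rows.

```python
import numpy as np

# Hypothetical numeric part of the housing data: 3 rows, 2 columns.
housing_values = np.array([[8.3, 41.0], [7.2, 21.0], [5.6, 52.0]])

# Hypothetical dense one-hot block for the same 3 rows.
one_hot = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

# Stack column-wise: the result has 2 + 2 = 4 columns.
combined = np.hstack((housing_values, one_hot))

print(combined.shape)
```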

That said, the best way to one-hot encode a variable is to select that variable within the array and encode it in place. It saves you the trouble of joining the two later.

Say the index of the variable you wish to encode in your array is 1,

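from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Map the string categories in column 1 to integer codes
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])

# One-hot encode column 1 (categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22)
onehotencoder = OneHotEncoder(categorical_features=[1])
X = onehotencoder.fit_transform(X).toarray()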
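On scikit-learn 0.22 or later, where categorical_features no longer exists, the same idea can be sketched with ColumnTransformer. This is a swapped-in approach rather than the code above; the toy array X and the names ct and X_encoded are assumptions for illustration. Note that the encoded columns come first in the output, followed by the untouched ones.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy array with a categorical variable in column 1 (hypothetical data).
X = np.array(
    [[1.0, "INLAND", 3.0],
     [2.0, "NEAR BAY", 4.0],
     [5.0, "INLAND", 6.0]],
    dtype=object,
)

# Encode only column 1 and pass the other columns through unchanged.
# sparse_threshold=0 asks for a dense array as the combined output.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X)

print(X_encoded.shape)
```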

0
votes

Thanks to @Elez-Shenhar's answer, I got the following working code:

OneHot = housing_cat_1hot.toarray()
OneHot = pd.DataFrame(OneHot)
result = housing.join(OneHot)
result.shape

(20640, 15)