I have the following code, which takes a mix of categorical and numeric features, string-indexes and then one-hot encodes the categorical features, assembles the one-hot encoded and numeric features into a single vector, runs them through a random forest, and prints the resulting tree. I want the tree nodes to show the original feature names (e.g. Frame_Size). How can I do that? More generally, how can I decode one-hot encoded and assembled features?
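# imports (assumed here; RF is taken to be an alias for RandomForestClassifier, since the label is string-indexed)
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier as RF
# categorical features: start string indexing and one-hot encoding
column_vec_in = ['Commodity','Frame_Size' , 'Frame_Shape', 'Frame_Color','Frame_Color_Family','Lens_Color','Frame_Material','Frame_Material_Summary','Build', 'Gender_Global', 'Gender_LC'] # frame Article_Desc not selected because the cardinality is too high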
column_vec_out = ['Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec','Frame_Color_Family_catVec','Lens_Color_catVec','Frame_Material_catVec','Frame_Material_Summary_catVec','Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
indexers = [StringIndexer(inputCol=x, outputCol=x+'_tmp') for x in column_vec_in ]
encoders = [OneHotEncoder(dropLast=False, inputCol=x+"_tmp", outputCol=y) for x,y in zip(column_vec_in, column_vec_out)]
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
#categorical and numeric features
cols_now = ['SODC_Regular_Rate','Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec','Frame_Color_Family_catVec','Lens_Color_catVec','Frame_Material_catVec','Frame_Material_Summary_catVec','Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
assembler_features = VectorAssembler(inputCols=cols_now, outputCol='features')
labelIndexer = StringIndexer(inputCol='Lens_Article_Description_reduced', outputCol="label")
tmp += [assembler_features, labelIndexer]
# converter = IndexToString(inputCol="featur", outputCol="originalCategory")
# converted = converter.transform(indexed)
pipeline = Pipeline(stages=tmp)
all_data = pipeline.fit(df_random_forest_P_limited).transform(df_random_forest_P_limited)
all_data.cache()
trainingData, testData = all_data.randomSplit([0.8,0.2], seed=0)
rf = RF(labelCol='label', featuresCol='features',numTrees=10,maxBins=800)
model = rf.fit(trainingData)
print(model.toDebugString)
After running the Spark ML pipeline, I want to print the random forest as a tree. Currently the printed output identifies each split only by its feature index.
What I actually want to see is the original categorical feature names instead of feature 1, feature 2, etc. Because the categorical features are one-hot encoded and then vector-assembled, it is hard for me to unindex/decode the feature names. How can I unindex/decode one-hot encoded and assembled feature vectors in PySpark? I have a vague idea that I need to use IndexToString(), but I am not sure, because the vector mixes numeric and categorical features that have been one-hot encoded and assembled.
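To make the goal concrete, here is a minimal sketch of the mapping I am after, written in plain Python without Spark. It rests on assumptions I have not verified against my pipeline: that the assembled vector lays out its slots in the order of the assembler's input columns, that each numeric column takes one slot, and that each one-hot vector takes one slot per category because dropLast=False. The helper names (build_feature_names, rename_features), the category labels, and the sample debug string are all hypothetical. In the real pipeline the labels would presumably come from each fitted StringIndexerModel's labels attribute.

```python
import re


def build_feature_names(assembler_cols, labels_by_col):
    """Return one readable name per slot of the assembled vector.

    assembler_cols: the input columns of the assembler, in order.
    labels_by_col: maps a one-hot output column to the ordered category
        labels of its string indexer. Columns missing from this dict are
        treated as numeric and occupy a single slot.
    Assumes dropLast=False, so a one-hot vector has len(labels) slots.
    """
    names = []
    for col in assembler_cols:
        labels = labels_by_col.get(col)
        if labels is None:
            names.append(col)  # numeric column: one slot
        else:
            # Strip the '_catVec' suffix to recover the original column name.
            base = col[:-len("_catVec")] if col.endswith("_catVec") else col
            names.extend(f"{base}={label}" for label in labels)
    return names


def rename_features(debug_string, names):
    """Replace every 'feature N' in a tree dump with names[N]."""
    return re.sub(r"feature (\d+)",
                  lambda m: names[int(m.group(1))],
                  debug_string)


# Hypothetical inputs, for illustration only.
cols = ["SODC_Regular_Rate", "Frame_Size_catVec", "Gender_Global_catVec"]
labels = {
    "Frame_Size_catVec": ["M", "L", "S"],
    "Gender_Global_catVec": ["Male", "Female"],
}
names = build_feature_names(cols, labels)

sample = "If (feature 0 <= 99.5)\n If (feature 2 <= 0.5)\n Else (feature 5 > 0.5)"
decoded = rename_features(sample, names)
print(decoded)
```

With something like this, a split on slot 2 would read as a split on Frame_Size=L rather than on an anonymous index. What I do not know is how to obtain the per-column labels and slot order reliably from the fitted pipeline.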