pyspark unindex one-hot encoded and assembled columns

Question

I have the following code, which takes a mix of categorical and numeric features, string-indexes the categorical features, one-hot encodes them, assembles the one-hot encoded categorical features together with the numeric features, runs them through a random forest, and prints the resulting tree. I want the tree nodes to show the original feature names (i.e. Frame_Size etc.). How can I do that? In general, how can I decode one-hot encoded and assembled features?

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier as RF
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # categorical features: string indexing and one-hot encoding
    # (Article_Desc is not selected because its cardinality is too high)
    column_vec_in = ['Commodity', 'Frame_Size', 'Frame_Shape', 'Frame_Color', 'Frame_Color_Family', 'Lens_Color', 'Frame_Material', 'Frame_Material_Summary', 'Build', 'Gender_Global', 'Gender_LC']
    column_vec_out = ['Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec', 'Frame_Color_Family_catVec', 'Lens_Color_catVec', 'Frame_Material_catVec', 'Frame_Material_Summary_catVec', 'Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']

    indexers = [StringIndexer(inputCol=x, outputCol=x + '_tmp') for x in column_vec_in]

    encoders = [OneHotEncoder(dropLast=False, inputCol=x + '_tmp', outputCol=y) for x, y in zip(column_vec_in, column_vec_out)]
    # interleave the stages so that each indexer is followed by its encoder
    tmp = [[i, j] for i, j in zip(indexers, encoders)]
    tmp = [i for sublist in tmp for i in sublist]

    # categorical and numeric features
    cols_now = ['SODC_Regular_Rate', 'Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec', 'Frame_Color_Family_catVec', 'Lens_Color_catVec', 'Frame_Material_catVec', 'Frame_Material_Summary_catVec', 'Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
    assembler_features = VectorAssembler(inputCols=cols_now, outputCol='features')
    labelIndexer = StringIndexer(inputCol='Lens_Article_Description_reduced', outputCol='label')
    tmp += [assembler_features, labelIndexer]

    # converter = IndexToString(inputCol="featur", outputCol="originalCategory")
    # converted = converter.transform(indexed)

    pipeline = Pipeline(stages=tmp)

    # fit the feature pipeline and transform the same DataFrame
    all_data = pipeline.fit(df_random_forest_P_limited).transform(df_random_forest_P_limited)

    all_data.cache()
    trainingData, testData = all_data.randomSplit([0.8, 0.2], seed=0)

    rf = RF(labelCol='label', featuresCol='features', numTrees=10, maxBins=800)
    model = rf.fit(trainingData)

    # prints the trees with nodes labeled by feature index
    print(model.toDebugString)

After I run the Spark machine learning pipeline, I want to print out the random forest as a tree. Currently the nodes refer to features only by index.

What I actually want to see is the original categorical feature names instead of feature 1, feature 2, etc. The fact that the categorical features are one-hot encoded and vector-assembled makes it hard for me to unindex/decode the feature names. How can I unindex/decode one-hot encoded and assembled feature vectors in PySpark? I have a vague idea that I have to use IndexToString(), but I am not sure, because there is a mix of numeric and categorical features and they are one-hot encoded and assembled.

user1808924 · Accepted Answer · 2017-05-24T09:10:02

Export the Apache Spark ML pipeline to a PMML document using the JPMML-SparkML library. A PMML document can be inspected and interpreted by humans (e.g. using Notepad), or processed programmatically (e.g. using other Java PMML API libraries).

The "model schema" is represented by the /PMML/MiningModel/MiningSchema element. Each "active feature" is represented by a MiningField element; you can retrieve their "type definitions" by looking up the corresponding /PMML/DataDictionary/DataField element.
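As a rough illustration of that lookup, here is a minimal sketch in plain Python (standard library only) that lists the active fields of a PMML document together with their type definitions. The embedded PMML text is a hypothetical, heavily trimmed example written by hand, not actual JPMML-SparkML output; the field names are borrowed from the question, and the PMML 4.3 namespace and the helper name `active_fields` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML fragment (an assumption for illustration);
# a real export would be much larger and would be read from a file.
PMML_TEXT = """
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="Lens_Article_Description_reduced" optype="categorical" dataType="string"/>
    <DataField name="SODC_Regular_Rate" optype="continuous" dataType="double"/>
    <DataField name="Frame_Size" optype="categorical" dataType="string"/>
  </DataDictionary>
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="Lens_Article_Description_reduced" usageType="target"/>
      <MiningField name="SODC_Regular_Rate"/>
      <MiningField name="Frame_Size"/>
    </MiningSchema>
  </MiningModel>
</PMML>
"""


def active_fields(pmml_text):
    """Return (name, optype, dataType) for each active MiningField."""
    root = ET.fromstring(pmml_text.strip())
    # The root tag looks like "{namespace}PMML"; reuse that namespace for lookups.
    ns = {"p": root.tag[1:root.tag.index("}")]}

    # Type definitions from /PMML/DataDictionary/DataField, keyed by field name.
    types = {
        field.get("name"): (field.get("optype"), field.get("dataType"))
        for field in root.findall("p:DataDictionary/p:DataField", ns)
    }

    result = []
    for mining_field in root.findall("p:MiningModel/p:MiningSchema/p:MiningField", ns):
        # In PMML, a MiningField without usageType counts as "active".
        if mining_field.get("usageType", "active") != "active":
            continue
        name = mining_field.get("name")
        optype, data_type = types.get(name, (None, None))
        result.append((name, optype, data_type))
    return result


for name, optype, data_type in active_fields(PMML_TEXT):
    print(name, optype, data_type)
```

The point of the sketch is that the schema is expressed in terms of the original column names (Frame_Size rather than a position in the assembled vector), which is what the question is after.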

Edit: Since you were asking about PySpark, you might consider using the JPMML-SparkML-Package package for export.
