After preprocessing a PySpark DataFrame, I am trying to apply a pipeline to it, but I get the error below:
java.lang.AssertionError: assertion failed: No plan for MetastoreRelation.
What does this error mean, and how can I fix it?
My code has become quite large, so I will describe the steps instead.
1. My Spark DataFrame has 8,000 columns and 68k rows. Of the 8,000 columns, 500 are categorical. I first applied pyspark.ml one-hot encoding to them as stages in an ml pipeline:
    encoders2 = [OneHotEncoder(inputCol=c, outputCol="{0}_enc".format(c)) for c in cat_numeric[i:i+2]]
but this is very slow: even after 3 hours it had not finished. I am using 40 GB of memory on each of 12 nodes.
2. So instead I read 100 columns from the PySpark DataFrame, create a pandas DataFrame from them, and do the one-hot encoding there. Then I convert the pandas DataFrame back into a PySpark DataFrame and merge it with the original DataFrame.
3. Then I try to apply a pipeline whose stages are a StringIndexer and a OneHotEncoder for the categorical string features (there are only 5 of these), followed by an assembler that creates 'features' and 'labels'. This is the stage where I get the error above.
Please let me know if my approach is wrong or if I am missing anything, and tell me if you need more information. Thanks.
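
For reference, the pandas side of step 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual job code: the helper name `one_hot_in_chunks`, the `chunk_size` parameter, and the toy column names are all hypothetical, and processing the columns in batches is itself an assumption. In the real job the pandas frame would come from `toPandas()` on a column subset and go back to Spark via `createDataFrame`.

```python
import pandas as pd


def one_hot_in_chunks(pdf, cat_cols, chunk_size=100):
    """One-hot encode `cat_cols` of a pandas DataFrame, `chunk_size` columns at a time.

    Hypothetical helper: the name and the batching scheme are illustrative only.
    """
    encoded_parts = []
    for start in range(0, len(cat_cols), chunk_size):
        chunk = cat_cols[start:start + chunk_size]
        # Passing `columns=` makes get_dummies encode numeric columns too;
        # the new columns are named "<column>_<value>".
        encoded_parts.append(pd.get_dummies(pdf[chunk], columns=chunk))
    # Keep the non-categorical columns and append the encoded ones.
    rest = pdf.drop(columns=cat_cols)
    return pd.concat([rest] + encoded_parts, axis=1)


# Toy data standing in for a slice of the real DataFrame (made-up values).
pdf = pd.DataFrame({"id": [1, 2, 3], "c1": [0, 1, 0], "c2": [2, 2, 3]})
encoded = one_hot_in_chunks(pdf, ["c1", "c2"], chunk_size=1)

# Converting back to Spark would then be something like
# spark.createDataFrame(encoded), where `spark` is an existing session
# (not created in this sketch).
```

The result keeps `id` and adds one indicator column per distinct value of each categorical column, which is what then gets merged back into the original DataFrame.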