0
votes

After preprocessing a PySpark DataFrame, I am trying to apply a pipeline to it, but I get the error below:

java.lang.AssertionError: assertion failed: No plan for MetastoreRelation.

What does this mean, and how can I solve it? My code has become quite large, so I will explain the steps instead:

1. I have 8000 columns and 68k rows in my Spark DataFrame. Of the 8k columns, 500 are categorical, and I applied pyspark.ml one-hot encoding to them as a stage in an ml.pipeline:
   encoders2 = [OneHotEncoder(inputCol=c, outputCol="{0}_enc".format(c)) for c in cat_numeric[i:i+2]]
   But this is very slow; even after 3 hours it had not completed. I am using 40 GB of memory on each of 12 nodes!
2. So I read 100 columns from the PySpark DataFrame, create a pandas DataFrame from them, and do the one-hot encoding there. Then I convert the pandas DataFrame back into a PySpark DataFrame and merge it with the original DataFrame.
3. Then I try to apply a pipeline with StringIndexer and OHE stages for the categorical string features, of which there are just 5, followed by an assembler to create 'features' and 'labels'. At this stage I get the above error.

Please let me know if my approach is wrong or if I am missing anything. Also let me know if you want more information. Thanks
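
For readers following along, the column batching described in the steps above can be sketched roughly as below. This is not the asker's code: `column_batches` and `build_encoder_stages` are hypothetical helper names, the batch size of 100 assumes the 100 columns in step 2 are one batch of several, and the placeholder column names are made up. The pyspark import is deferred into the function so the batching helper can be tried without a Spark installation.

```python
def column_batches(columns, batch_size):
    """Yield consecutive slices of `columns`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(columns), batch_size):
        yield columns[start:start + batch_size]


def build_encoder_stages(columns):
    """Build one OneHotEncoder stage per column, mirroring the question's list comprehension.

    Assumption: a pyspark version whose OneHotEncoder accepts inputCol/outputCol.
    """
    from pyspark.ml.feature import OneHotEncoder  # imported lazily; needs pyspark

    return [
        OneHotEncoder(inputCol=c, outputCol="{0}_enc".format(c))
        for c in columns
    ]


# Placeholder column names, purely for illustration.
cat_numeric = ["c{0}".format(i) for i in range(250)]
batches = list(column_batches(cat_numeric, 100))
batch_sizes = [len(b) for b in batches]
```

Each batch could then be passed to `build_encoder_stages` and wrapped in its own `Pipeline`, so that no single pipeline carries all 500 encoder stages. Whether that actually helps with the slowdown described in the question is not something this sketch establishes.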

1
It's hard to tell what exactly the root cause is; a code snippet might help to understand it properly. Can you please add sample code? - Rahul Sharma
Please add these details to your question using the edit option. - Rahul Sharma
@squid I get this error even when I run df.count() on my PySpark DataFrame. - Ajg

1 Answer

1
votes

This error was due to the order in which the two PySpark DataFrames were joined. I changed the order of the join from, say, a.join(b) to b.join(a), and it now works.
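
The answer does not show the join itself, so the following is only a minimal sketch of the swap, under stated assumptions: both DataFrames share a key column (the name `row_id` in the usage note is hypothetical), and an inner join is wanted. `join_swapped` is an illustrative helper, not part of PySpark; it only relies on the standard `DataFrame.join(other, on=..., how=...)` method.

```python
def join_swapped(a, b, key, how="inner"):
    """Join with `b` on the left, i.e. b.join(a) instead of a.join(b).

    `a` and `b` are expected to be pyspark.sql.DataFrame objects that both
    contain the column named by `key` (an assumption; the answer does not
    say how the two frames were matched).
    """
    # Passing the key as a column name keeps a single copy of it in the result.
    return b.join(a, on=key, how=how)
```

With `original_df` and `encoded_df` standing in for the two DataFrames, `join_swapped(original_df, encoded_df, "row_id")` is equivalent to `encoded_df.join(original_df, on="row_id", how="inner")`. For an inner join the resulting rows are the same either way; only the column order differs.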