I'm trying to build a PySpark pipeline to prepare my data for a Random Forest model. I'm using Spark 2.2 (2.2.0.2.6.4.0-91).
My data contains no null values. I identified the categorical columns and numerical columns.
I index and one-hot encode the categorical columns and define my label (options['vae']). Then I use VectorAssembler to combine the features into a single vector column.
After fitting the pipeline and calling transform, I should get a Spark DataFrame with a label column and a features (vector) column.
Unfortunately, fitting fails with this error from Spark:
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o4563.fit.
: java.lang.IllegalArgumentException: requirement failed: Output column T_Q_DAV_indexed already exists.
I haven't found this error reported anywhere else, so I'm asking for help here.
Here is my code for the pipeline:
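from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# categoricalColumns, numericCols, cols, options and s_df are defined earlier (not shown).
print("One Hot Encoding...")
# Index each categorical column; "keep" puts unseen labels in an extra bucket.
stringIndexer = [StringIndexer(inputCol=col, outputCol=col + "_indexed").setHandleInvalid("keep") for col in categoricalColumns]
# One-hot encode each indexed column.
encoders = [OneHotEncoder(dropLast=True, inputCol=col + "_indexed", outputCol=col + "_class") for col in categoricalColumns]
# Index the target column into 'label'.
label_stringIdx = StringIndexer(inputCol=options['vae'], outputCol='label')
print("Vectorizing...")
# Combine the encoded categorical columns and the numeric columns into 'features'.
assembler = VectorAssembler(inputCols=[col + "_class" for col in categoricalColumns] + numericCols, outputCol="features")
print("Pipeline...")
pipeline = Pipeline(stages=stringIndexer + encoders + [label_stringIdx] + [assembler])
print('Fit...')
pipelineModel = pipeline.fit(s_df)
print('Transform...')
# Note: s_df is overwritten here with the transformed DataFrame.
s_df = pipelineModel.transform(s_df)
selectedCols = ['label', 'features'] + cols
s_df = s_df.select(selectedCols)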
[...] categoricalColumns, or T_Q_DAV_indexed exists in s_df? Could you try removing all columns but one from categoricalColumns? – Rayan Ral
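
The two checks suggested in the comment above could be sketched in plain Python as follows. This is only a rough diagnostic, not a confirmed fix: `find_output_collisions` is a hypothetical helper name, the `_indexed` suffix is taken from the question's code, and the sample column lists are made up for illustration.

```python
from collections import Counter


def find_output_collisions(input_cols, existing_cols, suffix="_indexed"):
    """Hypothetical helper: report likely causes of 'Output column ... already exists'.

    Returns a tuple (duplicates, clashes):
      duplicates - input column names listed more than once
      clashes    - output names (input + suffix) already present in existing_cols
    """
    counts = Counter(input_cols)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    existing = set(existing_cols)
    clashes = sorted({name + suffix for name in input_cols if name + suffix in existing})
    return duplicates, clashes


# Illustrative values only; in the question these would be
# categoricalColumns and s_df.columns.
sample_categorical = ["T_Q_DAV", "OTHER", "T_Q_DAV"]
sample_df_columns = ["T_Q_DAV", "OTHER", "T_Q_DAV_indexed"]

duplicates, clashes = find_output_collisions(sample_categorical, sample_df_columns)
print("Duplicated inputs:", duplicates)
print("Output columns already present:", clashes)
```

With the question's variables, the call would be `find_output_collisions(categoricalColumns, s_df.columns)` run just before `pipeline.fit(s_df)`. A non-empty first list would point to a repeated entry in `categoricalColumns`; a non-empty second list would suggest that `s_df` already contains indexed columns, for example because the snippet was run a second time after `s_df` had been overwritten with the transformed DataFrame.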