2
votes

I'm trying to create a pipeline in PySpark in order to prepare my data for Random Forest. I'm using Spark 2.2 (2.2.0.2.6.4.0-91).

My data contains no null values. I identified the categorical columns and numerical columns.

I'm encoding categorical columns and defining my label (options['vae']). Then I use VectorAssembler to get a single vector column for my features.

In the pipeline, after fit and transform I should get a spark dataframe with a label column and a features (vector) column.

Unfortunately when fitting I get this error with Spark:

Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o4563.fit. : java.lang.IllegalArgumentException: requirement failed: Output column T_Q_DAV_indexed already exists.*

I didn't find anywhere else this type of error. So I'm asking for help here.

Here is my code for the pipeline:

print("One Hot Encoding...")
stringIndexer  = [StringIndexer(inputCol=col, outputCol=col + "_indexed").setHandleInvalid("keep") for col in categoricalColumns]
encoders = [OneHotEncoder(dropLast=True, inputCol=col + '_indexed', outputCol=col+ "_class") for col in categoricalColumns]
label_stringIdx = StringIndexer(inputCol = options['vae'], outputCol = 'label')

print("Vectorizing...")
assembler = VectorAssembler(inputCols=[col+ "_class" for col in categoricalColumns] + numericCols, outputCol="features")

print("Pipeline...")
pipeline = Pipeline(stages = stringIndexer + encoders + [label_stringIdx] + [assembler])
    
print('Fit...')
pipelineModel = pipeline.fit(s_df)
    
print('Transform...')
s_df = pipelineModel.transform(s_df)
    
selectedCols = ['label', 'features'] + cols
s_df = s_df.select(selectedCols)
1
This may sound a bit obvious, but did you check there's no duplicates in categoricalColumns, or T_Q_DAV_indexed exists in s_df? Could you try removing all columns but one from categoricalColumns?Rayan Ral
You are right, I had a duplicate in my categorical columns. I found out about an hour ago. Thank you for the answer.hamanic

1 Answers

0
votes

I have the same question.

The solution is to eliminate the column T_Q_DAV_indexed

  • in categoricalColumns (only have T_Q_DAV) or
  • in s_df (I have the columns: T_Q_DAV, T_Q_DAV_indexed and T_Q_DAV_class) ?

(I used the names of the initial question)