How to obtain the number of features after preprocessing to use pyspark.ml neural network classifier?

Question

I am trying to build a neural network using pyspark.ml. The problem is that I am using onehotencoder and other pre-processing methods to transform the categorical variables. The stages in my pipeline are:

indexing the categorical features
using Onehotencoder
using Vector assembler
then I apply PCA
giving the "pcaFeatures" to a neural network classifier

But the problem is that I don't know the number of features after the step 4 to give it to "layers" of the classifier in step 5. My question is that how can I obtain the final number of features? here is my code, I did not include the import and data loading part.

stages = []
for c in Categories:
    stringIndexer = StringIndexer(inputCol= c , outputCol=c + "_indexed")
    encoder = OneHotEncoder(inputCol= c + "_indexed", outputCol=c + "_categoryVec")
    stages += [stringIndexer, encoder]

labelIndexer = StringIndexer(inputCol="Target", outputCol="indexedLabel")

final_features = list(map(lambda c: c+"_categoryVec", Categories))+Continuous


assembler = VectorAssembler(
    inputCols= final_features,
    outputCol="features")

pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")
(train_val, test_val) = train.randomSplit([0.95, 0.05])

num_classes= train.select("Target").distinct().count()

NN= MultilayerPerceptronClassifier(labelCol="indexedLabel", featuresCol='pcaFeatures', maxIter=100,
                                    layers=[????, 5, 5, num_classes], blockSize=10, seed=1234)


stages += [labelIndexer]
stages += [assembler]
stages += [pca]
stages += [NN]

pipeline = Pipeline(stages=stages)
model = pipeline.fit(train_val)

pault pault · Accepted Answer · 2018-01-29T16:31:22

From the docs, the input parameter k is the number of principal components.

So in your case:

pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")

The number of features is 20.

Update

Another way to do it would be to look at the length of one of the assembled vectors.

For example, if you wanted the length after Step 3:

from pyspark.sql.functions import udf, col
nfeatures = assembler.withColumn('len', udf(len, IntegerType())(col('features'))\
    .select('len').take(1)

I feel like there should be a better way to do this, i.e. without having to call take().

How to obtain the number of features after preprocessing to use pyspark.ml neural network classifier?

1 Answers