I'm trying to understand how spark.ml handles string categorical independent variables. I know that in Spark I have to convert strings to doubles using StringIndexer, e.g. "a"/"b"/"c" => 0.0/1.0/2.0.
But what I would really like to avoid is then having to use OneHotEncoder on that column of doubles. It seems to make the pipeline unnecessarily messy, especially since Spark already knows that the data is categorical. Hopefully the sample code below makes my question clearer.

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val df = sqlContext.createDataFrame(Seq(
  (0.0, "a"), (1.0, "b"), (1.0, "c"), (0.0, "c")
)).toDF("y", "x")

// index the string column "x"
val indexer = new StringIndexer().setInputCol("x").setOutputCol("xIdx").fit(df)
val indexed = indexer.transform(df)

// build a data frame of label, vectors
val assembler = new VectorAssembler().setInputCols(Array("xIdx")).setOutputCol("features")
val assembled = assembler.transform(indexed)

// build a logistic regression model and fit it
val logreg = new LogisticRegression().setFeaturesCol("features").setLabelCol("y")
val model = logreg.fit(assembled)

The logistic regression sees this as a model with only one independent variable.

model.coefficients
res1: org.apache.spark.mllib.linalg.Vector = [0.7667490491775728]

But the independent variable is categorical, with three categories = ["a", "b", "c"]. I know I never did a one-of-k encoding, but the data frame's metadata knows that the feature vector is nominal.

import org.apache.spark.ml.attribute.AttributeGroup
AttributeGroup.fromStructField(assembled.schema("features"))
res2: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"attrs":
{"nominal":[{"vals":["c","a","b"],"idx":0,"name":"xIdx"}]},
"num_attrs":1}}

How do I pass this information to LogisticRegression? Is this not the whole point of keeping DataFrame metadata? There does not seem to be a CategoricalFeaturesInfo in spark.ml. Do I really need to do a one-of-k encoding for each categorical feature?
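
For reference, the extra encoding step I'm trying to avoid would look something like the sketch below (same columns as above; "xVec" is just a name I picked for the encoded output, and depending on the Spark version OneHotEncoder may be an estimator that needs a fit() first):

import org.apache.spark.ml.feature.OneHotEncoder

// one-of-k encode the indexed column before assembling the feature vector
val encoder = new OneHotEncoder().setInputCol("xIdx").setOutputCol("xVec")
val encoded = encoder.transform(indexed)

// assemble the encoded vector instead of the raw index
val assembler2 = new VectorAssembler().setInputCols(Array("xVec")).setOutputCol("features")
val assembled2 = assembler2.transform(encoded)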

Why can't you use one-hot encoding directly on the strings? Some models explicitly allow categorical features, but logistic regression expects continuous features (e.g., a mapping of a:0, b:1, c:2 implies c is twice as much as b). – MattMcKnight
OneHotEncoder expects a column of DoubleType. See also the OneHotEncoder example, where they first transform using StringIndexer and then OneHotEncoder. – Ben
Yes, you really need the encoding. – zero323
Good answer to my long-winded question :). In that case, can you explain a little bit about what the point of that metadata is? Where will the fact that the first entry of this vector is actually a nominal variable named "xIdx" with vals = ["c", "a", "b"] get used? – Ben
@zero323 It actually is used right now to extract information for tree models. See here. One of the main differences between the ml and mllib implementations is the "use of DataFrame metadata to distinguish continuous and categorical features". – Ben
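
To make that last point concrete, here is a minimal sketch (assuming the assembled data frame from the question) of how a tree model consumes the nominal metadata that logistic regression ignores:

import org.apache.spark.ml.classification.DecisionTreeClassifier

// Tree implementations in spark.ml read the nominal attribute on "xIdx"
// and treat it as a categorical split rather than a continuous threshold.
// (Depending on the Spark version, the label column "y" may also need
// nominal metadata, e.g. from a StringIndexer, before fit() succeeds.)
val tree = new DecisionTreeClassifier().setFeaturesCol("features").setLabelCol("y")
val treeModel = tree.fit(assembled)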

1 Answer


Maybe I am missing something, but this really looks like a job for RFormula (https://spark.apache.org/docs/latest/ml-features.html#rformula).

As the name suggests, it takes an "R-style" formula that describes how the feature vector is composed from the input data columns.

For each categorical input column (that is, a column of StringType), it adds a StringIndexer + OneHotEncoder to the final pipeline that implements the formula under the hood.

The output is a feature vector of doubles that can be used with any algorithm in the org.apache.spark.ml package, such as the one you are targeting.
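
A minimal sketch of what that looks like for the data frame in the question (same "y" and "x" columns; RFormula indexes and one-hot encodes the string column "x" automatically):

import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.classification.LogisticRegression

// "y ~ x" means: the label is y, and the features are built from x;
// since x is a string column, it gets indexed and one-hot encoded under the hood
val formula = new RFormula()
  .setFormula("y ~ x")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(df).transform(df)

// LogisticRegression defaults to "features"/"label", which matches RFormula's output
val model = new LogisticRegression().fit(output)
// model.coefficients now has one entry per encoded category level
// (minus the reference level that RFormula drops)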