2
votes

I am using Spark with Java 8. I have a dataframe where one of the columns contains a single mllib.linalg.Vector. I want to groupBy one of the other columns in the dataframe, say an ID column, and "collect_list" the feature vectors into a list. I'm getting the error below. I don't understand why. This is a generic operation, why does it care about the type of the data in the column? it works fine for scalar numbers, or strings, etc, but does not seem to work for mllib Vector. Is there a workaround this?, maybe another function other than collect_list()?

No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: org.apache.spark.mllib.linalg.VectorUDT@f71b0bce (of class org.apache.spark.mllib.linalg.VectorUDT)
1

1 Answers

1
votes

What is the Spark version you are using? With Spark 1.6.2, it throws same error you have mentioned but this is working fine with Spark 2.0.1. See the sample code and output below.

public class JavaVectorExample {
    public static void main(String[] args) {
    //SparkSession
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaVectorExample")
      .master("local[2]")
      .getOrCreate();
    //schema
    StructType schema = createStructType(new StructField[]{
      createStructField("id", IntegerType, false),
      createStructField("label", DoubleType, false),
      createStructField("features", new VectorUDT(), false),
    });
    //dataset
    Row row1 = RowFactory.create(0, 1.0, Vectors.dense(0.0, 10.0, 0.5));
    Row row2 = RowFactory.create(1, 1.0, Vectors.dense(1.0, 10.5, 0.5));
    Row row3 = RowFactory.create(0, 1.5, Vectors.dense(0.0, 10.5, 1.0));
    Dataset<Row> dataset = spark.createDataFrame(Arrays.asList(row1,row2,row3), schema);
    dataset.printSchema();
    //groupby
    dataset.groupBy(col("id")).agg(collect_list(col("features"))).show(false);
    spark.stop();
  }
}

Here is the output.

+---+--------------------------------+
|id |collect_list(features)          |
+---+--------------------------------+
|1  |[[1.0,10.5,0.5]]                |
|0  |[[0.0,10.0,0.5], [0.0,10.5,1.0]]|
+---+--------------------------------+