
I'm trying to build a regression model where the underlying feature matrix is large (418K rows by 73K columns) and very sparse (58M non-zero values, around 0.2% of the entire matrix).

I have the matrix in coordinate representation as a DataFrame: the first column is the row coordinate i, the second is the column coordinate j, and the third is the value at position (i, j).

E.g. the following matrix:

|0|1|0|
|2|0|0|
|0|0|3|

is represented by

|i|j|value|
|0|1| 1   |
|1|0| 2   |
|2|2| 3   |
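To make the representation concrete, here is a minimal plain-Scala sketch (no Spark) that groups coordinate entries by row and sorts each row's entries by column index, which is the shape a sparse vector needs. The names `Entry`, `SparseRow` and `CooExample` are hypothetical and only used for illustration; column indices are assumed to be 0-based.

```scala
// Hypothetical illustration types; not part of any Spark API.
final case class Entry(i: Int, j: Int, value: Double)
final case class SparseRow(indices: Array[Int], values: Array[Double])

object CooExample {
  // Group coordinate entries by row i and sort each row's entries by column j.
  def toSparseRows(entries: Seq[Entry]): Map[Int, SparseRow] =
    entries.groupBy(_.i).map { case (i, es) =>
      val sorted = es.sortBy(_.j)
      i -> SparseRow(sorted.map(_.j).toArray, sorted.map(_.value).toArray)
    }

  // Number of columns, assuming 0-based column indices: largest index + 1.
  def numCols(entries: Seq[Entry]): Int =
    if (entries.isEmpty) 0 else entries.map(_.j).max + 1

  // The three entries from the coordinate table above.
  val entries: Seq[Entry] = Seq(Entry(0, 1, 1.0), Entry(1, 0, 2.0), Entry(2, 2, 3.0))
  val rows: Map[Int, SparseRow] = toSparseRows(entries)
  val nCols: Int = numCols(entries)
}
```

Each row then carries only its non-zero column indices and values, plus one shared column count for the whole matrix.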

I have a separate DataFrame containing the label for every row i.

If possible, I'd prefer a solution that uses the newer ml library rather than the older mllib.


Below is a small code example showing how to implement distributed sparse linear regression in Spark ML. I ran it on the matrix in question on a large cluster (Databricks Runtime version 6.5 ML, which includes Apache Spark 2.4.5 and Scala 2.11); it scaled well and took just a few minutes to execute.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.feature.LabeledPoint
import spark.implicits._
import org.apache.spark.ml.regression.LinearRegression

// Construct Matrix coordinate representation DataFrame
val df = Seq(
  (0, 1, 14.0), 
  (0, 0, 13.0), 
  (1, 1, 11.0)
).toDF("i", "j", "value")


|  i|  j|value|
|  0|  1| 14.0|
|  0|  0| 13.0|
|  1|  1| 11.0|

// Construct label DataFrame
val df_label = Seq(
  (0, 41.1), 
  (1, 21.9) // labels are roughly 1 * x_0 + 2 * x_1 (beta_1 = 1, beta_2 = 2) plus small noise
).toDF("i", "label")


|  i|label|
|  0| 41.1|
|  1| 21.9|

// Use a UDF to sort each row's (j, value) pairs by column index j
val sortUdf: UserDefinedFunction = udf((rows: Seq[Row]) => {
  rows.map { case Row(j: Int, value: Double) => (j, value) }
    .sortBy { case (j, _) => j }
})

// Group by row i, collect the j and value columns into lists sorted by j,
// then join with the labels
val df_collected_with_labels = df
  .groupBy("i")
  .agg(collect_list(struct("j", "value")) as "j_value")
  .select($"i", sortUdf(col("j_value")).alias("j_value_list"))
  .withColumn("j_list", $"j_value_list".getField("_1"))
  .withColumn("value_list", $"j_value_list".getField("_2"))
  .drop("j_value_list")
  .join(df_label, "i")

|  i|j_list|  value_list|label|
|  1|   [1]|      [11.0]| 21.9|
|  0|[0, 1]|[13.0, 14.0]| 41.1|
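As a possible alternative to the UDF, the built-in `sort_array` function can sort the collected structs directly, because structs are compared field by field and `j` is the first field. This is a sketch under that assumption, not the code I ran on the cluster; the local `SparkSession` and the names `collected` and `jv` are only there to make the example self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

// Local session for illustration only; in a notebook `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sort-array-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq((0, 1, 14.0), (0, 0, 13.0), (1, 1, 11.0)).toDF("i", "j", "value")

val collected = df
  .groupBy("i")
  // sort_array orders the structs by their first field (j), then by value
  .agg(sort_array(collect_list(struct($"j", $"value"))).as("jv"))
  // selecting a struct field from an array of structs yields an array of that field
  .select($"i", $"jv.j".as("j_list"), $"jv.value".as("value_list"))
```

This avoids deserializing rows into a Scala UDF, which may matter at the scale of the matrix in question; I have not benchmarked the difference.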

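// Vector size = number of distinct column indices. This is only correct when
// every index from 0 until n occurs at least once; otherwise use max(j) + 1.
val unique_j = df.dropDuplicates("j").count().toInt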

// Build one LabeledPoint per row; the label is read by name rather than by position
val sparse_df = df_collected_with_labels
  .map(r => LabeledPoint(r.getAs[Double]("label"),
                         new SparseVector(size = unique_j,
                                          indices = r.getAs[Seq[Int]]("j_list").toArray,
                                          values = r.getAs[Seq[Double]]("value_list").toArray)))


|label|            features|
| 21.9|      (2,[1],[11.0])|
| 41.1|(2,[0,1],[13.0,14...|

// Fit sparse regression!
val lr = new LinearRegression()

val lrModel = lr.fit(sparse_df)

lrModel.coefficients
org.apache.spark.ml.linalg.Vector = [1.0174825174825193,1.9909090909090894]

lrModel.predict(new SparseVector(size = unique_j, indices = Array(0), values = Array(4.0)))
Double = 4.069930069930077
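For intuition, a linear model's prediction on a sparse vector only touches the stored entries: it is the intercept plus the sum of coefficient times value over the non-zero indices. The plain-Scala sketch below illustrates that arithmetic; `SparsePredict` is a hypothetical helper, not Spark's implementation, and the intercept is passed in explicitly because the fitted intercept is not shown above.

```scala
object SparsePredict {
  // intercept + sum over stored entries of coefficients(index) * value
  def predict(coefficients: Array[Double],
              intercept: Double,
              indices: Array[Int],
              values: Array[Double]): Double = {
    require(indices.length == values.length, "indices and values must have equal length")
    var acc = intercept
    var k = 0
    while (k < indices.length) {
      acc += coefficients(indices(k)) * values(k)
      k += 1
    }
    acc
  }
}

// Toy usage with made-up numbers: 0.5 + 2.0 * 3.0
val toyPrediction = SparsePredict.predict(Array(1.0, 2.0), 0.5, Array(1), Array(3.0))
```

The cost per prediction is proportional to the number of non-zero entries in the row, not to the 73K columns.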