1 vote

I have a Spark Dataset&lt;T&gt; loaded from a Cassandra table, and I want to apply a list of operations (a chain or pipeline) to this dataset.

For example:

Dataset<T> dataset = sparkSession.createDataset(javaFunctions(sparkSession.sparkContext())
                    .cassandraTable(...));

Dataset<Row> result = dataset.apply(func1()).apply(func2()).apply(func3());

func1() will replace null values with the most frequent ones.

func2() will add new columns with new values.

func3() ... etc.

What is the best way to apply this pipeline of functions?


3 Answers

1 vote

If your functions accept a Dataset and return a Dataset, i.e. have the signature:

public Dataset<U> myMethod(Dataset<T> ds) {
  ...
}

Then you can use the transform method defined on a Dataset to neatly apply your functions.

ds.transform(myMethod)
  .transform(myMethod1)
  .transform(myMethod2)
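
For example, func2 from the question could be written with this shape (a sketch; the addFullName name and the column names are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.concat_ws;

public static Dataset<Row> addFullName(Dataset<Row> ds) {
  // Derive a new column from existing ones and return the new Dataset.
  return ds.withColumn("full_name",
      concat_ws(" ", ds.col("first_name"), ds.col("last_name")));
}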

If the functions operate on standard Java objects, e.g.:

public U myMethod(T row) {
  ...
}

Then you want the map method defined on a Dataset. (Note that in the Java API, map also takes an Encoder for the result type; see the sketch below.)

ds.map(myMethod)
  .map(myMethod1)
  .map(myMethod2)
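
A minimal sketch of that signature, assuming a hypothetical Person bean with a getName() getter:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// map in the Java API: a MapFunction plus an Encoder for the output type.
Dataset<String> names = people.map(
    (MapFunction<Person, String>) person -> person.getName(),
    Encoders.STRING());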

Full API docs: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/Dataset.html

1 vote

Thanks to @wade-jensen's answer!

Here is the complete solution:

import scala.Function1;
import scala.runtime.AbstractFunction1;

Dataset<myClass> dataset = ....
Dataset<myClass> new_dataset = dataset.transform(method1(someParameters));

private static Function1<Dataset<myClass>, Dataset<myClass>> method1(someParameters) {
    return new AbstractFunction1<Dataset<myClass>, Dataset<myClass>>() {
        @Override
        public Dataset<myClass> apply(Dataset<myClass> dataset) {

           ...... some work here ....

            return dataset;
        }
    };
}
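
As a side note, if your Spark build uses Scala 2.12 or later, scala.Function1 is a functional interface, so a plain Java lambda can replace the AbstractFunction1 boilerplate (a sketch; the filter condition is made up):

// Assumes Spark compiled against Scala 2.12+, where a Java lambda
// can implement scala.Function1 directly.
Dataset<myClass> new_dataset = dataset
    .transform(ds -> ds.filter("some_column IS NOT NULL"));
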
0 votes

If you need to apply your function to each row, you can use the map operation.
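
For example, a per-row cleanup in the spirit of the question's func1 might look like this (a sketch; the Person bean and the hard-coded default stand in for the real most-frequent-value logic, which you would compute beforehand):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<Person> cleaned = people.map(
    (MapFunction<Person, Person>) p -> {
      if (p.getCity() == null) {
        p.setCity("Unknown"); // stand-in for the most frequent city
      }
      return p;
    },
    Encoders.bean(Person.class));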