1 vote

I have a Spark Dataset&lt;T&gt; loaded from a Cassandra table, and I want to apply a list of operations (a chain or pipeline) to this dataset.

For example:

Dataset<T> dataset = sparkSession.createDataset(javaFunctions(sparkSession.sparkContext())
                    .cassandraTable(...));

Dataset<Row> result = dataset.apply(func1()).apply(func2()).apply(func3());

func1() will replace null values with the most frequent ones.

func2() will add new columns with new values.

func3() ... etc.

What is the best way to apply this pipeline of functions?


3 Answers

1 vote

If your functions accept a Dataset and return a Dataset, i.e. have the signature:

public Dataset<U> myMethod(Dataset<T> ds) {
  ...
}

Then you can use the transform method defined on a Dataset to neatly apply your functions.

ds.transform(myMethod)
  .transform(myMethod1)
  .transform(myMethod2)
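
For example, func2 from the question could be written with this shape (a sketch; the addFullName name and the column names are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.concat_ws;

public static Dataset<Row> addFullName(Dataset<Row> ds) {
  // Derive a new column from existing ones and return the new Dataset.
  return ds.withColumn("full_name",
      concat_ws(" ", ds.col("first_name"), ds.col("last_name")));
}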

If the functions operate on standard Java objects, e.g.:

public U myMethod(T row) {
  ...
}

Then you want the map method defined on a Dataset. (Note that in the Java API, map also takes an Encoder for the result type; see the sketch below.)

ds.map(myMethod)
  .map(myMethod1)
  .map(myMethod2)
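
A minimal sketch of that signature, assuming a hypothetical Person bean with a getName() getter:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// map in the Java API: a MapFunction plus an Encoder for the output type.
Dataset<String> names = people.map(
    (MapFunction<Person, String>) person -> person.getName(),
    Encoders.STRING());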

Full API docs: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/Dataset.html

1 vote

Thanks to @wade-jensen's answer!

Here is the complete solution:

import scala.Function1;
import scala.runtime.AbstractFunction1;

Dataset<myClass> dataset = ....
Dataset<myClass> new_dataset = dataset.transform(method1(someParameters));

private static Function1<Dataset<myClass>, Dataset<myClass>> method1(someParameters) {
    return new AbstractFunction1<Dataset<myClass>, Dataset<myClass>>() {
        @Override
        public Dataset<myClass> apply(Dataset<myClass> dataset) {

           ...... some work here ....

            return dataset;
        }
    };
}
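
As a side note, if your Spark build uses Scala 2.12 or later, scala.Function1 is a functional interface, so a plain Java lambda can replace the AbstractFunction1 boilerplate (a sketch; the filter condition is made up):

// Assumes Spark compiled against Scala 2.12+, where a Java lambda
// can implement scala.Function1 directly.
Dataset<myClass> new_dataset = dataset
    .transform(ds -> ds.filter("some_column IS NOT NULL"));
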
0 votes

If you need to apply your function to each row, you can use the map operation.
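
For example, a per-row cleanup in the spirit of the question's func1 might look like this (a sketch; the Person bean and the hard-coded default stand in for the real most-frequent-value logic, which you would compute beforehand):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<Person> cleaned = people.map(
    (MapFunction<Person, Person>) p -> {
      if (p.getCity() == null) {
        p.setCity("Unknown"); // stand-in for the most frequent city
      }
      return p;
    },
    Encoders.bean(Person.class));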