
I have 2 JavaRDDs. The first one is

JavaRDD<CustomClass> data

and the second one is

JavaRDD<Vector> features

CustomClass has 2 fields: (String) text and (int) label. There are 1000 instances of CustomClass in the JavaRDD data and 1000 instances of Vector in the JavaRDD features.

I have computed these 1000 vectors by using the JavaRDD data and applying a map function on it.

Now, I want to have a new JavaRDD of the form

JavaRDD<LabeledPoint>

The LabeledPoint constructor requires both a label and a vector, but the call method of a map function accepts only one argument, so I cannot write a map function that takes both a CustomClass and a Vector.

Can someone please tell me how to combine these two JavaRDDs to get the new JavaRDD<LabeledPoint>?

Here are some snippets from the code I wrote:

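    class CustomClass {
        String text; int label;
        // accessors: getText() is used in the map below
        String getText() { return text; }
        int getLabel() { return label; }
    }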

    JavaRDD<CustomClass> data = getDataFromFile(filename);

    final HashingTF hashingTF = new HashingTF();
    final IDF idf = new IDF();
    final JavaRDD<Vector> td2 = data.map(
            new Function<CustomClass, Vector>() {
                @Override
                public Vector call(CustomClass cd) throws Exception {
                    Vector v = new DenseVector(hashingTF.transform(Arrays.asList(cd.getText().split(" "))).toArray());
                    return v;
                }
            }
    );

    final JavaRDD<Vector> features = idf.fit(td2).transform(td2);
1 Answer


You can use JavaRDD#zip:

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

    JavaPairRDD<CustomClass,Vector> dataAndFeatures = data.zip(features);
    // TODO dataAndFeatures.map to LabeledPoint instances

The assumption stated in the docs (same number of partitions and the same number of elements in each partition) holds here, since you create td2 by a simple map over data. features is then the result of transform on an IDFModel instance, which also keeps the elements aligned.
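To fill in the TODO, each zipped pair can be mapped to a LabeledPoint. With Spark this would presumably look like `dataAndFeatures.map(p -> new LabeledPoint(p._1().getLabel(), p._2()))` (or the equivalent anonymous Function), assuming CustomClass exposes a getLabel() accessor, which the question does not show. The sketch below reproduces the same "zip by position, then map each pair" logic on plain Java lists so it can be run without a Spark cluster. Doc, LabeledRow and zipToLabeled are hypothetical stand-ins for CustomClass, LabeledPoint and the zip/map chain; they are not Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Spark-free sketch of "zip by position, then map each pair".
// Doc, LabeledRow and zipToLabeled are hypothetical stand-ins, not Spark APIs.
class ZipSketch {

    // Stand-in for CustomClass: a text and its integer label.
    static final class Doc {
        final String text;
        final int label;
        Doc(String text, int label) { this.text = text; this.label = label; }
    }

    // Stand-in for LabeledPoint: a double label plus a feature vector.
    static final class LabeledRow {
        final double label;
        final double[] features;
        LabeledRow(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Pairs element i of docs with element i of features, as zip does,
    // and builds one labeled row per pair.
    static List<LabeledRow> zipToLabeled(List<Doc> docs, List<double[]> features) {
        if (docs.size() != features.size()) {
            throw new IllegalArgumentException("both inputs must have the same number of elements");
        }
        List<LabeledRow> out = new ArrayList<>(docs.size());
        for (int i = 0; i < docs.size(); i++) {
            out.add(new LabeledRow(docs.get(i).label, features.get(i)));
        }
        return out;
    }
}
```

The size check mirrors the precondition quoted from the docs: pairing by position is only meaningful when both inputs have the same elements in the same order, which is the case when one was derived from the other by element-wise transformations.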