3
votes

I would like to do 2 things:

  1. Display the contents of the RDD splitRDD on the console.
  2. Save the results to a text file.

The 3rd line of the Scala code below prints out the key, but I am looking for the value.

val emailMsg = sc.textFile(file);
val splitRDD = emailMsg.map( line => line.split("."));
splitRDD.foreach(println);
splitRDD.coalesce(1).saveAsTextFile("newfile")
2
Can you please add the schema of splitRDD? - Anurag Sharma
I am not sure what you mean by "The 3rd line of scala code below prints out the key". The split method returns an array, and an array's toString method does not print any of its members. If you want to print only the second item (for example), you should do something like: splitRDD.foreach(row => println(row(1))). - stefanobaghino
Why are you doing line => line.split(".")? Can you give a sample of the input file and your expected print output? - Gsquare
Thanks for your help. I am doing a split on email messages (text file) and trying to separate out the from, to, date and subject. So, I am looking for words like "Subject:" and "To:" and so on. - Steve McAffer

2 Answers

1
votes

I assume that your file looks like this:

key1.value1
key2.value2

and that you want to print and save either the values or the pairs in some other format.

If you want to print and save just the values, you can map splitRDD to an RDD that contains only the values.

val valRDD = splitRDD.map( _( 1 ) )
valRDD.foreach( println )

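Note that saveAsTextFile does not produce a single, easy-to-use file: it writes a directory (here newfile) of part-XXXXX files, and for an RDD of arrays each line is the array's default toString rather than its contents. If you need a plain file in a specific format, a simple text writer on the driver will do (Java's PrintWriter works just fine).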
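
If the part-file layout is acceptable, a minimal sketch of the Spark-only route is to turn each array into a string before printing or saving. The object name SaveValuesSketch, the helper toLine, and the local[*] master are assumptions made for illustration; the paths are the ones used elsewhere in this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, used only for this sketch.
object SaveValuesSketch {

  // Pure helper: turn one split line into a readable comma-separated string.
  def toLine(fields: Array[String]): String = fields.mkString(",")

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("SaveValuesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val emailMsg = sc.textFile("data/emailMsg.txt")
      val splitRDD = emailMsg.map(line => line.split('.'))
      val lines = splitRDD.map(toLine)

      // take(n) returns a small sample to the driver, so println runs there
      // and not on the executors.
      lines.take(10).foreach(println)

      // Writes a directory named "newfile" containing part files.
      // The call fails if that directory already exists.
      lines.coalesce(1).saveAsTextFile("newfile")
    } finally {
      sc.stop()
    }
  }
}
```

On a cluster, rdd.foreach(println) writes to the executors' stdout, which is why the sketch prints a take(10) sample on the driver.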

Example that prints and saves splitRDD in two different formats:

import org.apache.spark._
import java.io.{ PrintWriter, File }

...

val pwText = new PrintWriter(
    new File( "emailMsgValues.txt" )
)

val pwCSV = new PrintWriter(
    new File( "emailMsgPair.csv" )
)

val emailMsg = sc.textFile( "data/emailMsg.txt" )

val splitRDD = emailMsg.map( line => line.split( '.' ) )

println( "Printing and writing values in text" )

// collect() brings the values to the driver as a local Array, not an RDD
val values = splitRDD.map( _( 1 ) ).collect()

values.foreach( value => {

    println( value )
    pwText.write( value + "\n" )
} )

println( "Printing and writing pairs in csv" )

splitRDD.collect().foreach( pair => {

    println( pair.mkString( "," ) )
    pwCSV.write( pair.mkString( "," ) + "\n" )

} )

pwText.close()
pwCSV.close()
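
One detail worth spelling out: the example above splits on the character '.', while the code in the question passes the string ".". A string argument to split is interpreted as a regular expression, in which "." matches any character. The following plain-Scala sketch (no Spark needed; the object and method names are made up for illustration) shows the difference:

```scala
// Hypothetical names, used only for this illustration.
object SplitDemo {

  // String argument: a regex in which "." matches every character.
  def splitOnRegexDot(line: String): Array[String] = line.split(".")

  // Char argument: splits on the literal dot.
  def splitOnChar(line: String): Array[String] = line.split('.')

  // Escaped regex: also splits on the literal dot.
  def splitOnEscaped(line: String): Array[String] = line.split("\\.")

  def main(args: Array[String]): Unit = {
    val line = "key1.value1"

    // Every character is a delimiter, and the resulting empty strings
    // are dropped, so the array is empty.
    println(splitOnRegexDot(line).length)        // 0

    println(splitOnChar(line).mkString(","))     // key1,value1
    println(splitOnEscaped(line).mkString(","))  // key1,value1
  }
}
```

With an empty array, an expression such as _( 1 ) would throw an ArrayIndexOutOfBoundsException, so the choice of delimiter form matters here.
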
1
votes

The third line is not printing the key. It is printing the Array object itself, something like this:

[Ljava.lang.String;@384efaf
[Ljava.lang.String;@5bc8b97c
[Ljava.lang.String;@18194125
[Ljava.lang.String;@364838ab
[Ljava.lang.String;@254b1df2

One way to get readable output is to convert each Array into a Spark SQL Row object, whose toString prints the fields. To do that, use

import org.apache.spark.sql.Row

and change the second line of your code like this (note that this example splits on spaces rather than on "."):

val splitRDD = emailMsg.map( line => Row.fromSeq(line.split(" ")))
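
As an alternative that does not need Spark SQL, mkString can render the array contents directly. This is a minimal plain-Scala sketch; the object name, the render helper, and the sample line are made up for illustration:

```scala
// Hypothetical names and sample data, used only for this illustration.
object ArrayPrintDemo {

  // Render the fields of one split line in a readable form.
  def render(fields: Array[String]): String = fields.mkString("[", ", ", "]")

  def main(args: Array[String]): Unit = {
    val fields = "From: alice To: bob".split(" ")

    // Default toString of a JVM array: the type tag plus an identity hash,
    // in the form [Ljava.lang.String;@<hash> (the hash varies per run).
    println(fields)

    // mkString prints the members themselves.
    println(render(fields))   // [From:, alice, To:, bob]
  }
}
```

Applied to the RDD from the question, the same idea would read splitRDD.map(_.mkString(",")).foreach(println).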