I am trying to convert Kafka messages, which arrive as a huge RDD, to Parquet format and save them in HDFS using Spark Streaming. Each line is a syslog message of the form name1=value1|name2=value2|name3=value3. Any pointers on how to achieve this in Spark Streaming?

2 Answers

You can save an RDD to Parquet without converting it to a DataFrame, as long as you have an Avro schema for it.
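For the message format in the question, a minimal sketch of such a schema, assuming all three fields are plain strings (the record name SyslogRecord is just illustrative):

// built with org.apache.avro.SchemaBuilder
Schema schema = SchemaBuilder.record("SyslogRecord")
        .fields()
        .requiredString("name1")
        .requiredString("name2")
        .requiredString("name3")
        .endRecord();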

Here is a sample function:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Time;
import scala.Tuple2;

public <T> void save(JavaRDD<T> rdd, Class<T> clazz, Time timeStamp, Schema schema, String path) throws IOException {
    Job job = Job.getInstance();
    // write records through the Avro-to-Parquet bridge using the supplied schema
    ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
    AvroParquetOutputFormat.setSchema(job, schema);
    // LazyOutputFormat avoids creating empty part files for empty partitions
    LazyOutputFormat.setOutputFormatClass(job, ParquetOutputFormat.class);
    // skip the _SUCCESS marker and the Parquet summary files
    job.getConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
    job.getConfiguration().set("parquet.enable.summary-metadata", "false");

    // save one directory per batch, keyed by the batch timestamp
    rdd.mapToPair(me -> new Tuple2<Void, T>(null, me))
            .saveAsNewAPIHadoopFile(
                    String.format("%s/%s", path, timeStamp.milliseconds()),
                    Void.class,
                    clazz,
                    LazyOutputFormat.class,
                    job.getConfiguration());
}
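
To tie this back to the question: inside the streaming job you can parse each name1=value1|name2=value2|name3=value3 line into an Avro GenericRecord and call save on every batch. A rough sketch under the schema above, assuming a JavaDStream<String> named lines carries the raw messages and the output path is hypothetical:

// assumes org.apache.avro.generic.GenericData and GenericRecord imports,
// and that the field names in the messages match the schema fields
lines.foreachRDD((rdd, time) -> {
    JavaRDD<GenericRecord> records = rdd.map(line -> {
        GenericRecord record = new GenericData.Record(schema);
        // split "name1=value1|name2=value2|..." into key/value pairs
        for (String pair : line.split("\\|")) {
            String[] kv = pair.split("=", 2);
            record.put(kv[0], kv[1]);
        }
        return record;
    });
    save(records, GenericRecord.class, time, schema, "hdfs:///data/syslog");
});

One caveat: org.apache.avro.Schema only became Serializable in recent Avro releases, so with older versions pass the schema to the workers as a String and re-parse it with new Schema.Parser().parse(...) inside the closure.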