1 vote

Currently, two Avro files are generated for a 10 KB file. If I follow the same approach with my actual file (30 MB+), I will get n files.

So I need a solution that generates only one or two .avro files even if the source file is large.

Also, is there any way to avoid the manual declaration of column names?

Current approach:

spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Manual schema declaration of the 'ind' and 'co' column names and types
val customSchema = StructType(Array(
  StructField("ind", StringType, true),
  StructField("co", StringType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")

df.write.format("com.databricks.spark.avro").save("/tmp/avroout")

// Note: /tmp/file.txt is the input file/dir, and /tmp/avroout is the output dir

1
Sorry, I don't get it: so you have one input file and you want to generate two Avro files out of it (instead of the n you have right now)? Is this correct? So your question is about how to do this with Spark? – TobiSH

1 Answer

1 vote

Try specifying the number of partitions of your DataFrame while writing the data as Avro (or any other format). Use the repartition or coalesce DataFrame functions to control how many output files are produced.

df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")

This writes only a single file to "/tmp/avroout".
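
If you want exactly two files instead, repartition works the same way. A minimal sketch (note that repartition triggers a full shuffle, while coalesce only merges existing partitions):

// repartition(2) shuffles the data into exactly two partitions, giving two output part files
df.repartition(2).write.format("com.databricks.spark.avro").save("/tmp/avroout")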

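As for avoiding the manual schema declaration: spark-csv can take column names from a header line and infer column types. A minimal sketch, assuming your input file has a header row (inferSchema costs an extra pass over the data):

// 'header' reads column names from the first line; 'inferSchema' infers the column types
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/file.txt")
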
Hope this helps!