0 votes

I would like to convert a CSV file to Parquet using spark-csv. Reading the file and saving it as a Dataset works, but unfortunately I can't write it back out as a Parquet file. Is there any way to achieve this?

// Local SparkSession with a Windows warehouse directory
SparkSession spark = SparkSession.builder()
        .appName("Java Spark SQL basic example")
        .config("spark.master", "local")
        .config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
        .getOrCreate();

// Read the CSV with header and schema inference via spark-csv
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("sample.csv");

// Write the Dataset back out as Parquet
df.write().parquet("test.parquet");

17/04/11 09:57:32 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.parquet.column.ParquetProperties.builder()Lorg/apache/parquet/column/ParquetProperties$Builder;
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:350)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:145)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Which version of Spark are you using? – Sandeep Singh
Can you try df.show() and see if that works? – Sanchit Grover
Spark 2.1.0, built for Hadoop 2.7.3, and all pom dependency versions are aligned to 2.1.0. – br0ken.pipe
Yes, df.show() shows me the top 20 records. – br0ken.pipe
Try df.printSchema(). If you get the schema, then try df.write().parquet(...) again. If you are not getting a schema, you need to register your df as a temporary table by supplying a name. – Sandeep Singh
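
A minimal sketch of the check suggested in the last comment, assuming the spark session and df from the question; the view name "sample" and the second output path are just illustrative:

// Verify that schema inference actually produced a schema
df.printSchema();

// If the schema looks right, the direct write should be all that is needed
df.write().parquet("test.parquet");

// Otherwise, expose the data as a temporary view and write from a SQL query
df.createOrReplaceTempView("sample");
spark.sql("SELECT * FROM sample").write().parquet("test_from_sql.parquet");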

1 Answer

1 vote

I fixed it with a workaround: I had to comment out these two Parquet dependencies in the pom, though I'm not really sure why they get in each other's way:

<!--        <dependency> -->
<!--            <groupId>org.apache.parquet</groupId> -->
<!--            <artifactId>parquet-hadoop</artifactId> -->
<!--            <version>1.9.0</version> -->
<!--        </dependency> -->


<!--        <dependency> -->
<!--            <groupId>org.apache.parquet</groupId> -->
<!--            <artifactId>parquet-common</artifactId> -->
<!--            <version>1.9.0</version> -->
<!--        </dependency> -->