0 votes

I would like to convert a CSV file to Parquet using spark-csv. Reading the file and saving it as a Dataset works, but unfortunately I can't write it back out as a Parquet file. Is there any way to achieve this?

// Local SparkSession with a Windows warehouse directory
SparkSession spark = SparkSession.builder()
        .appName("Java Spark SQL basic example")
        .config("spark.master", "local")
        .config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
        .getOrCreate();

// Read the CSV with header and schema inference via spark-csv
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("sample.csv");

// Write the Dataset back out as Parquet
df.write().parquet("test.parquet");

17/04/11 09:57:32 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.parquet.column.ParquetProperties.builder()Lorg/apache/parquet/column/ParquetProperties$Builder;
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:350)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:145)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Which version of Spark are you using? – Sandeep Singh
Can you try df.show() and see if that works? – Sanchit Grover
Spark 2.1.0, built for Hadoop 2.7.3, and all pom dependency versions are aligned to 2.1.0. – br0ken.pipe
Yes, df.show() shows me the top 20 records. – br0ken.pipe
Try df.printSchema(). If you get the schema, then try df.write().parquet(...) again. If you are not getting a schema, you need to register your df as a temporary table by supplying a name. – Sandeep Singh
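
A minimal sketch of the check suggested in the last comment, assuming the spark session and df from the question; the view name "sample" and the second output path are just illustrative:

// Verify that schema inference actually produced a schema
df.printSchema();

// If the schema looks right, the direct write should be all that is needed
df.write().parquet("test.parquet");

// Otherwise, expose the data as a temporary view and write from a SQL query
df.createOrReplaceTempView("sample");
spark.sql("SELECT * FROM sample").write().parquet("test_from_sql.parquet");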

1 Answer

1 vote

I fixed it with a workaround: I had to comment out these two Parquet dependencies in the pom, though I'm not really sure why they get in each other's way:

<!--        <dependency> -->
<!--            <groupId>org.apache.parquet</groupId> -->
<!--            <artifactId>parquet-hadoop</artifactId> -->
<!--            <version>1.9.0</version> -->
<!--        </dependency> -->


<!--        <dependency> -->
<!--            <groupId>org.apache.parquet</groupId> -->
<!--            <artifactId>parquet-common</artifactId> -->
<!--            <version>1.9.0</version> -->
<!--        </dependency> -->