I would like to convert CSV to Parquet using spark-csv. Reading the file and loading it into a Dataset works. Unfortunately, I can't write it back out as a Parquet file. Is there any way to achieve this?
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
        .config("spark.master", "local").config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
        .getOrCreate();
Dataset<Row> df = spark.read().format("com.databricks.spark.csv").option("inferSchema", "true")
        .option("header", "true").load("sample.csv");
df.write().parquet("test.parquet");
17/04/11 09:57:32 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.parquet.column.ParquetProperties.builder()Lorg/apache/parquet/column/ParquetProperties$Builder;
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:350)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:145)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Check df.printSchema(). If you get the schema, then try df.write.parquet. If you do not get the schema, then you need to register your df as a temporary table by supplying a schema name. – Sandeep Singh
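A NoSuchMethodError like the one in the stack trace usually means the class was found at runtime, but in a different version than the calling code was compiled against. Assuming that is what is happening here (the post does not confirm it), one way to check is to print which jar ParquetProperties is actually loaded from and whether that version has a builder() method. The sketch below uses only JDK reflection; ClassOrigin, locate and hasMethod are hypothetical names chosen for illustration.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Spark or Parquet.
final class ClassOrigin {

    // Loads the class by name without running its static initializers.
    private static Class<?> load(String className) throws ClassNotFoundException {
        return Class.forName(className, false, ClassOrigin.class.getClassLoader());
    }

    // Returns the jar or directory the class was loaded from, or a short note
    // when the class is missing or its source cannot be determined.
    static String locate(String className) {
        try {
            CodeSource src = load(className).getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    // Returns true if the class exposes a public method with the given name.
    static boolean hasMethod(String className, String methodName) {
        try {
            for (Method m : load(className).getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class and method names are taken from the stack trace in the question.
        String cls = "org.apache.parquet.column.ParquetProperties";
        System.out.println(cls + " loaded from: " + locate(cls));
        System.out.println("has builder(): " + hasMethod(cls, "builder"));
    }
}
```

Run it with the same classpath as the Spark job. If the reported jar is not the Parquet version your Spark distribution ships with, or builder() is reported as missing, a conflicting Parquet dependency is a likely suspect.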