I am trying to read a Parquet file from Azure Data Lake using the following PySpark code:

df = sqlContext.read.format("parquet") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("adl://xyz/abc.parquet")
df = df['Id', 'IsDeleted']

Now I would like to load this DataFrame df as a table in SQL Data Warehouse using the following code:

df.write \
  .format("com.databricks.spark.sqldw") \
  .mode('overwrite') \
  .option("url", sqlDwUrlSmall) \
  .option("forward_spark_azure_storage_credentials", "true") \
  .option("dbtable", "test111") \
  .option("tempdir", tempDir) \
  .save()

This creates a table dbo.test111 in the SQL Data Warehouse with the following data types:

  • Id (nvarchar(256), null)
  • IsDeleted (bit, null)

But I need these columns with different data types, say char(255) and varchar(128), in SQL Data Warehouse. How do I do this while loading the DataFrame into SQL Data Warehouse?

2 Answers

I found a way that can help you modify the column data type, but it may not fully achieve what you want.

from pyspark.sql.functions import col

# DataType can be a pyspark.sql.types type (e.g. StringType()) or a type string like "string"
df.select(col("colname").cast(DataType))

Here is a blog post about how to change column types in Spark SQL's DataFrame.
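For example, applied to the two columns in the question, a minimal sketch (df, Id and IsDeleted come from the question; the string target type is just an illustration):

from pyspark.sql.functions import col

# Cast both columns before writing to SQL Data Warehouse;
# replace "string" with whatever Spark type you need.
df = df.select(
    col("Id").cast("string").alias("Id"),
    col("IsDeleted").cast("string").alias("IsDeleted")
)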

Maybe this can help you.

The data types supported by Spark SQL are listed here: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/types/package-summary.html

String types will in fact be turned into VARCHAR with unspecified length. Spark SQL does not have a VARCHAR(n) data type.

You should be able to do something like the following:

import org.apache.spark.sql.types._

// Replace StringType with any supported type you need
val dfMod = df.withColumn("Id_mod", df("Id").cast(StringType))
              .withColumn("IsDeleted_mod", df("IsDeleted").cast(StringType))
              .drop("Id")
              .drop("IsDeleted")
              .withColumnRenamed("Id_mod", "Id")
              .withColumnRenamed("IsDeleted_mod", "IsDeleted")
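
If you prefer to stay in PySpark (as in the question), a rough equivalent of the same cast/drop/rename approach would be something like this (StringType is only an example; swap in the type you need):

from pyspark.sql.types import StringType
from pyspark.sql.functions import col

# Cast into temporary columns, drop the originals, then rename back
df = (df.withColumn("Id_mod", col("Id").cast(StringType()))
        .withColumn("IsDeleted_mod", col("IsDeleted").cast(StringType()))
        .drop("Id")
        .drop("IsDeleted")
        .withColumnRenamed("Id_mod", "Id")
        .withColumnRenamed("IsDeleted_mod", "IsDeleted"))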