
I'm trying to write a PySpark DataFrame to Snowflake using a function I've written:

def s3_to_snowflake(schema, table):

    # sc, sqlContext, and the credential variables (user, passw, acct)
    # are defined elsewhere in the session
    df = get_dataframe(schema, table, sqlContext)

    username = user
    password = passw
    account = acct

    snowflake_options = {
        "sfURL": account + ".us-east-1.snowflakecomputing.com",
        "sfAccount": account,
        "sfUser": username,
        "sfPassword": password,
        "sfDatabase": "database",
        "sfSchema": schema,
        "sfWarehouse": "demo_wh"
    }

    # S3 credentials for the connector's temporary stage
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "KeyId")
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "AccessKey")

    (
        df
        .write
        .format("net.snowflake.spark.snowflake")
        .mode("overwrite")
        .options(**snowflake_options)
        .option("dbtable", table)
        .option("tempDir", "s3://data-temp-loads/snowflake")
        .save()
    )

    print('Wrote {0} to {1}.'.format(table, schema))
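
For context, the function is invoked once per table, e.g. (hypothetical schema/table names):

    # Hypothetical example call; the schema and table names are made up
    s3_to_snowflake("finance", "credit_transactions")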

This function has worked for all but one of the tables in my data lake. Here is the schema of the table that's failing:

    root
     |-- credit_transaction_id: string (nullable = true)
     |-- credit_deduction_amt: double (nullable = true)
     |-- credit_adjustment_time: timestamp (nullable = true)

The error I'm getting suggests that Snowflake is taking issue with that DoubleType column. I've hit this kind of issue before with Hive when using Avro/ORC file types; usually it's a matter of casting one datatype to another.

Things I've tried:

  • Casting the column (Double to Float, Double to String, and Double to Numeric, this last per the Snowflake docs); the casts looked roughly like the sketch below
  • Rerunning the DDL of the incoming table, trying Float, String, and Numeric types
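
For reference, here's roughly what those casting attempts looked like (a sketch; the DecimalType precision/scale are illustrative, not the exact values I used):

    from pyspark.sql.functions import col
    from pyspark.sql.types import DecimalType

    # Each cast was attempted separately before calling s3_to_snowflake;
    # the DecimalType precision/scale here are illustrative
    df = df.withColumn("credit_deduction_amt", col("credit_deduction_amt").cast("float"))
    df = df.withColumn("credit_deduction_amt", col("credit_deduction_amt").cast("string"))
    df = df.withColumn("credit_deduction_amt", col("credit_deduction_amt").cast(DecimalType(38, 10)))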

One other thing of note: some of the tables I've transferred successfully also have DoubleType columns, so I'm unsure what's different about this one.


1 Answer


After poking around online, it seems to me that this error is being thrown by Spark's vectorized Parquet reader:

https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java

Are the files backing df Parquet? I suspect this is a read error rather than a write error; it might be worth taking a look at what's going on inside get_dataframe.
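
One quick way to test that theory (a sketch; the config key is Spark's standard switch for the vectorized Parquet reader, and the S3 path below is hypothetical): disable the vectorized reader and re-read the table. If the error changes or disappears, the problem is on the read side, not in the Snowflake write.

    # Fall back to the row-at-a-time Parquet reader
    sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")

    # Hypothetical path; substitute whatever get_dataframe actually reads
    df = sqlContext.read.parquet("s3://your-datalake/schema/table")
    df.printSchema()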

Thanks, etduwx