
I am finding it difficult to load Parquet files into Hive tables. I am working on an Amazon EMR cluster and using Spark for data processing, but I need to read the output Parquet files to validate my transformations. I have Parquet files with the following schema:

root
 |-- ATTR_YEAR: long (nullable = true)
 |-- afil: struct (nullable = true)
 |    |-- clm: struct (nullable = true)
 |    |    |-- amb: struct (nullable = true)
 |    |    |    |-- L: string (nullable = true)
 |    |    |    |-- cdTransRsn: string (nullable = true)
 |    |    |    |-- dist: struct (nullable = true)
 |    |    |    |    |-- T: string (nullable = true)
 |    |    |    |    |-- content: double (nullable = true)
 |    |    |    |-- dscStrchPurp: string (nullable = true)
 |    |    |-- amt: struct (nullable = true)
 |    |    |    |-- L: string (nullable = true)
 |    |    |    |-- T: string (nullable = true)
 |    |    |    |-- content: double (nullable = true)
 |    |    |-- amtTotChrg: double (nullable = true)
 |    |    |-- cdAccState: string (nullable = true)
 |    |    |-- cdCause: string (nullable = true)

How can I create a Hive external table with this schema and load the Parquet files into it for analysis?


1 Answer


You can use Catalog.createExternalTable (Spark before 2.2) or Catalog.createTable (Spark 2.2 and later).

The Catalog instance can be accessed through the SparkSession:

val spark: SparkSession
spark.catalog.createTable(...)
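
For example, a minimal sketch (the table name and S3 path below are placeholders, not values from your job) that registers the existing Parquet output as an external table backed by those files:

// Spark 2.2+: passing a "path" option makes the table external, backed by
// the Parquet files already on S3; the schema is inferred from the files,
// so the nested struct fields (afil.clm.*, etc.) are preserved.
spark.catalog.createTable(
  "claims",                                          // hypothetical table name
  "parquet",
  Map("path" -> "s3://my-bucket/output/parquet/")    // hypothetical output location
)

// The table can then be queried with Spark SQL (or from Hive) for validation.
spark.sql("SELECT ATTR_YEAR, afil.clm.amtTotChrg FROM claims LIMIT 10").show()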

The session should be initialized with Hive support enabled.
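
For instance, a session built like this (a sketch; the application name is arbitrary) registers the table in the Hive metastore rather than in a session-local catalog:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-validation")   // arbitrary application name
  .enableHiveSupport()             // required so createTable writes to the Hive metastore
  .getOrCreate()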