I am new to PySpark and am trying to read HDFS files (which have Hive tables created on top of them) into PySpark dataframes. Reading the Hive tables directly through PySpark is time consuming. Is there a way to get the Hive column names dynamically (to use as the schema for the dataframe)?
I am looking to pass the file location, table name, and database name as inputs to a program/function that fetches the schema/column names from the Hive metadata (probably the metastore) and returns a dataframe, something like the sketch below.
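This is roughly what I have in mind (a rough sketch, not a working solution: it assumes a Hive-enabled SparkSession, uses `spark.catalog.listColumns` for the metastore lookup, assumes the underlying files are CSV, and all paths/names are placeholders):

```python
from pyspark.sql import SparkSession

# Assumes Hive support is available; app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("hive-schema-reader")
    .enableHiveSupport()
    .getOrCreate()
)

def read_with_hive_schema(file_location, table_name, db_name):
    """Look up column names/types in the Hive metastore and use them
    as the schema when reading the raw HDFS files."""
    # Each entry has .name, .dataType (e.g. "int", "string") and .isPartition
    cols = spark.catalog.listColumns(table_name, db_name)
    # Partition columns live in directory names, not in the data files,
    # so skip them when building the file schema (DDL string form).
    ddl = ", ".join(
        f"{c.name} {c.dataType}" for c in cols if not c.isPartition
    )
    # Assuming CSV files here; format/options would change per file type.
    return spark.read.schema(ddl).csv(file_location)

# Placeholder path, table, and database names.
df = read_with_hive_schema("/user/hive/warehouse/mydb.db/mytable",
                           "mytable", "mydb")
df.printSchema()
```

Is this the right approach, or is there a better way to pull the schema from the Hive metadata?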
Please advise.