
I have a table in Hive called test with columns id and name.

I have another table in Hive called mysql with columns id, name, and city.

Now I want to compare the schemas of both tables and add the missing columns to the Hive table test.

hive_df = sqlContext.table("testing.test")

mysql_df = sqlContext.table("testing.mysql")

hive_df.dtypes

[('id', 'int'), ('name', 'string')]

mysql_df.dtypes

[('id', 'int'), ('name', 'string'), ('city', 'string')]

hive_dtypes = hive_df.dtypes

mysql_dtypes = mysql_df.dtypes

diff = set(mysql_dtypes) ^ set(hive_dtypes)

diff

set([('city', 'string')])

for col_name, col_type in diff:
    sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
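
Side note: ^ is a symmetric difference, so a column that exists only in test would also land in diff, and the loop would then try to re-add it to test. Since the goal is only to add the columns that mysql has and test lacks, a one-sided set difference seems safer; a minimal sketch using the same data frames:

# Columns present in testing.mysql but missing from testing.test.
missing = set(mysql_dtypes) - set(hive_dtypes)

for col_name, col_type in missing:
    # Hive appends the new column at the end; existing rows read as NULL.
    sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))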

After doing all this, the Hive table test has the new column city added, populated with NULL values, as expected.

Now when I close the Spark session, open a new Spark session, and do

hive_df = sqlContext.table("testing.test")

and then

hive_df

I expect to get

DataFrame[id: int, name: string, city: string]

But I get this

DataFrame[id: int, name: string]

When I run desc test in the Hive shell, the new column is there:

hive> desc test;
OK
id                      int
name                    string
city                    string

Why is the schema change not reflecting in the Pyspark dataframe after we alter the corresponding hive table?

FYI, I am using Spark 1.6.

Is it a partitioned table? - David דודו Markovitz
Have you reloaded the metadata? - zero323
@zero323 yes I have done refresh table - User12345
@DuduMarkovitz No it is not a partitioned table - User12345
How is that Hive table stored? Is that a columnar format with the schema stored in each file -- such as ORC or Parquet? (for Parquet you should read carefully spark.apache.org/docs/1.6.3/…) - Samson Scharfrichter

1 Answer


Looks like there is a Jira for this issue, https://issues.apache.org/jira/browse/SPARK-9764, which was fixed in Spark 2.0.
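
For completeness, the first thing to try on Spark 1.6 is refreshing the table's cached metadata, though per the comments above that alone did not help here. A minimal sketch, assuming sqlContext is a HiveContext:

# Invalidate the cached metadata for the table, then re-read it.
sqlContext.refreshTable("testing.test")
hive_df = sqlContext.table("testing.test")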

If refreshing does not help, for those on Spark 1.6, try creating the table through sqlContext instead.

First register the data frame as a temp table, and then do

sqlContext.sql("create table my_table as select * from temptable")

This way, after you alter the Hive table and recreate the Spark data frame, the data frame will pick up the newly added columns as well.
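
A minimal end-to-end sketch of that workaround on Spark 1.6; the table name testing.test_new and the temp table name are placeholders:

# Register the existing data frame as a temporary table (Spark 1.6 API).
hive_df.registerTempTable("temptable")

# Create the Hive table through sqlContext; tables created this way appear
# to keep their schema in the Hive metastore only, so later Hive-side
# ALTERs are visible to Spark.
sqlContext.sql("create table testing.test_new as select * from temptable")

# After ALTER TABLE testing.test_new ADD COLUMNS (city string) in Hive,
# a fresh read picks up the new column.
sqlContext.table("testing.test_new").printSchema()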

This issue was resolved with the help of @zero323.