
I am simply trying to create a table in Hive that is stored as Parquet, transform the CSV file that holds the data into a Parquet file, and then load it into the HDFS directory so the values are inserted. Below is the sequence of steps I am running, but to no avail:

First I created a table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS db1.managed_table55 (dummy string)
STORED AS PARQUET
LOCATION '/hadoop/db1/managed_table55';
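
For reference, the definition Hive recorded for the table can be checked from the same Spark session (a quick sketch, assuming the session was built with Hive support so the metastore is visible):

from pyspark.sql import SparkSession

# Build a session with Hive support so db1.managed_table55 is visible via the metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Shows the columns Hive expects plus the storage location it recorded for the table
spark.sql("DESCRIBE FORMATTED db1.managed_table55").show(truncate=False)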

Then I loaded a Parquet file into the above HDFS location using this Spark code:

df=spark.read.csv("/user/use_this.csv", header='true')
df.write.save('/hadoop/db1/managed_table55/test.parquet', format="parquet")
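
One way to rule out the Parquet file itself is to read it straight back with Spark right after the write (a minimal check using the same path as above):

# Read back the file that was just written; if the values show up here but come
# back as null from Hive, the issue is in the table definition, not the data file
check = spark.read.parquet('/hadoop/db1/managed_table55/test.parquet')
check.printSchema()
check.show(5)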

It loads, but the output is all null values (screenshot: Hive query result showing NULLs).

Here are the original values in the use_this.csv file that I transformed into a Parquet file (screenshot: contents of use_this.csv).

This shows that the table's folder (managed_table55) and the file (test.parquet) were created at the specified location (screenshots: HDFS directory listing).

Any ideas or suggestions as to why this keeps happening? I know it is probably a minor tweak, but I can't seem to identify it.


1 Answer


Since you are writing the Parquet file to /hadoop/db1/managed_table55/test.parquet, try creating the table at that same location and then reading the data through the Hive table.

Create Hive Table:

hive> CREATE EXTERNAL TABLE IF NOT EXISTS db1.managed_table55 (dummy string)
STORED AS PARQUET
LOCATION '/hadoop/db1/managed_table55/test.parquet';

PySpark:

df=spark.read.csv("/user/use_this.csv", header='true')
df.write.save('/hadoop/db1/managed_table55/test.parquet', format="parquet")
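
Once the table location matches the path the file is written to, the data should be visible through the table as well; a quick sanity check from the same session (assuming Hive support is enabled) might look like this:

# Query through the Hive table rather than the raw path to confirm the mapping works
spark.sql("SELECT * FROM db1.managed_table55 LIMIT 5").show()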