
I have read other questions and I am still confused about the options. I want to read an Athena view in Spark on EMR. From searching on Google/Stack Overflow, I understood that these views are somehow stored in S3, so I first tried to find the external location of the view through

DESCRIBE mydb.myview

It provides the schema but not the external location, from which I concluded that I cannot read the view as a DataFrame directly from S3.

What I have considered so far for reading an Athena view in Spark

I have considered the following options:

  1. Make a new table out of this Athena view using a CTAS (CREATE TABLE AS SELECT) statement with PARQUET as the external format (a sketch for reading the output back from S3 follows the question):

    CREATE TABLE Temporary_tbl_from_view
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-bucket/views_to_parquet/'
    ) AS (
        SELECT * FROM "mydb"."myview"
    );

  2. Another option is based on this answer, which suggests:

When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SqlContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.

but I am not sure how I can query this view (not a table) in PySpark. If Athena tables/views are available through the Glue catalog in the Spark context, will a simple statement like the following work?

sqlContext.sql("SELECT * FROM mydb.myview")
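
For reference, a minimal PySpark sketch of how I expect option 2 to look, assuming the cluster was created with the Glue Data Catalog option enabled (the app name is only illustrative):

    from pyspark.sql import SparkSession

    # Assumes an EMR cluster (v5.8.0+) created with the Glue Data Catalog
    # option, so Spark SQL's metastore is backed by Glue and the Athena
    # databases/views become visible.
    spark = SparkSession.builder \
        .appName("query-athena-view") \
        .enableHiveSupport() \
        .getOrCreate()

    # Fully qualified as database.view, using the names from the question.
    df = spark.sql("SELECT * FROM mydb.myview")
    df.show(5, truncate=False)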

Question: what is the more efficient way to read this view in Spark? Does recreating a table using the CTAS statement (with an external location) mean that I am storing this data in the Glue catalog or S3 twice? If yes, how can I read the view directly through S3 or the Glue catalog?
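
On the S3 side of option 1: once the CTAS query finishes, the Parquet output should be readable straight from the external location. A minimal sketch, assuming the bucket path used above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-ctas-output").getOrCreate()

    # Read the Parquet files written by the CTAS statement directly from S3;
    # the path is the external_location from option 1.
    df = spark.read.parquet("s3://my-bucket/views_to_parquet/")
    df.printSchema()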

@A.B were you able to solve this? – Alfredo Lozano
@AlfredoLozano I have posted my solution as an answer, hope it helps. Sorry for the delay. – A.B

1 Answer


Just to share the solution I followed with others: I created my cluster with the following option enabled:

Use AWS Glue Data Catalog for table metadata

Afterwards, I saw the database name from AWS Glue and was able to find the desired view under tableName, as below:

spark.sql("use my_db_name")
spark.sql("show tables").show(truncate=False)
+------------+---------------------------+-----------+
|database    |tableName                  |isTemporary|
+------------+---------------------------+-----------+
|  my_db_name|table1                     |false      |
|  my_db_name|desired_table              |false      |
|  my_db_name|table3                     |false      |
+------------+---------------------------+-----------+
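
Once the view shows up there, it can be queried like any ordinary table. A minimal sketch, reusing the same spark session and the names from the output above:

    # The view is listed in the Glue Data Catalog like a table, so a plain
    # SELECT through Spark SQL works.
    df = spark.sql("SELECT * FROM my_db_name.desired_table")
    df.show(truncate=False)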