Executing spark sql in aws glue returns the column name in the queries rather than values

Question

running spark sql in aws glue returns the column name in the queries

data:

product,price,quantityinKG

mango,100,1

apple,200,3

peach,200,2

mango,200,2

My Test Query

         eg : select product,sum(price)

              from myDataSource

              group by product

The output of the query should be

            product, sum(price)

            mango, 300

but output is :

            product, "sum(price)"

            mango,

There is nothing in the sum(price) column it only has the product nane , please can you help me with this behaviour of glue

How did you get this data? What type of source was it? If using a crawler perhaps the column names were grabbed as part of the data, which can happen if your settings don't align to your source data file. — jonlegend
I crawled the data from s3 bucket actually using a glue crawler. have you worked with AWS GLUE — bigDataArtist
I'm very familiar with glue. What type of data are you crawling? CSV? Can you provide some example data? Are you using a classifier on your crawler? If so, what are the settings on it? — jonlegend
@jonlegend I have mentioned the dataset in the first line of my question. Yes I am using an inbuilt csv classifier. I didn't get the last part . what are the setting on ? Please can you elaborate the last part — bigDataArtist
Does your data have "data:" as the first line as in the code block above? — jonlegend

Mahesh Nema Mahesh Nema · Accepted Answer · 2021-06-29T13:42:56

First of all, create or replace a local temporary view with your dataframe and then use the sql query

data.createOrReplaceTempView('data_table')
spark.sql("select product, sum(price) as sum_price from data_table group by product").show()

If you are using the glue dynamic dataframe, then first convert it into a spark dataframe using toDF() function before creating the temp view.