0
votes

running spark sql in aws glue returns the column name in the queries

data:

product,price,quantityinKG

mango,100,1

apple,200,3

peach,200,2

mango,200,2

My Test Query

         eg : select product,sum(price)

              from myDataSource

              group by product

The output of the query should be

            product, sum(price)

            mango, 300

but output is :

            product, "sum(price)"

            mango,

There is nothing in the sum(price) column it only has the product nane , please can you help me with this behaviour of glue

1
How did you get this data? What type of source was it? If using a crawler perhaps the column names were grabbed as part of the data, which can happen if your settings don't align to your source data file. - jonlegend
I crawled the data from s3 bucket actually using a glue crawler. have you worked with AWS GLUE - bigDataArtist
I'm very familiar with glue. What type of data are you crawling? CSV? Can you provide some example data? Are you using a classifier on your crawler? If so, what are the settings on it? - jonlegend
@jonlegend I have mentioned the dataset in the first line of my question. Yes I am using an inbuilt csv classifier. I didn't get the last part . what are the setting on ? Please can you elaborate the last part - bigDataArtist
Does your data have "data:" as the first line as in the code block above? - jonlegend

1 Answers

0
votes

First of all, create or replace a local temporary view with your dataframe and then use the sql query

data.createOrReplaceTempView('data_table')
spark.sql("select product, sum(price) as sum_price from data_table group by product").show()

If you are using the glue dynamic dataframe, then first convert it into a spark dataframe using toDF() function before creating the temp view.