I have data being written into AWS S3 through Kafka, so the same message can be written more than once. I'm using Redshift Spectrum to query it.
Since the data is nested, I've run into some of the limitations explained in the documentation, and I can't find a way to deduplicate it.
I first attempted to get rid of the duplicates with a GROUP BY in a subquery and then do my usual aggregation on top of it:
with rid_of_duplicates as (
    select table.field_1, table.struct_1.field_2, table.struct_1.field_3
    from table
    group by 1, 2, 3
)
select field_1 || field_2, count(field_3)
from rid_of_duplicates
group by field_1, field_2
But I get the following error:
ERROR: Spectrum nested query error
Detail:
-----------------------------------------------
error:    Spectrum nested query error
code:     8001
context:  Items in the FROM clause of aggregation subqueries have to refer to nested tables of higher level FROM clauses.
query:    0
location: nested_query_rewriter.cpp:814
process:  padbmaster [pid=6630]
-----------------------------------------------
If I get rid of the GROUP BY in the subquery, the query runs fine, but its output is wrong because the duplicates are never removed (roughly the variant sketched below).
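For reference, this is roughly the variant that runs without the error, using the same placeholder table and field names as above:
-- Sketch only: same placeholder names as the query above.
-- Without the GROUP BY, Spectrum accepts the subquery,
-- but the duplicate rows flow straight into the aggregation.
with rid_of_duplicates as (
    select table.field_1, table.struct_1.field_2, table.struct_1.field_3
    from table
)
select field_1 || field_2, count(field_3)
from rid_of_duplicates
group by field_1, field_2;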
Reading the docs, it seems that I'm not supposed to select specific fields in the subquery. But since the data is nested, I can't use SELECT * either.
I've also attempted creating a view with no schema binding, but that failed as well, and SELECT DISTINCT doesn't work either (roughly as sketched below).
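These are roughly the other two attempts, again with the placeholder names from above (the exact statements I ran may have differed slightly):
-- Late-binding view attempt (sketch): this failed as well.
create view dedup_view as
    select table.field_1, table.struct_1.field_2, table.struct_1.field_3
    from table
    group by 1, 2, 3
with no schema binding;

-- SELECT DISTINCT attempt (sketch): this doesn't work either.
select distinct table.field_1, table.struct_1.field_2, table.struct_1.field_3
from table;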
If possible, I'd rather not create a physical table for this, since the data is stored in S3. Also, count(distinct field_3) isn't an option because of the business logic.