0
votes

I'm trying to work a ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) on a bigquery script and keep running into the 'exceeded resources' error.

This Table's size is 219.96 GB and Number of rows 1,611,220,127

here's the script:

With cte as (
SELECT 
    Source, 
    ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip ORDER BY Attom_ID DESC) as rnk
    ,FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude
    from `db.ds.tblA`
) select * from cte where rnk = 1

While this is one table, it's a result of a join, where all of the columns PRIOR to ATTOM_ID are from one table and ATTOM_ID, GEOID, and on comes from the second. I believe there's some cartesian-ing in the result set.

There are a variety of duplicates in the table, and I'm trying to deduplicate the table. I was afraid of trying GROUP_BY with MAX(ATTOM_ID), because I wanted to make sure I go the correctly associated GEOID and SCORES to come with. I don't want to mix bag it.

The problem is that this particular query is exceeding resources, so I'd like to know if there's any options I have here. Thanks!

1
Can you show us the schema of the original tables and some example rows and some example result rows? I think for optimizing this it's better to start out with the original data (structure) instead of the joined result that is already too big.Corion
@Corion - I think the original stuff is in stackoverflow.com/q/52724840/5221944Mikhail Berlyant

1 Answers

2
votes

Below is equivalent of your original query (result-wise) and usually addresses 'exceeded resources' issue

#standardSQL
SELECT r.* FROM (
  SELECT 
    ARRAY_AGG(STRUCT(Source,FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude) ORDER BY Attom_ID DESC LIMIT 1)[OFFSET(0)]
  FROM `db.ds.tblA`
  GROUP BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip
)