I'm trying to work a ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) on a bigquery script and keep running into the 'exceeded resources' error.
This Table's size is 219.96 GB and Number of rows 1,611,220,127
here's the script:
With cte as (
SELECT
Source,
ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip ORDER BY Attom_ID DESC) as rnk
,FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude
from `db.ds.tblA`
) select * from cte where rnk = 1
While this is one table, it's a result of a join, where all of the columns PRIOR to ATTOM_ID are from one table and ATTOM_ID, GEOID, and on comes from the second. I believe there's some cartesian-ing in the result set.
There are a variety of duplicates in the table, and I'm trying to deduplicate the table. I was afraid of trying GROUP_BY with MAX(ATTOM_ID), because I wanted to make sure I go the correctly associated GEOID and SCORES to come with. I don't want to mix bag it.
The problem is that this particular query is exceeding resources, so I'd like to know if there's any options I have here. Thanks!