1
votes

I am working with a data set that has some duplicate rows. The rows are not straight duplicates, but have a time stamp less than a second apart. I'd like to remove these duplicates, but the question is how.

My current plan is to add two new columns, which are copies of the time stamp column but one has a second added to it and the other has a second removed from it. I can then add steps to remove rows which have all other values the same, but have the same time stamp as time stamp plus one or minus one. Doing one after the other should eliminate duplicates but not remove truly unique rows.

How can I accomplish this in Power Query?

1
So "duplicate" rows are going to be less than a second apart. How far apart are "distinct" rows going to be? Is there a lower bound?Alexis Olson
Can you show an example? In power query you can select with Ctrl each column that you consider and then remove duplicates on base of those columns.virtualdvid

1 Answers

3
votes

I think your "current plan" approach is good - I would apply that in a separate Query, started "By Reference" to the original - I'd call it something like Non-duplicated time stamps.

I would duplicate the original time stamp column and then add the new +/- 1 minute columns. I would use Unpivot Only Selected Columns on the 3 added time stamp columns to convert them from columns to rows. Then I would select the generated Value column and apply Keep Duplicates. That will keep just the first row of any duplicates found amongst the 3 time stamps.

Then back in the original query, I would add a Merge Queries step to connect it to the Non-duplicated time stamps query. I would match on the original time stamp column, possibly on other columns if required. The Join Kind would be Left Anti (rows only in first). That should remove your duplicates.