How to remove duplicates from individual powerquery columns without removing entire rows

Question

I have a data table that records cost savings data and I have 1 row per project. This has overall project type data such as annual spend, annual savings, etc. but also has the months the savings fall into. To pivot on this data, I converted it to a table with PowerQuery but some columns repeat such as annual spend for each month where there are savings so I might get 10 rows for savings which is correct, but the annual spend is duplicated 10 times. Can I remove duplicates in just those columns retaining the other data.

I have searched and tried various solutions but haven't found one that works. I am not set on data table format, so am open to anything.

Below is a sample of the data

Sample of PowerQuery

As you will see, Baseline Spend, Negotiated Spend, Savings Amount are all shown for each row and I need to use these in a pivot/slicer.

Any help would be appreciated.

Regards,

Keith

Hard to experiment with just a screenshot, or an idea of what you want the results to look like, but maybe you could aggregate the savings and delete the months columns. — Ron Rosenfeld

chillin chillin · Accepted Answer · 2019-06-09T17:20:28

I think one solution might be to "only keep the first¹ annual spend per project". More abstractly, "only keep the first value in column(s) X per column(s)Y".

Below is some mock/dummy data. I only want to keep the highlighted values in my annual spend column (as the highlighted values are the first "annual spend" figures per "project").

This is the M code I'm using to achieve this. (To try it, open the Query Editor > Advanced Editor (near top right) > copy-paste code below to there > OK).

let
    OnlyKeepFirstValueInColumn = (someTable as table, columnsToNullify as list) as table =>
        let
            firstRow = Table.FirstN(someTable, 1), // This assumes first row contains a non-blank value.
            remainingRows = Table.Skip(someTable, 1),
            loopAndNullify = List.Accumulate(columnsToNullify, remainingRows, (tableState, currentHeader) => Table.TransformColumns(tableState, {{currentHeader, each null}})),
            combined = firstRow & loopAndNullify
        in combined,
    FirstValueOfColumnsPerGroup = (someTable as table, groupByColumns as list, columnsToNullify as list) =>
        let
            group = Table.Group(someTable, groupByColumns, {{"toCombine", each OnlyKeepFirstValueInColumn(_, columnsToNullify), type table}}),
            combined = Table.Combine(group[toCombine])
        in combined,
    aggregatedTable = Table.FromColumns({Text.ToList("aaabbbccccdddeeefg"), List.Repeat({1000}, Text.Length("aaabbbccccdddeeefg"))}, type table [project=text, annual spend=number]),
    transformed = FirstValueOfColumnsPerGroup(aggregatedTable, {"project"}, {"annual spend"})
in
    transformed

The important bit to understand is this line:

transformed = FirstValueOfColumnsPerGroup(aggregatedTable, {"project"}, {"annual spend"})

in which you should replace:

aggregatedTable with whatever variable/expression contains your table
{"project"} with the name of your "project" column (keep the curly braces {} though as they let you pass in several columns if needed)
{"annual spend"} with the names of whichever column(s) you want to keep only the first value in (keep the curly braces {})

This is what I get (which I think is similar to what you want):

¹To keep things simple, we'll say "first" here means the value in the first row. It could have meant "first non-null value" or "first value satisfying some particular condition or logic", but your data suggests the simpler definition will work okay.

How to remove duplicates from individual powerquery columns without removing entire rows

1 Answers