1
votes

I have a table where I want to filter down to only those rows where the group that row belongs to is found to have only a single unique value. For example, if I am grouping the following data by the index column, I want to count the unique values of foo in each group and then drop any rows where there is more than just one value of foo

Example Input:

   index  foo
0      A   10
1      A   20
2      A   30
3      B   10
4      B   20
5      B   30
6      C   10
7      C   20
8      C   30
9      D   10
10     D   10
11     D   10
12     E   10
13     E   20
14     E   30
15     F   10
16     F   10
17     F   10

Expected Output:

  index  foo
0     D   10
1     D   10
2     D   10
3     F   10
4     F   10
5     F   10

I am able to do this in Python but am having trouble implementing the same solution in DAX (Power BI).

This is my Python approach:

import pandas as pd

data = pd.DataFrame({'index': {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "C",
                               7: "C", 8: "C", 9: "D", 10: "D", 11: "D", 12: "E",
                               13: "E", 14: "E", 15: "F", 16: "F", 17: "F"},
                     'foo': {0: 10, 1: 20, 2: 30, 3: 10, 4: 20, 5: 30, 6: 10,
                             7: 20, 8: 30, 9: 10, 10: 10, 11: 10, 12: 10,
                             13: 20, 14: 30, 15: 10, 16: 10, 17: 10}})


# count distinct values of "foo"
foo_distinct = data.groupby('index').agg(lambda x: len(set(x)))

# add the distinct value counts of foo to main table
joined_data = data.set_index('index').join(foo_distinct, rsuffix='_distinctgroupcount')
        
# filter out any values where there is more than a single distinct value
filtered_data = joined_data.loc[lambda df: df['foo_distinctgroupcount'] == 1])

And my attempt in DAX. The calculation of foo_distinct works as expected but I am having trouble getting the distinct group counts merged into the original data set (the attempt to get joined_data does not work).

foo_distinct = SUMMARIZE(data, data[index], "foo_distinctgroupcount", DISTINCTCOUNT(data[foo]))
            
joined_data = CALCULATETABLE(NATURALLEFTOUTERJOIN(data, foo_distinct), USERELATIONSHIP(data[index], foo_distinct[index]))

My questions:

  • Is this problem best solved by groupby, distinctcount, and joining those values back in to the original data or is there a more efficient/elegant approach?
  • I believe I see that NATURALLEFTOUTERJOIN can't be done since data and foo_distinct do not share a relationship (but they should since they are derived from the same source; however this function does not accept a USERELATIONSHIP call as far as I can tell. Is there a function that is more flexible for handling this? For example SQL would have the ON keyword available.
2
The key here is groupby + transform. Transform broadcasts the aggregated result back to every row that group belonged to in the original DataFrame, making it an easy and efficient way to filter entire groups from your DataFrame based on an aggregated result.ALollz

2 Answers

3
votes

you can try groupby with transform (consider a distinct count over a window partition by index) and nunique , then filter in values where the distinct count is 1:

df[df.groupby("index")['foo'].transform('nunique')==1]

   index  foo
9      D   10
10     D   10
11     D   10
15     F   10
16     F   10
17     F   10

With DAX you can try DISTINCTCOUNT with EARLIER over index and then use a FILTER maybe to filter DISTINCTCOUNT('index')=1 (have not tried but you can try something like this)

1
votes

If your goal is to create a new table using DAX:

Desired Output = 
VAR foo_distinct =
    FILTER (
        SUMMARIZE ( Data, Data[index], "Foo Count", DISTINCTCOUNT ( Data[foo] ) ),
        [Foo Count] = 1
    )
RETURN CALCULATETABLE ( Data, foo_distinct )

Result is a new table in the data model:

enter image description here

The way it works:

  • First, filter the summary table to keep only indexes where distinct foo count = 1. This step creates a list of unique indexes.
  • Then, filter the original data by the list of unique indexes.

I have to say that your second measure is very confusing - it involves other tables and relations that are not described in your input and desired output. If your goal is different, please elaborate.