0
votes

I have two tables (Sales,Stock) both have a DateKey and ProductCode. The requirement is to count the Distinct ProductCodes across both tables by Financial week driven by a custom Date Dimension. There is a further requirement to take the Average of the Weekly Distinct counts if multiple weeks are selected not the distinct count for all weeks.

The below DAX works in my unit tests, however on real data volumes (300+ million rows imported) it grinds to a complete standstill. If I select a week it takes about 5 sec, for 12 months its 60 seconds. They need to do YoY and WoW calcs so this perfomance will not be good enough. Not sure how else to achieve this.

Here is the DAX I've written:

Option Count = 
    // Get options at a week level for Sales
    VAR SalesOptions =
        SELECTCOLUMNS (
            Sales,
            "Week",
                CALCULATE (
                    MAX ( Dates[Financial Year Week Number] ),
                    FILTER ( ALL ( Dates ), Sales[DimBilledDateSKey] = Dates[DimDateSKey] )
                ),
            "ProductOptionCode", 'Sales'[ProductOptionCode]
        )
    // Get options at a week level for Stock
    VAR StockOptions =
        SELECTCOLUMNS (
            Stock,
            "Week",
                CALCULATE (
                    MAX ( Dates[Financial Year Week Number] ),
                    FILTER ( ALL ( Dates ), Stock[DimDateSKey] = Dates[DimDateSKey] )
                ),
            "ProductOptionCode", 'Stock'[ProductOptionCode]
        )
    // Union the options from both and then apply a distinct so that a count of the product code would be the same result as a distinct count
    VAR CombinedOptions = DISTINCT ( UNION ( SalesOptions, StockOptions ) )
            
    RETURN
        // Average the weekly disitnct option count
        AVERAGEX( 
            // Group the above union by week and do a count of the options for the week. They should now be distinct due to the distinct applied to the union
            GROUPBY(CombinedOptions, [Week], "Week Option Count", COUNTX(CURRENTGROUP(),  [ProductOptionCode] )),
            [Week Option Count]
        )
1

1 Answers

0
votes

I built a small model with a fake Dates dimension and Stock and Sales fake tables

Dates = 
SELECTCOLUMNS(
    CALENDAR(
        "2020-01-05",
        "2020-12-26"
    ),
    "DimDateSKey", [Date],
    "Financial Year", YEAR( [Date] ),
    "Financial Year Week Number",
        INT( ( [Date] - "2020-01-05" ) / 7 ) + 2
)

Sales = DATATABLE(
    "DimBilledDateSKey", DATETIME,
    "ProductOptionCode", INTEGER,
    {
        {"2020-01-05", 1},
        {"2020-01-06", 1},
        {"2020-01-06", 2},
        {"2020-01-12", 1}
    }    
  )

Stock = DATATABLE(
    "DimDateSKey", DATETIME,
    "ProductOptionCode", INTEGER,
    {
        {"2020-01-09", 1},
        {"2020-01-06", 3},
        {"2020-01-08", 3},
        {"2020-01-13", 2}
    }    
  )

model diagram

With the two relationship over the dates it's easy to filter for the fiscal week, this avoids the need to iterate over the full Date table.

To calculate the average number of the distinct product codes it's enough to iterate over the selected Financial Year Week Number and then compute the COUNTROWS of the DISTINCT of the UNION of the Sales[ProductOptionCode] and the Stock[ProductOptionCode] for the currently iterated week

Option Count = 
AVERAGEX (
    VALUES( Dates[Financial Year Week Number] ),
    CALCULATE (
        COUNTROWS (
            DISTINCT(
                UNION (
                    VALUES( Sales[ProductOptionCode] ),
                    VALUES( Stock[ProductOptionCode] )
                )
            )
        )
    )
)

If the real model doesn't contain the relationship with the DimBilledDateSKey, maybe because there already is another relationship with another dateKey in the Sales table, it's always possible to add an inactive relationship to be activated with USERLATIONSHIP inside the CALCULATE.

It should be faster, since it uses relationships, that are usually the best option from the performance point of view and minimizes the usage of Set functions, that are expensive, since they materialize the tables in memory.