2
votes

I have a cube whose fact table holds a month's worth of data: about 1.5 billion rows. The fact table contains the following columns: { DateKey, UserKey, ActionKey, ClientKey, ActionCount }. It holds one row per user per client per action per day, with the number of activities performed.
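For reference, here is a minimal T-SQL sketch of the fact table as described; the column names come from the question, but the table name dbo.FactUserActivity and the data types are my assumptions:

    -- Hypothetical DDL for the fact table described above.
    CREATE TABLE dbo.FactUserActivity (
        DateKey     int NOT NULL,   -- surrogate key of the day
        UserKey     int NOT NULL,   -- surrogate key of the user
        ActionKey   int NOT NULL,   -- surrogate key of the action
        ClientKey   int NOT NULL,   -- surrogate key of the client
        ActionCount int NOT NULL    -- number of activities at this grain
    );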

Now I want to define the following measures in my cube:

Avg Days Engaged per User:
    Avg([Users].[User Key].[User Key].Members, [Measures].[DATE COUNT])

Users Engaged >= 14 Days:
    Sum([Users].[User Key].[User Key].Members, IIF([Measures].[DATE COUNT] >= 14, 1, 0))

Avg Requests per User:
    IIF([Measures].[USER COUNT] = 0, 0, [Measures].[ACTIVITY COUNT] / [Measures].[USER COUNT])

To support these, I have created two distinct count measures, DATE COUNT and USER COUNT, which are distinct-count aggregations on the DateKey and UserKey columns of the fact table. I now want to partition the measure groups (there are three of them, because each distinct count measure goes into its own measure group).

What is the best strategy to partition the cube? I have read the Analysis Services distinct count optimization guide end to end, and it says that partitioning by non-overlapping user IDs is the best strategy for single-user queries, and that partitioning by user x time is best for queries on a single user over a set of time periods.

What I want to know is which of the following partitioning schemes I should choose (see the sketch after this list for what one partition's source query might look like):

1. 75 partitions (1.5 billion rows / 20 million rows per partition), each holding a non-overlapping, sequential range of user IDs;
2. 31 partitions, one per day, with overlapping user IDs but distinct days in each partition;
3. 31 * 3 = 93 partitions, where I break the cube down per day and then further split each day into 3 equal parts with non-overlapping user IDs within each day (users will still overlap between days); or
4. 45 partitions of unequal size, split by ActionKey, since the measures are sliced by Action most of the time.
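To make option 3 concrete, here is a sketch of the source query a single partition could be bound to. The table name dbo.FactUserActivity, the DateKey value, and the UserKey boundaries are all illustrative assumptions of mine; each of the 93 partitions would get its own day and one of three user-key ranges:

    -- Partition source query: one day, one of three non-overlapping
    -- UserKey ranges for that day (boundaries are illustrative).
    SELECT DateKey, UserKey, ActionKey, ClientKey, ActionCount
    FROM dbo.FactUserActivity
    WHERE DateKey = 20240101        -- this partition's day
      AND UserKey >= 1              -- lower bound of this partition's user range
      AND UserKey < 7000000;        -- upper bound, exclusive

Using the same three UserKey ranges on every day at least preserves, per day, the non-overlap property the distinct count guide recommends for the USER COUNT measure group.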

I'm a bit confused because the paper only talks about optimizing a single distinct count measure, whereas I need distinct counts on both users and dates for my measures.

Any tips?


1 Answer

0
votes

I would first take a step back and try the Many-to-Many dimension count technique to achieve Distinct Count results without the overhead of actual Distinct Count aggregations.

Probably the best explanation of this is the "Distinct Count" section of the "Many to Many Revolution 2.0" paper:

http://www.sqlbi.com/articles/many2many/

Note that Solution C is the one I am referring to.
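As I understand Solution C, the idea is to avoid the expensive distinct count storage mode by counting rows of a much smaller intermediate fact table, reached through a many-to-many relationship. A rough T-SQL sketch of how such an intermediate fact could be built for the user distinct count (table names are mine; follow the paper for the real design):

    -- Intermediate fact at (DateKey, UserKey) grain: it collapses clients
    -- and actions, so it is far smaller than the 1.5 billion row fact.
    SELECT DISTINCT DateKey, UserKey
    INTO dbo.FactUserDays
    FROM dbo.FactUserActivity;

A plain Count measure on this table, exposed through the many-to-many relationship, then behaves like a distinct user count while processing and aggregating as a cheap regular count.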

You will usually find this solution scales much better than a standard Distinct Count measure. For example, I have one cube with 2 billion rows in the biggest fact table (and only 4 partitions) and an "M2M Distinct Count" fact of 9 million rows; performance is great, e.g. 6-7 hours to completely reprocess all the data and less than 5 seconds for most queries. The server is OK but not great: a VM with 4 cores and 32 GB RAM (shared with SQL, SSRS, SSIS etc.), and no SSD.

I think you can get carried away with too many partitions and overcomplicate the design. The basic engine can do wonders with careful design.