Performance Tips For "top 5 Bs for each A" MDX Queries? (Especially SSAS)

Question

Typical MDX examples for "give me the top 5 Bs for each A" look like this:

-- return top 5 clients for each firm
select
  [Measures].[Amount] on columns,
  NON EMPTY generate(
    [Firms].[Firm Name].Children,
    crossjoin(
      [Firms].[Firm Name].CurrentMember,
      TopCount([Clients].[Client Name].Children, 5, [Measures].[Amount])
    )
  )
  on rows
from [FirmsAndClients]

I'm prototyping a UI that does a lot of these "top 5" type queries, so I'm looking for any tips to speed them up in particular, especially when crossjoin(A, B) has mostly null measure values.

In this particular case Firms x Clients was big (n=5000 x n=20,000, more or less) and sparse, and I was able to speed things up by a factor of about 100 by replacing the NON EMPTY with a filter(NOT ISEmpty) inside the crossjoin:

-- return top 5 clients for each firm
select
  [Measures].[Amount] on columns,
  generate(
    [Firms].[Firm Name].Children,
    crossjoin(
      [Firms].[Firm Name].CurrentMember,
      TopCount(
        filter([Clients].[Client Name].Children, NOT IsEmpty([Measures].[Amount]))
               5,
               [Measures].[Amount])
    )
  )
  on rows
from [FirmsAndClients]

I was hoping to get a further performance benefit from pre-warming the SSAS caches by running similar queries, but I've discovered the above query (i.e. the one with "filter") is equally slow when run against warm and cold caches. Playing with Sql Profiler, I've discovered part of why this might be: While SSAS is caching portions of the underlying cube data, it doesn't seem to cache the results from the query-as-a-whole, nor does it seem to cache the intermediate sets created by generate or by crossjoin. Thus it has to redo the generate and crossjoin and topcount each time I repeat the query. And even even though each topcount appears to take 1ms or less, those milliseconds add up when iterating across thousands of Firms.

Any tips on what I can do, at the MDX level or the cube level or the Sql Server tuning level? In the SQL world I could get some mileage out of creating an index on the columns I wanted to sort by. As far as I know there's nothing like that for OLAP, though.

Conceivably this is a more general question of what to do when you want to speed up an MDX query that involves complicated sorting and filtering (which apparently doesn't sit 100% well with the SSAS caching system).

@icCube So far whenever I try to use a named set here, the set is insufficiently dynamic; it always ends up being the top five clients overall, whereas I want the top five clients for each firm. (So far it seems to make no difference if I make the set via "with set...select", "with dynamic set...select", "create set", or "create dynamic set". Experimenting with the EXISTING keyword also doesn't seem to help.) Since I haven't yet figured out how to get sets to work the way I want, I'm not sure if they could help with performance. — Chris
Do you need to filter -> filter([Clients].[Client Name].Children, NOT IsEmpty([Measures].[Amount])) before the top5 ? did you try after. crossjoin ( [Firms].[Firm Name].CurrentMember,[Clients].[Client Name].Children), a raison why not inside the top5 (I know it's not the same result).. It's all I can think about. if no other vendors can cache sets ;-) — ic3
@icCube I think you're suggesting replacing generate(crossjoin(topcount(filter(...)))) with generate(filter(crossjoin(topcount(...)))). I tried this, and it makes things worse; it's the slowest query variant I've tried yet! :) Based on my little performance tests, I've developed a crackpot theory that putting my filter as the innermost function allows SSAS to retrieve a sparser data representation than otherwise from the Storage Engine (i.e. one where nulls aren't represented) and that this really speeds up TopCount's sorting efforts. Thanks for the tip that you guys can cache sets, though. — Chris
ok you're right. One solution is merging your two dimensions (customers and clients) in a single one, with a potential many-to-many relations. So the system can very quickly get rid of all combinations of customers and clients that are not valid 'by construction' or even better use children. Performance depends on the sizing of your dimensions... but i don't see anything else. [Firms].[Firm Name].Children, means is a member ? — ic3
add nonempty([Clients].[Client Name].Children,[Measures].[Amount]) inside your filter instead of [Clients].[Client Name].Children... — ic3

Chris Chris · Accepted Answer · 2011-12-30T05:56:23

I've played with SSAS 2008 some more, and from experience I've developed a couple principles of speed for these queries:

Empty tuple filtering can help, but it won't always: As I mentioned in my original post, it can speed things up an order of magnitude to filter out empty tuples before joining two sparse sets together rather than after. (By "sparse" I mean when you cross the two most of the measure values are null.) If you know in advance that the MDX sets you're crossing are not sparse, though, the TopCount(NonEmpty) optimization may slow things down moderately.
- Note: If you do use empty tuple filtering as a performance optimization, the best place to filter seems to be before (i.e. nested inside) the call to TopCount.
- Note: In my original post I used the pattern TopCount(Filter(... NOT IsEmpty(...))). But a pattern that might be slightly faster, and seems to be more compatible with named sets, is to use TopCount(NonEmpty(...), ....) instead.
The combination of A) using named sets and B) using [MySet].Current rather than than [My Dimension].[My Hierarchy].CurrentMember in your Generate call also seems to result in moderate performance gains.
- Note: Earlier I said named sets were giving me trouble. Once I switched to NonEmpty() for my empty tuple filtering, though, I had no problems with them.

With that in mind, here's what my queries are looking like now.

WITH
SET [Set0] as [Firms].[Firm Name].Children
SET [Set1] as generate([Set0], crossjoin([Set0].Current, TopCount(NonEmpty([Clients].[Client Name].Children, [Measures].[Amount]), 5, [Measures].[Amount])))
SET [Set2] as generate([Set1], crossjoin([Set1].Current, TopCount([Time].[Year].Children, 5, [Measures].[Amount])))
SELECT
[Measures].[Amount] on columns,
NON EMPTY [Set2] on rows
FROM Lobbying2

In this case I use NonEmpty for Set1, because Firms x Clients is sparse, but I don't use NonEmpty for Set2, because FirmClientPairs x Time is not sparse.

I also noticed that using named sets made it easier to construct nested "top-five-of-top-five-of-top-five..." type queries, which is kind of fun.

Performance Tips For "top 5 Bs for each A" MDX Queries? (Especially SSAS)

2 Answers