0
votes

Typically measures in an OLAP cube are a numbers, and these numbers get aggregated via addition (or via some not-very-exotic function like times or MAX or MIN). I'm wondering if any of the major OLAP servers let you make measures that are sets of numbers, or sets of strings. ("Set" here is in the mathematical sense, not the OLAP jargon sense of "a list of tuples".) Whereas OLAP typically aggregates measures via numeric functions, my hypothetical "set measures" would be aggregated via set operations, e.g. set union or set intersection.

I'm interested both theoretically and practically. Theoretically/abstractly/mathematically, there's a nice parallel between addition-over-integers and union-over-sets, and it seems like someone could have considered this parallel in writing an OLAP server. (One potential implementation, if all the possible set members were known in advance, is to represent each set as a (potentially large) integer, and then to aggregate/union by performing bitwise OR.)

As for practice, I'll attempt to provide a concrete case where this might seem at least marginally useful: Suppose you had a dataset where each fact was the metadata associated with an academic paper. Each paper might have a date, a topic, and a set of one or more authors, like so:

  • fact1: {"Year": 1997, "Topic" : "AI", "AuthorSet": ["Bill Jones", "Martha X"]}
  • fact2: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Sam S"]}
  • fact3: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Jack X"]}
  • etc.

(I'm using quasi-JSON here only because it helps make it obvious what's multi-valued.)

If you created an OLAP cube around this data, it would seem very natural to be able to make a report showing how the set of authors writing on a given topic changed from year to year. In MDX, it might look like this:

select
[Measures].[AuthorSet] on columns,
[Year].[Year].All on rows
where ([Topic].[Topic].[AI])

For each year, this query would roll up the list of authors via set union.

To get the very most out of this feature you'd probably need custom OLAP client tools that knew about set measures in particular. But for existing clients you could probably just fall back to some string representation of a set. (e.g. the above query could return cells containing, e.g., the string "Bill Jones; Martha X; John Q; Sam S; Jack X")

I'm most familiar with SSAS, and SSAS doesn't seem to support anything like this out of the box. It seems like there might be a way to hack it together using measures of string type, maybe plus custom CLR functions, but I haven't figured it out yet.

This particular case you might also be able to solve in MDX only, maybe using custom members, Generate and SetToStr? (Solutions welcome!) But my intuition is that as things get more complicated it'd be more natural to have the set aggregation happen more naturally behind the scenes, with this "set member" functionality.

2
perhaps I'm missing something, but couldn't you accomplish this with a Factless Fact Table?Bill Anton
@iPolvo, good idea. You should provide this as a separate answer so I can upvote. I'm unclear if it scales to more complicated cases, though. Here's a contrived example: Suppose each academic paper had a "score", and for each year I wanted not just the set of authors but also the average score of papers that year. If I had my "set members" feature, I could do this w/ 1 single MDX query: "with member [Measures].[AverageScore] as [Measures].[Measures].[Score] / [Measures].[Fact Count] select [Year].[Year].All on columns, {[Measures].[AuthorSet], [Measures].[AverageScore]} on rows".Chris

2 Answers

1
votes

Here's what I had in mind using a Factless Fact tables...

FactlessAcademicPaper
   YearKey_FK
   CategoryKey_FK
   AuthorGroupKey_FK

DimYear
   YearKey_PK

DimCategory
   CategoryKey_PK

FactlessAuthorGroup
   AuthorGroupKey_PK
   AuthorKey_PK

DimAuthor
   AuthorKey_PK

This...

fact1: {"Year": 1997, "Topic" : "AI", "AuthorSet": ["Bill Jones", "Martha X"]}
fact2: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Sam S"]}
fact3: {"Year": 1997, "Topic" : "Linguistics", "AuthorSet": ["John Q", "Jack X"]}

...becomes this...

**FactlessAcademicPaper**
   Year:           1997
   Topic:          AI
   AuthorGroupKey: 1

   Year:           1997
   Topic:          Linguistics
   AuthorGroupKey: 2

   Year:           1997
   Topic:          Linguistics
   AuthorGroupKey: 3

**FactlessAuthorGroup**
   AuthorGroupKey: 1
   Author:         Bill Jones

   AuthorGroupKey: 1
   Author:         Martha X

   AuthorGroupKey: 2
   Author:         John Q

   AuthorGroupKey: 2
   Author:         Sam S

   AuthorGroupKey: 3
   Author:         John Q

   AuthorGroupKey: 3
   Author:         Jack X

Now if you wanted to track scores of the papers, then you would just add it to the FactlessAcademicPaper table...

FactlessAcademicPaper
   YearKey_FK
   CategoryKey_FK
   AuthorGroupKey_FK
   Score
   '1' AS PaperCount

And as you can see above, I've added a calculated field called PaperCount which should make it easier to compute the average score.

1
votes

My comment is biased as I'm working with the team building icCube OLAP Server.

Except in the Spatial OLAP field, I don't know of OLAP servers dealing flexible with other as basic Measure types. There are some techinical difficulties: you've to enter the new types into the system (it's very one column one measure), store them internally (this with mio of rows), aggregate (again this over mio of rows) and eventually send the result back to the client (XMLA). From our side the server is mostly ready to get new types; we wanted to support VaR and for this you need to support matrices and vectors. Here as for other vendors is a question of a customer being interested enough.

Now for you particular problem as 'iPolvo' is pointing out this can be solved at dimension or hierarchy level. You can define a calculated method Generating a string with the existing authors.

How you solve this depends on your sizing. For performance, if the model is not big, I'd model this as a single dimension where you can use children(). No need to have facts here. On the opposite side, bug size, you can build three dimensions and us nonempty over all existing authors (that's might be not fast). You can indeed combine both..

For this you need to transform

fact1: {"Year": 1997, "Topic" : "AI", "AuthorSet": ["Bill Jones", "Martha X"]}

in

fact1.1: {"Year": 1997, "Topic" : "AI", "Author": "Martha X"}
fact1.2: {"Year": 1997, "Topic" : "AI", "Author": "Bill Jones"}