5
votes

I'm trying to model my data warehouse using a star schema but I have a problem to avoid joins between fact tables.
To give a trivial idea of my problem, I want to collect all the events who occur on my operating system. So, I can create a fact table event with some dimensions like datetime or user. The problem is I want to collect different kinds of event: hardware event and software event.
The problem is those events have not the same dimensions. By instance, for a hardware event, I can have physical_component or related_driver dimensions and, for a software event, software_name or online_application dimensions (that is just some examples, the idea to keep in mind is the fact event can be specialized into some specific events with specific dimensions).
In a relational model, I would have 3 tables organized like that: enter image description here
The problem is : how to handle joins between fact tables in a star schema?


I imagined 4 ideas but I'm not sure one of them are adapted to the situation.
The first one is to keep the model used in a relational database and add the dimension tables like that:
enter image description here
In this case, the problem is we still have join between fact tables and need to use JOIN SQL statement in all of our queries.
The second one is to create only 2 fact tables who will duplicate the shared dimensions (datetime and user) and to create a materialized view event who summarized all the events:
enter image description here
The problem here is: what happen if I want to make a query on the materialized view? According to what I read in the Oracle documentation, we don't have to make query directly on materialized view but we have to let the query rewrite process make its work.
The third one is to create only one fact table who will contain all the information possible for an event (hardware or software):
enter image description here
This time, the problem is my fact table will contain a lot of NULL value.
And the last one is to create 3 fact tables (without materialized view this time) like this:
enter image description here
This time, the problem is all events are present in the fact table event and in its own table. Because we will store a huge quantity of data, I'm not sure this duplication is a good idea.
So what is the best solution? Or does it exist a fifth solution?
4
How similar are the software and hardware events in their data structure (content-wise, not column names)? If similar you could just use a single dimension like "Source" with some shared columns and some non-shared that are set to 'n/a' for the type they don't apply to.Cyrus
@Cyrus Software and hardware events are totally different in their structure (number of column) and in their content (no link between the column of each table). All the shared columns are in the event table.Pierre

4 Answers

2
votes

From your description and your subsequent comments to other answers, I'd say that option 2 or option 4 are the right way to model things from a dimensional modelling perspective. Each fact should be a measure of a business process, and the dimensionality of software and hardware events seems to be sufficiently different that they warrant being stored separately.

Then, there's a case for also storing the separate events table as a view, materialised view, or plain-ol' table storing the things that are common.

Once you've decided that's the right way to model things 'logically', you then need to balance performance, maintainability, usability and storage. For dimensional modelling, usability and performance of queries take top priority (otherwise you may as well not use a dimensional model at all), and the extra work in ETL, and extra space needed, are prices worth paying.

A non-materialised view would save you the space at the price of performance, but it could be that you could give it a sufficiently awesome index or two that would mitigate that. A materialised view will give you performance at the price of storage.

I'd be tempted to create the two fact tables with indexes and a non-materialised view, and see what performance of that is like before taking further performance enhancing steps. 10 million fact rows isn't so bad, it might still perform.

A materialized view can be queried directly. But if you want to, you can use the query rewrite capabilities of Oracle so that the Materialized view is instead used as a performance-enhancer, like an index, when you're querying the original tables. See here for details: http://www.sqlsnippets.com/en/topic-12918.html Whether you choose to use it in query rewrite mode or just as a view in its own right depends on whether you want the users to know about this extra table, or for it to just sit in the background as a helpful friend.

0
votes

There doesn't seem to be a reason in your scenario to combine or link the two types of events. Having said that, you may have some reason you did not describe (for example, collecting logs from many systems and wanting to see them together easily).

So my advice is to make a single fact table with both hardware and software dimension keys. One of them is always going to be 0 or -1 (= default 'n/a' record).

This allows you to aggregate them together without UNION statements or other complicated logic and can even support events that are linked to both hardware and software if they appear in the future.

0
votes

You would never/rarely join fact tables together. You may join aggregated facts which share (conformed) dimensions (i.e. Number of software events per hour compared with number of hardware events per hour).

To me, you always have to consider the kinds of questions that are going to be asked when looking at dimensional modeling.

0
votes

Events should be a single fact. If you split them in two, you'll have a difficult time doing aggregations across both.

If necessary, you can have separate hardware and software attribute dimensions, but you should have a generic event dimension, even if it is just a junk dimension with a few simple attributes, e.g. type (hardware/software), criticality (high, low), etc.

On a side note, I've generally seen the diagrams with the arrows coming from the fact going to the dimensions. The fact table keys look at the dimensions rather then the other way around.