I would like to aggregate our IIS logs and be able to quickly perform simple queries against them (e.g. how many times was page x hit in the past month, etc.).

I'd like to aggregate this data (maybe down to 10-minute granularity on the time dimension) into an SSAS cube.

We already import the logs into tables via SSIS. The tables are becoming very large, and I'd like to start deleting older data (say, anything more than 3 months old) whilst keeping the history in the cube (so I can still query over more than 3 months). Is this a reasonable thing to do? Clearly I'd not be able to rebuild my cube easily should I wish to change it... and I guess I'd need to start backing up the cube as if it were a database?

Currently I don't have a PK on the data - would it be better to add an identity column, or to create a composite PK out of date, time and URL?

Any feedback from someone that has successfully implemented this would be fantastic :)

Thanks!

1 Answer

I haven't done exactly this, but I'll give you opinions on as much as I can:

You say the tables are becoming large - why is this a problem? Is it storage space, or speed?

If speed, consider using table partitioning to split your large tables. You can partition them by date range, then switch old partitions out into another table (reducing the size of the original); the switch is a pure metadata operation and is effectively instant. SSAS can then process from a view that unions both tables, so you can still rebuild the cube if you ever need to. A rough sketch is below.
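Something along these lines (a minimal sketch - all table, column and object names here are made up for illustration, and note that table partitioning requires Enterprise Edition):

    -- Partition the log table by month (boundary dates are just examples).
    -- dbo.IISLog is assumed to be created ON psLogDate(LogDate).
    CREATE PARTITION FUNCTION pfLogDate (date)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

    CREATE PARTITION SCHEME psLogDate
        AS PARTITION pfLogDate ALL TO ([PRIMARY]);
    GO

    -- IISLogArchive must be empty, have an identical structure, and live on
    -- the same filegroup as the partition being switched out. The switch is
    -- metadata-only, so it completes almost instantly regardless of row count.
    ALTER TABLE dbo.IISLog SWITCH PARTITION 1 TO dbo.IISLogArchive;
    GO

    -- SSAS processes from a view spanning both the live and archived rows
    CREATE VIEW dbo.vwIISLogAll AS
        SELECT LogDate, LogTime, Url, ClientIP FROM dbo.IISLog
        UNION ALL
        SELECT LogDate, LogTime, Url, ClientIP FROM dbo.IISLogArchive;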

If storage space, have you looked at compression in SQL Server (available in 2008 - not sure what version you're on)?
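If you are on 2008 Enterprise, you can estimate the savings before committing to it (the table name is again just an example):

    -- Estimate how much space PAGE compression would reclaim
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'IISLog',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';

    -- Then rebuild the table with compression enabled
    ALTER TABLE dbo.IISLog REBUILD WITH (DATA_COMPRESSION = PAGE);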

Personally I just wouldn't be happy without the ability to rebuild the cube. Also remember that the cube makes a copy of the source data (or rather the parts it uses, as per the DSV), so you may not save as much as you think if you delete old data and treat the cube as the 'storage device'. Does your cube only use a portion of the tables? What size is it compared to the underlying data?

A PK on the data is not strictly required for SSAS - BUT - I always use one, primarily to prevent duplicate loads. I also load incrementally by time (checking that incoming data is newer than the last load), but it's good to have a PK constraint as a backstop against duplicates.
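The incremental check is roughly this (a sketch with invented names - I'm assuming a single datetime column stamped on each row):

    -- Only pull staging rows newer than anything already loaded
    DECLARE @LastLoaded datetime;
    SELECT @LastLoaded = MAX(LogDateTime) FROM dbo.IISLog;

    SELECT LogDateTime, Url, ClientIP
    FROM staging.IISLogRaw
    WHERE LogDateTime > ISNULL(@LastLoaded, '19000101');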

For your PK, (Date, Time, URL) sounds good, but it depends on how busy your site is - that key would not allow two people to view the same URL in the same second. Could you add the client IP address to the PK? And what if a visitor refreshed quickly - would/could you treat that as a duplicate, and remove it in the SSIS dataflow?
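For example (hypothetical names again - the second query collapses exact duplicates before the load, as an alternative or complement to doing it in the dataflow):

    -- Widen the key with the client IP
    ALTER TABLE dbo.IISLog
        ADD CONSTRAINT PK_IISLog
        PRIMARY KEY (LogDate, LogTime, Url, ClientIP);

    -- Collapse rows identical on the key, keeping one hit per combination
    WITH Ranked AS (
        SELECT LogDate, LogTime, Url, ClientIP,
               ROW_NUMBER() OVER (
                   PARTITION BY LogDate, LogTime, Url, ClientIP
                   ORDER BY (SELECT NULL)) AS rn  -- arbitrary tie-break
        FROM staging.IISLogRaw
    )
    SELECT LogDate, LogTime, Url, ClientIP
    FROM Ranked
    WHERE rn = 1;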

Good luck, let me know if you have any questions on what I've said.