1
votes

I recently bought a Professional license (coming from Personal) and I loaded some data into postgreSQL. This is my first time using a database but I have read a lot about them (SQL Demystified, Data Warehouse)

My habit with .csv files was to create calculated data within the raw data. I use pandas (python analysis library) to clean my files, add columns, do joins etc. This helped me simplify some calculated fields in Tableau, and my end-users would sometimes open these files and it was nice having the extra fields.

Example: In pandas I can create a column which is based on some logic and math on other columns. This would create a column in the raw data which is 'redundant', because it can be calculated.

Or I can calculate that with IF statements or LOD calculations in Tableau. I have read that creating columns which can be calculated and storing them in a database is a no-no. On the other hand it seems a bit silly having Tableau do extra work/calculations for values which are persistent and common.

(Example, inbound_interaction = 1 if a certain column = 'inbound' and another column = not null) (Example 2, total_time = sum of several columns)

Any thoughts or best practices on this subject? Since I am literally just starting fresh this week, I might as well get off to a good start.

1
The trade off that I would consider in storing a calculated value in a database would be based on the data and time needed to calculate the value, for example I would consider storing a balance value in a table of accounts even though I could calculate the balance by iterating over a list of transactions. Of course this comes at a cost because now the database can become inconsistent. Avoiding any calculated value storage can keep the database consistent but this requires calculation everytime you need a value.Jason Sperske

1 Answers

2
votes

The question is a bit broad and there is no golden bullet answer to it. Certain costs you have to consider are

  • Having stale vs fresh data
  • Space used by the materialized columns
  • The cost of computation on the server vs the client
  • Cost of maintaining the materialized columns

Now we have 4 solutions:

  • Raw data in the tables, extra calculations done by the client
  • Views that present calculated data to the client but which are done on the server on the fly, these can be aided by indexes on expressions. Also the views make it easier to choose what data is accessible by given user and you can have many without taking space.
  • Materialized views which are like views but store calculated data, are not fresh but are easier to maintain than some self-made materialization. Indexes are also possible on them
  • Tables that contain data calculated by applications outside of the database, for the things that your database just can't handle

I'd recommend to go with indexes and if that doesn't perform well switch to materialized indexes. Of course you may adopt a partial solution as some data is needed always up to date, other data might take a lot of time on the server but is not accessed often enough to store it materialized and some other might be really fast to compute but accessed so often that materializing it will make the whole system perform much better.