I've read the Data Vault book end to end, but I'm still trying to resolve one specific question: how do you populate the link tables, i.e. how do you obtain all the hash keys they need? The Scalefree blog post on massively parallel processing demonstrates that hubs and satellites can be loaded fully in parallel, but it doesn't go into much detail about link tables.
Links require hash keys, and therefore business keys from multiple tables, to establish the relationships; after all, that's what links do: they record relations between hubs. I haven't found a good, in-depth explanation of how to retrieve the business keys of the related entities when populating these link tables.
For a single table like 'customer', the hub and satellite are easy: just hash the business key and load both of them in parallel.
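To be concrete, this is roughly what I mean by hashing the business key (a minimal Python sketch; the MD5-of-trimmed-upper-cased-key convention and the table/column names like hub_customer and "CUST-1042" are just illustrative assumptions, not a claim about the "official" way):

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Hash key in a Data Vault 2.0 style: trim, upper-case, delimit, then MD5."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# The same hash feeds both hub_customer and sat_customer, which is why the
# two loads can run fully in parallel without looking anything up.
customer_hk = hash_key("CUST-1042")
```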
But a customer-details table or a transaction table from an OLTP system needs some kind of join to look up the business key of the customer, or of all the entities related to the transaction (product, customer, store, etc.), because those tables typically don't store (all of) the business keys as attributes.
If I assume staging is loaded incrementally and truncated, then staging doesn't necessarily contain all the entities needed to perform those joins there. How do I resolve this dilemma and create a design that works?
- Join the tables in the source OLTP systems, generate the business keys there, and propagate them downstream as hashes? (This goes wrong if the business key was chosen incorrectly.)
- Use a persistent staging area, i.e. never truncate? (Then it's always possible to join against any table in there to resolve the keys.)
- Maintain some kind of index from surrogate keys to business keys and perform lookups against it? (This reduces I/O a bit further and is a mix between incremental and persistent staging; see the sketch after this list.)
- some other method...?
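To make that third option concrete, this is roughly what I imagine (a minimal Python sketch; the key-map structure, the transaction fields, and the decision to include txn_id in the link hash are all hypothetical choices of mine, not something taken from the book):

```python
from hashlib import md5

def hash_key(*business_keys: str) -> str:
    """Same hashing convention as above: trim, upper-case, delimit, MD5."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return md5(normalized.encode("utf-8")).hexdigest().upper()

# Hypothetical surrogate-key -> business-key map, maintained alongside staging,
# so related entities can be resolved even when they aren't in the current batch.
key_map = {
    ("customer", 17): "CUST-1042",
    ("product", 88): "SKU-9931",
    ("store", 3): "STORE-NYC-01",
}

# A transaction row from the OLTP source that only carries surrogate foreign keys.
txn = {"txn_id": "T-500", "customer_id": 17, "product_id": 88, "store_id": 3}

# Resolve the surrogate keys to business keys via the lookup, then build the
# link hash key from the combined business keys.
customer_bk = key_map[("customer", txn["customer_id"])]
product_bk = key_map[("product", txn["product_id"])]
store_bk = key_map[("store", txn["store_id"])]

link_transaction_hk = hash_key(txn["txn_id"], customer_bk, product_bk, store_bk)
```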
Essentially, what is the best practice for generating the hash keys for all the foreign-key relations of your OLTP systems?