On Links

Petr Beles, 2150 GmbH, https://elementor.2150.ch 2017-02-11

Links representing transactions in Data Vault

In this article I’m discussing only links representing transactions. I deliberately exclude links which represent relations (i.e. master data relations = giving context to an object like the link between a city and a commune). A row in a conventional link in Data Vault is defined by a unique combination of relations (represented by the hash of the compound BKs of all involved hubs) plus potentially including a dependent child key. (Building a scalable DWH with DV 2.0 – Linstedt & Olschimke 4.4.5)
Order UID Sales Order ID Order Line Number Customer Product Quantity Price Total
345789 SO-02155 1 Smith Bottle of Wine 6 20$ 120$
All relations would be used to identify this row and all attributes would be loaded into a link satellite In the real world transactions often do have unique business keys (the number that is communicated to the customer – not the technical key). Let’s take an order as example: an order has an order number and every line of this order has a line number. Even if the line number on its own is a weak entity it is uniquely identifiable if combining the transaction id with the line number. Even if the line number is not printed on an order slip it is there and can be addressed by a customer on the phone.
Order UID Sales Order ID Order Line Number Customer Product Quantity Price Total
345789 SO-02155 1 Smith Bottle of Wine 6 20$ 120$
Identifying parts as defined by the business process. This means that the hash key[1] depends on identifying relations like the one to the transaction (if modeled explicitly) and the dependent child key like the order line number. The problem is that also non-identifying parts such as the customer or the product relation are used to build the hash key for a row in the link table. In other words, the hash key representing the combination of all business keys plus the dependent child is in fact a hashdiff key and not a transaction identifier. There is nothing wrong with that. A hashdiff key can be a very powerful tool to determine whether a row needs to be inserted or not as you just need to read the index on that column to decide about adding a new row to the link table. If the non-identifying relations of a transaction never change after the transaction was submitted to the Data Vault the grain of the link is even the same as the grain of the transactions. The problem is: in many industries non-identifying relations of a transaction might change or arrive after it was submitted to the Vault such as customers being exchanged or even products. In my experience in the manufacturing industry this are standard business processes. One example I like is the process where a potential customer calls to check for availability of a product: the call center agents first creates an order for a dummy customer as it is not worth opening a new customer account if the product is not in stock. After availability is checked they write down the contact information on paper and enter it later after the customer hangs up so the customer doesn’t need to wait on the phone if the it system is not responsive. If they have a high call volume they take care of entering it sometimes the next day when the first version of the transaction was already added to the vault. Another example is when the product of an order is exchanged for a newer product as the ordered one is not available anymore but the new one is a perfect replacement. Both valid business processes lead to a second entry in the conventional Data Vault link. Does this cause a problem for the link tables as they are described in the default? Not from a technical point of view: for every change of the customer or product a new hash key is generated and a new row is inserted. But it is not a SCD Type 2 style table as every combination of transaction to customer/product is inserted only once. If you track the validity of the different version in a link satellite you can capture all possible combinations of incoming changes. To put it short: I believe the conventional design is very well designed for fast write speeds, can track all incoming changes and has proven to work in the real world.

So why change something that is working?

Honestly because I never understood the grain of the link table: a transaction has initially one row when it arrives the first time. If the customer of an order changes as described above a new row is added. So, we do have now two rows in the link table for the same transaction. If the order is reverted to the first customer it has still two rows. Don’t get me wrong: all the information is there if you have a tracking satellite and know how to query it. But to explain on a logical level what the grain of the table is to a report designer is somehow complicated, at least for me. Yes, you can and you have to abstract this scenario on an interface between the Vault and the reporting layer but you just get the complexity there. And the second and more important issue is a logical one: link to link links have to be omitted as not only you would break your agile processes due to dependencies but also you would link versions of your transaction to each other instead of the transaction itself. Additionally, this solution is not optimal if you have relations that are late-arriving like tickets which are assigned to customers later in the business process.

So what are the changes that could improve the situation in my personal opinion?

  • Create one hub that is always representing the finest grain of your transaction. In the example above: sales order line with a BK of order id in combination with the line number[2]
  • Include the link to this hub in your transaction. Get rid of the dependent child key in the link as it is not necessary anymore
  • Connect all context attributes of your transaction to the hub as a normal satellite instead of connecting it to the link
  • Connect the tracking satellites that check of missing keys in the source to this hub instead of the link
  • Decide if your transaction can change after it was submitted to the Data Vault
    • If yes: load your link as SCD Type 2 load based on this driving hub – You can create a virtual or materialized SCD Type 1 view on top of this link by partition by the keyed-instance hub key
    • If no: load your link as SCD Type 0 load based on this driving hub (or create a SCD Type 0 view on top of your SCD Type 2 table if you don’t trust your source – virtually or materialized)
  • If performance is an issue you can still use the hashdiff approach to load the SCD type 2 link table (for SCD type 0 you can do the comparison only on the hash of the BK of the driving hub)
  • If you need to link two transactions just create a link between the two keyed-instance hubs based on the right grain of these transactions
Hub Sales Order Line
Sales Order ID Order Line Number Sales Order Line Hash
SO-02155 1 85036a318c424cdd…
 
Link Sales Order Line
Load Time Sales Order Line Hash Product Hash Customer Hash
02.11.2012 11:22 85036a318c424cdd… 2bdf8a1484a74e1d… 1e292c434a272b21…
 
Sat Sales Order Line
Load Time Sales Order Line Hash Quantity Price Total
02.11.2012 11:22 85036a318c424cdd… 6 20$ 120$
Technical representation of hub, links and satellite. Identifying parts of rows in yellow. Some technical fields, tracking satellite and Sales Order Hub omitted for readability. The funny thing is, when you look at what you need to do technically, you do exactly the same as for loading a satellite: you compare attributes (linked hash keys) based on the hash key of the hub.[3] Additionally, adding non-identifying link parts becomes absolutely agile as they do simply connect to the same hub and can be unioned with the existing link in a lossless manner. And as last point you can put optional parts of a link into a separate table which is sparsely populated. With the hub holding the right grain of the transaction you can bring the mandatory and optional parts together. Just apply the same pattern as when combining different satellites.
Link Sales Order Line Store
Load Time Sales Order Line Hash Store
04.11.2012 11:22 85036a318c424cdd… 416360b329ad4ddab39dde7ee5c97479
Additional Link modeled on its own to capture the either later implemented, optional or late arriving link part

Is this opposing the Data Vault standard?

Honestly for most of the parts I believe it doesn’t:
  • Nothing should prevent you to create a hub at the lowest grain of a transaction and call it “keyed-instance” hub
  • Store attributes dependent on a hub key as satellite as opposed to a link satellite? I don’t see a problem either.
  • Is it possible to implement links which are really identified by all linked objects: yes that is possible by combining all the keys into the hub.
  • Can everything that is delivered be reproduced from the data stored: yes
  • Is losslessness (Verbundtreue) guaranteed: yes
  • Anything different for non-historized-links: no
There is one but: by implementing the link as SCD type 2 load based on the keyed instance hub changes the amount of rows in the link table. In my example with an order which switches from customer A to B and back would have only 2 rows in the original pattern:
Data Vault Standard Link
Load Time Sales Order ID Order Line Number Customer Product
02.11.2012 11:22 SO-02155 1 Smith Bottle of Wine
04.11.2012 11:22 SO-02155 1 Miller Bottle of Wine
Logical representation. Technically Customer and Product would be stored as hashes and a link hash added based on the orange fields. 3 rows would be found in the tracking link satellite In the new design it has now 3 rows. The PK is not the hashdiff key based on all parts but the keyed instance hub hash key + the load time.
Link Sales Order Line
Load Time Sales Order Line Product Customer
02.11.2012 11:22 SO-02155 & 1 Bottle of Wine Smith
04.11.2012 11:22 SO-02155 & 1 Bottle of Wine Miller
06.11.2012 11:22 SO-02155 & 1 Bottle of Wine Smith
Logical representation So yes, there is a small deviation from the standard which in my opinion makes sense as it makes querying the table much simpler: if you want to get the last version of the transaction just partition it by the driving hub key and take the latest load date. No need to join any other tables to find the latest version as you would need to do in the original design. Same for creating a complete history of the transaction: just partition by the driving hub key. And most important the grain is defined again: every version of the transaction over time gets its own row.

So shall I use it

It is up to you. But I’m confident that the pattern works as described and we proved it in real-world scenarios including some real-time processing and I like the benefits that come with it. If you like the original pattern better and you understand what the hash key represents and you are confident you will not run into the described problems later – why not? So do I use it for all links? No! But I do use it for all links representing transactions. Links between master data representing context are a different story I will discuss separately. [1] If I write hash key I mean hash or any other form of surrogate key. Only for DV2.0 hash keys are mandatory. [2] I learned from the Genesee academy that they promote this hub as “keyed instance” [3] As it is working technically very similar to normal satellite loads you can also add normal context attributes to the link and create a non-historized-link which includes, besides the hash keys, also some context columns without technically changing anything in the loading procedures. Please be aware that this is not a generally recommended design and should be only used in high volume / near real time scenarios