Junk dimensions: applying the KonMari method to fact tables


Note: this is one of those optimizations that isn't worth the effort unless your app is very large. The larger the app (think multiple GB), the more mileage you'll get out of it.

The problem

Tall, wide tables require a lot of RAM. Fact tables can have a lot of rows, and if they also have a lot of columns, the result can be an app with poor performance and a large server footprint. Developers and users experience high app size and RAM consumption directly: saving the app, opening it, and running calculations all take longer.

You may be saying, "Who cares if the table is wide? Doesn't Qlik store each unique field value just once?" Yes, but the distinct field values are only one part of the data storage. Qlik also carries a data model structure shaped just like the one we see -- the tall, wide fact table -- in which every row holds a pointer to each field's distinct value, and those pointers take their own space. So each extra column adds another pointer to every one of those millions of rows. All those fields in the fact table do not bring me joy.

We know that a star schema, not one giant table, is the ideal Qlik data model. Sometimes you receive data in a wide, "flattened" format. Sometimes you are asked to improve performance on something with this existing structure. Either way, this is a problem in traditional databases too, and Qlik can borrow the same solution found in dimensional modeling.

The solution: junk dimension

According to The Data Warehouse Toolkit:

A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, we remove the flags from the fact table while placing them into a useful dimensional framework.

Translation: move some low-cardinality (few distinct values) fields from the fact table into a new table, then relate the new table to the fact table on a new key.

The fact table gets narrower, and it gains a relationship to a new table of modest size: the new table needs only one row for each distinct combination of the moved fields that actually occurs in the fact table. This suits Qlik well, because Qlik performs well with low-cardinality dimensions (few distinct key field values) even when the fact table has a lot of rows.

Here's an example of what one might look like in Qlik:
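The tables below are a minimal inline-data sketch of that shape, with hypothetical table and field names: a narrow fact table tied to a small junk dimension on a single key.

    JunkDim:
    LOAD * INLINE [
    _KeyJunkDim, OrderStatus, ShipMethod, IsGift
    1, Open, Ground, N
    2, Open, Air, N
    3, Closed, Ground, Y
    ];

    Transaction:
    LOAD * INLINE [
    OrderID, Amount, _KeyJunkDim
    1001, 25.00, 1
    1002, 110.50, 3
    1003, 42.75, 2
    ];

Qlik associates the two tables automatically because they share the _KeyJunkDim field name.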


For fields to consider moving, look for those with few distinct values.

  • Good candidates are statuses, buckets, flags, types, etc.
  • To keep results correct, don't move any fields that are aggregated in front-end calculations.
  • For performance, don't move anything used in row-level calculations, e.g. Sum(JunkDim.FlagField * FactTable.MeasureAmount) (see the expression sketch after this list).
  • And don't move any keys that are part of relationships to other tables.
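On those middle two bullets, here is a hedged pair of expression sketches (field names hypothetical). Moving a field that only filters an aggregation is generally safe, because selections and set analysis are resolved before the aggregation runs; it's row-by-row math across the new link that gets expensive:

    // Generally fine after the move: FlagField only filters the aggregation
    Sum({< FlagField = {'Y'} >} MeasureAmount)

    // Avoid moving FlagField if you need row-level logic like this,
    // which must hop across the junk dimension link for every fact row
    Sum(If(FlagField = 'Y', MeasureAmount))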

How to implement in Qlik

General steps (in English)

Note: we are assuming the fact table is quite large, so we avoid resident loading the entire thing again just to add the junk dimension foreign key. We add the key using a join instead. (A hand-written sketch of the four steps follows the list.)

  1. Generate the list of junk dimension concatenated key values by resident loading the fact table (non-distinct; distinct will be applied efficiently in the next step)
  2. Autogenerate Junk dimension dynamically, with distinct _KeyJunkDim and component fields
  3. Add Junk dimension SK (_KeyJunkDim) to fact table using a left join, then drop component fields from fact table
  4. Autonumber the concatenated key that associates the tables, to save app size
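As a point of reference before the subroutine, here is a minimal hand-written sketch of those four steps. It assumes a fact table named Transaction and three candidate text fields Dim1, Dim2, and Dim3; the subroutine generalizes this to any field list.

    // 1. List the concatenated key values by resident loading the fact table
    //    (non-distinct; DISTINCT is applied cheaply in the next step)
    TmpJunkKeys:
    LOAD
        Dim1 & '|' & Dim2 & '|' & Dim3 AS _KeyJunkDim
    RESIDENT Transaction;

    // 2. Build the junk dimension: one row per distinct combination,
    //    splitting the key back into its component fields (as text --
    //    numeric or dual fields would need more care here)
    JunkDim:
    LOAD DISTINCT
        _KeyJunkDim,
        SubField(_KeyJunkDim, '|', 1) AS Dim1,
        SubField(_KeyJunkDim, '|', 2) AS Dim2,
        SubField(_KeyJunkDim, '|', 3) AS Dim3
    RESIDENT TmpJunkKeys;

    DROP TABLE TmpJunkKeys;

    // 3. Add the key to the fact table by joining from the junk dimension,
    //    then drop the component fields from the fact table
    LEFT JOIN (Transaction)
    LOAD Dim1, Dim2, Dim3, _KeyJunkDim
    RESIDENT JunkDim;

    DROP FIELDS Dim1, Dim2, Dim3 FROM Transaction;

    // 4. Autonumber the shared key to shrink the stored symbol values
    AUTONUMBER _KeyJunkDim;

Joining from the already-distinct JunkDim rows avoids multiplying fact rows, and it spares a second full resident load of the fact table.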

Subroutine

I created a subroutine encapsulating the above because it makes it easy to test different scenarios for performance differences, or to see whether the change is worthwhile at all, before making major changes to your load script. Just call the subroutine any time after the fact table has been loaded; it runs entirely against resident data. The call is literally one line of code, and testing different scenarios only requires changing that line.

To implement:

  1. Include this file or copy and paste its contents near the beginning of your script: CreateJunkDim.qvs
  2. Then call the subroutine by passing 1) a vertical bar-delimited list of fields to move and 2) the table where they are currently found:
        CALL CreateJunkDim('Dim1|Dim3|Dim2', 'Transaction')
Check your work:

  • The LEFT JOIN inside the subroutine will not work properly if there are nulls in any of the fields you move to the junk dimension. (Null does not join to null; see the sketch below for one workaround.)
  • To make sure it worked as expected, check in the table viewer that the junk dimension key in the fact table is 100% populated (Density = 100%).
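If your data can contain nulls, one workaround (a sketch, not part of the subroutine; field names and path are hypothetical) is to coalesce nulls to a sentinel value when the fact table is first loaded, so the join always has something to match:

    Transaction:
    LOAD
        OrderID,
        Amount,
        // Nulls and empty strings become a visible sentinel value
        If(Len(Trim(OrderStatus)) > 0, OrderStatus, '<N/A>') AS OrderStatus,
        If(Len(Trim(ShipMethod)) > 0, ShipMethod, '<N/A>') AS ShipMethod
    FROM [lib://Data/Transactions.qvd] (qvd);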


Because the subroutine is dynamic -- any number of fields could be passed to it -- the code is abstract and harder to follow. Some of the load script dynamically generates more load script to be executed later. That couldn't be avoided while keeping it reusable, although I commented and traced as well as I could. (I would have trouble understanding it if I hadn't written it. Sorry.)

To compare iterations, use QSDA Pro to check the resulting app's size in RAM and its overall calculation time. Even if calculation times stay the same, if the app and RAM footprint shrink, that's still a net positive. (App save time may be harder to measure, but in the prototype I created for the screenshots and subroutine, building the junk dimension took an extra 18 seconds during reload, while the app saved 1 minute and 45 seconds faster than the original data model without the junk dimension.)

You may find the subroutine is efficient enough to use when you productionalize your solution. If so, great. However, with some additional work, the overall reload would be faster if you add the concatenated junk dimension key to the fact table during a transformation step and then "autogenerate" the junk dimension during the front-end reload. That lets you skip the first few steps of the subroutine and build the junk dimension almost instantaneously.
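A sketch of that split (library paths and field names hypothetical): the transformation layer stamps the concatenated key onto the fact QVD once, and the front-end app derives the junk dimension from the key field's distinct values instead of scanning the fact table:

    // Transformation layer: add the concatenated key while building the QVD
    Fact:
    LOAD
        *,
        Dim1 & '|' & Dim2 & '|' & Dim3 AS _KeyJunkDim
    FROM [lib://Data/Fact_raw.qvd] (qvd);

    STORE Fact INTO [lib://Data/Fact_transformed.qvd] (qvd);
    DROP TABLE Fact;

    // Front-end app: load the fact table narrow, leaving the component
    // fields behind in the QVD
    Transaction:
    LOAD
        OrderID,
        Amount,
        _KeyJunkDim
    FROM [lib://Data/Fact_transformed.qvd] (qvd);

    // "Autogenerate" the junk dimension from the key's distinct values.
    // FieldValue()/FieldValueCount() walk the field's symbol table, so this
    // never rescans the fact rows; the preceding load splits the key apart.
    JunkDim:
    LOAD
        _KeyJunkDim,
        SubField(_KeyJunkDim, '|', 1) AS Dim1,
        SubField(_KeyJunkDim, '|', 2) AS Dim2,
        SubField(_KeyJunkDim, '|', 3) AS Dim3;
    LOAD
        FieldValue('_KeyJunkDim', RecNo()) AS _KeyJunkDim
    AUTOGENERATE FieldValueCount('_KeyJunkDim');

With this split, the subroutine's resident pass over the fact table disappears from the front-end reload entirely.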
