Star schema dimensions: to join or not to join? redux

I did further testing of scenarios from this previous post comparing joining dimensions to sorting by key fields:

in Qlik Sense
using the Qlik Sense Document Analyzer (and DevTool extension, as a comparison)

Please read that post first, if you haven't. I did not recap all context here.

Measurement notes

The Qlik Sense Document Analyzer measures object calculation times one by one, isolating an object's performance, independent of other objects on the same sheet.

The DevTool extension (and QlikView sheet object times) tests whole sheets, with objects competing for computational resources. If an object is slow, it may be independently slow, or it may just be waiting for other objects to finish consuming resources. This makes it hard to measure the effects of changes at a detailed level without creating test sheets containing isolated objects.

This is important because you will notice that the Dev Tool results have universally higher calculation times for the same scenarios below. The wait times in the DevTool may be closer to the user experience, but I tend to use the Qlik Sense Document Analyzer's total calculation time (sum of all independent objects) and improving the slowest objects as my proxy for performance, because it allows you to isolate variables better.

The results

Qlik Sense June 2019 yielded worse results by denormalizing than doing nothing, when controlling for the sorting that resulted from the joins (scenario 0 vs. 3 below). This was true regardless of the measurement method used. Denormalizing alone hurt calculation performance and increased the footprint of the app. The sorting that was a byproduct of joins was the root cause of the improved performance.

If you aren't convinced the sort order influences performance, look at the difference between joining Customer then Calendar (scenario 4a) and Calendar then Customer (scenario 4b). This seemingly arbitrary decision made a calculation time difference of 30 to 50%, depending on the measurement method.

I added another normalized test case (scenario 2) in which I sorted the fact table by both keys, duplicating the sort order of the final fact table in the best performing normalized use case (scenario 4b). It was modestly the best overall calculation time, plus a 37% smaller RAM footprint, leading to a shorter save/distribution time and better ability to scale on the same hardware. Sorting alone bested the best possible calculation times of the denormalized scenarios.

Qlik Sense Document Analyzer: serial object calculations

DevTool: simultaneous object calculations

Grains of salt

This example UI was a perfect setup to supercharge performance due to the similarity of the charts. Your mileage will vary. Most charts share the same first dimension, and the expressions have no Set Analysis or complex operations. In the real world, I have sorted fact tables to great effect and sometimes to almost no effect on performance. It depends on a lot of factors in the data itself, the UI, and what performance you are trying to optimize, average object performance or a few problematic objects.

Denormalizing still helps in some cases. My data model target is still generally a star schema rather than a snowflake to minimize the number of hops between one end of the data model to the other. The footprint increase that results from combining smaller tables like attributes and dimensions is likely to be trivial.

Confusions

Joining/denormalizing was traditionally considered a tradeoff of better calculation times for a larger app footprint, but I would consider that myth busted. Joining does increase the app footprint, but it also changes the resulting table's sort order, which is a hidden influence on performance that is perhaps more important than the data model structure. The sort order is something that can be changed, in isolation, without joining and increasing the app footprint. If you plan to denormalize in an effort to improve performance, test the before and after, plus another test case sorting by the key to the dimension instead of joining.

On the surface, the six scenarios I tested were just two different data models, but the calculation results varied widely, depending on the final sort order. The data model structure was ultimately a less important influence on performance than the fact table sort order.

Conclusions (from all four related posts)

Reduce the app footprint as much as possible before you start testing optimizations: field count, then field footprint, then row count -- not a new conclusion! (written about in more detail here)
Changing the sort order without changing the data model can mean better performance and a lighter application. Explore keeping the same data model structure but sorting keys in the fact table (Scenario 2 vs. all denormalized examples).
Ordering by the key to a dimension yields similar performance to joining that dimension, without the increased app footprint.
Be cautious when denormalizing with high row count fact tables (100M+ rows). Anecdotally we have seen the additional bloat from joining create worse performance than leaving the data model normalized (Scenario 0 vs. 3).
When you are testing, be aware of upstream transformations that may have already sorted your control scenario, which can make results hard to interpret.
For general improvement

Focus on the highest cardinality dimensions, i.e. most distinct values in the key field
Focus on frequently used dimensions or keys to frequently used dimensions, especially the first dimension used in charts
Low-cardinality fields and dimensions don't seem to make much of a difference
Pick one sort and go with it. The impact from the secondary sort (scenario 2 vs. 1) rarely makes a meaningful improvement, except in the specific AGGR scenario mentioned below with multiple dimensions listed. This will also help reduce your number of test cases.

For improving an individual object

Look closely at the first chart dimension and the key fields/associations involved in it as sorting candidates
If the AGGR function is used, sort by the dimension(s) in the expression, ex. Dim1 and Dim2 in Median(Aggr(Sum(Sales), Dim1, Dim2)). If these fields are in different tables, it will likely be worthwhile to move them to the fact table for AGGR and to simplify the sorting.

If you do denormalize, test different orders of joining tables, which affects the final sort orders and performance, even though the data model looks the same. (Scenario 4a vs. 4b)
Test and measure, test and measure. Use data to decide the best approach for your app.
Don't worry about any of this for small, fast applications :)

Source files

Original QVW from Vlad
QVF I used for testing (requires bundled Variable and DevTool extensions)

MICHAEL STEEDLE

Search This Blog

No code, on-demand language translation in Power BI