If it interests you to reduce calculation time by upwards of 50% with minimal coding and in a way that is invisible to users, please read on. If you only work with small applications and/or a small number of users, this won't matter as much (but still might be interesting).
I wrote about this a while ago, which Rob has also recently mentioned, but feel the topic needs deeper explanation and ideas for how to use it to your advantage.
Discovery
I noticed the effect of sorting Qlik data model tables during a project in 2011. It started with observing that the order in which I joined dimensions to a fact table resulted in different compressed file sizes and different calculation performance, despite having the same data model structure, volume, and values in all scenarios. So weird. Then I went so far as to join a table to itself, resident loading just one field -- literally the exact same table, before and after -- and that was also changing the file size and performance. Weirder still.I soon figured out why that was happening: when you join two tables in Qlik, the resulting table is ordered by the key field that was used to join -- like an ORDER BY, but ordered by the key field values' original load order rather than ascending or descending order. Thus, if you join a series of tables, the ultimate table is ordered by the last key field used for a join.
Ordering the data by a field (as opposed to the arbitrary row order) also changed how much Qlik was able to compress the data on disk and the speeds of calculations in the application.
Try it!
1. Copy and paste this script into a new application to generate our sample data set. Save and reload.
Fact:
LOAD
Ceil(Rand() * 10) as [Customer ID],
Round(Rand() * 1000, .01) as [Sales Amt]
AUTOGENERATE
100;
//LEFT JOIN (Fact)
//LOAD DISTINCT
// [Customer ID]
//RESIDENT
// Fact;
2. In the data model viewer, note that the order of Customer IDs in the fact table is random.
3. Uncomment the LEFT JOIN, save, and reload again. NB: I am doing a JOIN because it takes significantly less time and overhead for the same result, compared to resident loading the table (usually a very large table) with an ORDER BY. More details here.
4. In the data model viewer, note that the like Customer IDs in the fact table have been grouped together, as a result of the join on Customer ID.
The next post will contain a robust implementation and recommendations for testing it on your application. (I split this into two parts because I'm so wordy. Sorry.)
On to Part 2.
Fact:
LOAD
Ceil(Rand() * 10) as [Customer ID],
Round(Rand() * 1000, .01) as [Sales Amt]
AUTOGENERATE
100;
//LEFT JOIN (Fact)
//LOAD DISTINCT
// [Customer ID]
//RESIDENT
// Fact;
2. In the data model viewer, note that the order of Customer IDs in the fact table is random.
3. Uncomment the LEFT JOIN, save, and reload again. NB: I am doing a JOIN because it takes significantly less time and overhead for the same result, compared to resident loading the table (usually a very large table) with an ORDER BY. More details here.
4. In the data model viewer, note that the like Customer IDs in the fact table have been grouped together, as a result of the join on Customer ID.
Congratulations! You have just "indexed" a table in the Qlik data model.
Why it matters, even if you don't plan to use it
Whether you intend to hack the sort orders of your tables, you need awareness of this phenomenon because it is a hidden variable that affects/explains performance. Because sorting happens as a result of joins, when you are trying to change or test one variable -- joining a dimension to the fact table -- you are also changing another variable -- the fact table sort order -- that has its own effect on performance.
This can cause confusing and hard to generalize results, like joining one table and performance improving, then joining a second table and performance degrading. It may lead you to question whether denormalizing is a performance gain or loss.
Your table may have also been sorted due to a previous transformation, without your awareness, before you start your testing. In that case, if the original/mystery/baseline sort order leads to better performance than the scenarios you deliberately tested, you may think sorting is a detriment to performance.
In both cases, understanding the underlying behavior provides context to unexpected performance results.
The next post will contain a robust implementation and recommendations for testing it on your application. (I split this into two parts because I'm so wordy. Sorry.)
On to Part 2.