Performance hacking: "indexing" Qlik data model tables (part 2/2)

This is part 2 of a long-winded set of posts on this topic. See part 1 here. Then, welcome back.

Using it, on purpose

Armed with this information, I began testing how sorting the fact table at the center of the star by different fields affects performance in large applications. I used a very fast join load, which can be stored in a subroutine for reuse:

(Note: there is a faster way to do this, added here)

SUB IndexResidentTable( v_TableName, v_GroupFieldName )

    LET v_StartSubTimestamp = Now();

    TRACE [MESSAGE]:  Grouping like '$(v_GroupFieldName)' field values in table '$(v_TableName)' to improve calculation performance;

    // Load one row per distinct value of the field (in symbol-table order) and join it back
    // onto the table. No fields or rows are added, but the join rebuilds the table with its
    // rows grouped by that field's values.
    LEFT JOIN ([$(v_TableName)])
    LOAD
        FieldValue('$(v_GroupFieldName)', RecNo()) as [$(v_GroupFieldName)]
    AUTOGENERATE
        FieldValueCount('$(v_GroupFieldName)');

    LET v_SubDuration = Interval(Now() - v_StartSubTimestamp, 'h:mm:ss');

    TRACE [MESSAGE]:  Finished grouping like '$(v_GroupFieldName)' field values in table '$(v_TableName)' in $(v_SubDuration);

    // Clear the variables used by the subroutine
    LET v_TableName = ;
    LET v_GroupFieldName = ;
    LET v_StartSubTimestamp = ;
    LET v_SubDuration = ;

END SUB
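
For example, a call at the end of the script might look like this (the table and field names are just placeholders for your own):

// Hypothetical example: group the fact table's rows by its calendar date field
CALL IndexResidentTable( 'Fact_Sales', 'OrderDate' );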

Now, the advent of the Qlik Sense Document Analyzer has made it very easy to measure the detailed calculation time effects for a variety of scenarios. (Note: I collaborated with Rob on the Qlik Sense Document Analyzer, but not on the awesome feature of determining calculation times of objects. That was all him.)

Recommendations

I wish I could say I have found a consistent way to predict which field will work best for you and your application, but every combination of data model, key cardinality, UI design, and expressions is so different that I have found it hard to generalize. It will likely require some trial and error (made easy with the subroutine and Document Analyzer), as well as defining your goal: What are you trying to optimize? Do you want to lower the total/average time for all objects, or are you willing to forgo that to speed up one problematic object as much as possible?

Here are some ideas to test:
  • Order by a very commonly used field -- one that is involved in many calculations, ex. the date field in the fact table that is associated to a calendar table
  • Order by a very commonly used dimension in the fact table -- one that is present in many charts
  • Order by a key field to a dimension table that is frequently used*
  • Order by a field that is present as a dimension in a slow but necessary AGGR function, ex. Dim1 in Median(Aggr(Sum(Sales), Dim1))
  • If you plan to try to speed up one slow object, look very closely at the fields and key fields/associations used in it
(I am focusing on the prototypical star schema with a large fact table at the center with many-to-one relationships with dimensions surrounding it. There are other possibilities worth exploring in other data models, like sorting keys on both sides of a many-to-many relationship.)
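For that link-table idea, a speculative sketch might group both sides of the association by the shared key (the table and key names below are hypothetical):

// Speculative: order both tables that share a many-to-many key by that key
CALL IndexResidentTable( 'Fact_Orders', '%ProductKey' );
CALL IndexResidentTable( 'Link_OrderProduct', '%ProductKey' );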

You try it

  1. Pick a large app that has slower calculation performance than you would like.
  2. Start by running the Qlik Sense Document Analyzer against it, to get a baseline calculation time and to make a list of what some good candidate fields might be.
  3. Copy my subroutine from above into a tab at the beginning of your script. (It must be created before it is called.)
  4. On a tab at the end of your script, call the subroutine using this syntax: CALL IndexResidentTable( 'Table Name', 'Field Name' );
  5. For each field you would like to test (see the sketch after this list)...
    1. Change the subroutine call to refer to the target field for that test case
    2. Reload
    3. Don't open the UI, which would cache the calculation results and understate the true calculation times
    4. Run the Qlik Sense Document Analyzer and make note of the new calculation times
  6. When you have tested all scenarios, implement the one that gave you the most desirable results.
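
As a rough sketch of that test cycle, the end-of-script tab might hold one call per candidate field, with only one uncommented per reload (all names below are placeholders):

// Test case 1: common date field associated to the calendar
CALL IndexResidentTable( 'Fact_Sales', 'OrderDate' );
// Test case 2: key field to a frequently used dimension table
// CALL IndexResidentTable( 'Fact_Sales', '%CustomerKey' );
// Test case 3: dimension used in a slow Aggr() expression
// CALL IndexResidentTable( 'Fact_Sales', 'Dim1' );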
Good luck, and let me know what you find. With enough examples, we may be able to make some general conclusions about which fields work best.


*In one case, I compared 1) joining a dimension table to the fact table with 2) indexing the fact table by the key field that associates it to the dimension, leaving the dimension as an association. The calculation performance was identical, but the size on disk and RAM consumption were smaller when the table was not joined, because the data table itself stayed smaller. In short: with indexing, I got the same performance as denormalizing, without the app-footprint overhead of actually doing it.
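
As a hedged illustration of that comparison (all table, key, and field names here are hypothetical), the two alternatives look roughly like this in script -- you would use one or the other:

// Option 1: denormalize -- join the dimension's fields onto the fact table, then drop it
LEFT JOIN (Fact_Sales)
LOAD [%ProductKey], ProductName, ProductCategory
RESIDENT Dim_Product;
DROP TABLE Dim_Product;

// Option 2: keep the association and simply group the fact table by the key field
CALL IndexResidentTable( 'Fact_Sales', '%ProductKey' );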
