Why is the combined size of partitioned QVDs smaller on disk than just creating one QVD?

Note in this screenshot that the single consolidated QVD is larger than the combined size of ten equal-row-count slices of the same data, each stored as its own QVD. A customer recently noticed this phenomenon and asked why it occurred.


To start: why partition QVDs?

Partitioning saves read/write time when incrementally updating very large data sets. If a QVD contains 100M rows comprising 3 years of history, the entire 100M-row QVD must be loaded from and written back to disk, even if just one row is updated. If the QVD is instead partitioned by, say, transaction month, the chances are good that only a few (or even just one) of the partitioned QVDs would need to be updated for a given reload, leaving dozens of historical QVDs untouched.
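As a minimal sketch of what that might look like in load script (the table, field, and connection names here are placeholders, not taken from the screenshot above; it assumes a fact table named Transactions is already in memory with a text-formatted TransactionMonth field such as '2024-01' and a lib://Data folder connection):

    // Slice the in-memory fact table by month and store one QVD per slice.
    FOR Each vMonth in FieldValueList('TransactionMonth')

        [Slice]:
        NoConcatenate
        LOAD * Resident Transactions
        WHERE TransactionMonth = '$(vMonth)';

        STORE [Slice] INTO [lib://Data/Transactions_$(vMonth).qvd] (qvd);
        DROP TABLE [Slice];

    NEXT vMonth

A consuming app can still read everything back with a single wildcard load (LOAD * FROM [lib://Data/Transactions_*.qvd] (qvd);), so nothing downstream needs to know the data was sliced.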

Because this also reduces the amount of time that any app "touches" an individual QVD, whether reading or writing, it decreases the likelihood of a conflict in which one app reads a QVD while another tries to update it (or vice versa), a conflict that can cause reload failures.

Note that there may be a small amount of additional reload overhead, overall, in reading from a suite of partitioned QVDs rather than a single consolidated QVD. You should weigh the pros and cons for your unique situation. (Bear in mind that once the data is recombined, it doesn't matter whether it came from one QVD or many: the in-memory data model is detached from its sources, so partitioning will have no impact on UI app performance.)

Understanding QVD size, consolidated and partitioned

So if you partition your data, why might the combined size of the partitioned QVDs be smaller on disk than that of a single consolidated QVD?

It sounds counterintuitive, based on what I know of how Qlik organizes data. Many people know that Qlik stores each unique field value just once, in a symbol table. But if the same value recurs in each partition (e.g. Product IDs when partitioning by Month), those Product ID values have to be stored once in each partitioned QVD instead of just once, overall, in a consolidated QVD.

On the surface, you might therefore expect the partitioned QVDs to be collectively larger than the consolidated QVD, but more goes into storage requirements than the user-facing field values.

1. Number of distinct values in a field: determines the number of bits required to store pointers

  • As soon as the number of distinct values crosses the next power-of-two threshold (2, 4, 8, 16, 32...), there is a step up in the storage required for that column's pointers in the data table.
  • I believe this is what makes the partitioned QVDs smaller. In an example I created, partitioning 10M rows into 10 equally sized QVDs reduces the number of bits required to store the primary key from 24 bits in the 10M-row QVD to just 20 bits in each of the ten 1M-row QVDs (see the short script sketch after this list).
    • 10M distinct field values: Ceil(Log(10000000) / Log(2)) = 24, so a 24-bit pointer column is required
    • 1M distinct values in each partition: Ceil(Log(1000000) / Log(2)) = 20, so a 20-bit pointer column is required
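As a quick sanity check, those widths can be reproduced in a couple of script lines (the variable names are just illustrative):

    // Pointer width is driven by the distinct-value count: Ceil(Log2(N)) bits.
    LET vBitsConsolidated = Ceil(Log(10000000) / Log(2));   // 10M distinct keys -> 24 bits
    LET vBitsPartition    = Ceil(Log(1000000) / Log(2));    //  1M distinct keys -> 20 bits
    TRACE Pointer width: $(vBitsConsolidated) bits consolidated vs. $(vBitsPartition) bits per partition;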

2. Length of distinct, human-readable values in a field: each must be stored once, and longer values (like free text) require more storage

  • The storage from this would likely increase with partitioning, and it was the first thing that came to mind for me. If the same values are repeated in different partitions, those values have to be stored once in each QVD slice instead of just once, overall. But the increase may well be smaller than the reduction in storage from the first point (a rough illustration follows).
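To put rough numbers on that increase (these are invented, purely illustrative figures, not measurements from the example above): suppose 20,000 distinct Product IDs averaging 10 bytes each recur in all ten slices.

    // Illustrative only: the cardinality and average value length are assumptions.
    LET vDistinctProducts = 20000;
    LET vAvgValueBytes    = 10;
    LET vPartitions       = 10;
    LET vSymbolsConsolidated = $(vDistinctProducts) * $(vAvgValueBytes);                    // stored once: ~200 KB
    LET vSymbolsPartitioned  = $(vDistinctProducts) * $(vAvgValueBytes) * $(vPartitions);   // once per slice: ~2 MB
    TRACE Symbol storage: $(vSymbolsConsolidated) bytes consolidated vs. $(vSymbolsPartitioned) bytes partitioned;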

3. Number of rows in a table/QVD: determines how many pointers must be stored in the data table

  • The total row count is the same with or without partitioning, because the partition row counts are perfectly additive. But it is related to the first point: every row stores a pointer, so the pointer width from point 1 is multiplied across all rows (see the sketch below).
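Combining points 1 and 3 for the primary key column in the 10M-row example (again only a sketch; real QVDs add header and other per-column overhead):

    // Data-table pointer storage for the key column, consolidated vs. partitioned.
    LET vRows = 10000000;
    LET vKeyBytesConsolidated = $(vRows) * 24 / 8;   // 10M rows * 24-bit pointers = 30,000,000 bytes
    LET vKeyBytesPartitioned  = $(vRows) * 20 / 8;   // 10 slices * 1M rows * 20-bit pointers = 25,000,000 bytes
    TRACE Key pointers: $(vKeyBytesConsolidated) bytes consolidated vs. $(vKeyBytesPartitioned) bytes partitioned;

That is roughly a 5 MB saving on just one column, which can outweigh the duplicated symbol values from point 2, and any other column whose distinct-value count drops per slice narrows in the same way.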

So I believe the mystery is solved, and it reinforces how Qlik works behind the scenes; understanding that makes seemingly puzzling behavior easier to explain. It was also a good excuse to discuss the potential advantages of partitioning QVDs when incrementally updating large data volumes, and the effect that reducing the number of distinct values in a field has on an application's footprint, a frequent recommendation when optimizing application performance.
