Filter versus Sum If performance in Tableau

BI platforms usually work best if you assume they work similarly to a relational database. Messy calculations generate messy queries, complex interfaces generate a lot of queries, normalized schemas generate extra join overhead, etc. Many of the same adages apply for BI performance as database performance, even though the BI layer may be something done locally with data stores on the BI server, without directly involving a database.

For platforms that enable you to push work to a database through live query options, like Tableau, Power BI, and ThoughtSpot, understanding what SQL is generated and how to create efficient queries in the database are critical to decent performance -- yes, even if you have a powerful cloud database, like Snowflake. Besides affecting performance and thus the user experience, inefficient queries may also cause you to have to pay for a larger compute warehouse than you would otherwise need.

It was with this in mind that I set up a quick test in Tableau to see the performance difference between two ways of filtering data, and then I dug into the generate VizQL to see if I could understand why.

Setting up two "filter" scenarios

My data model in this case is a fully flattened extract, so all fields are part of the same physical table. (This is a common Tableau data model.)

First create a filter condition. My test case is to filter my table to only rows for the year 2015, and aggregate the results to the month level. Similar realistic scenarios might be including or excluding certain transaction types or a more dynamic point-in-time calculation, like Is CYTD.

We can reuse this modular, binary calculation for both filter scenarios to ensure consistency across them.


Then create a measure. Using the condition we just created, we will create a Sum If-style measure to return only the Locate Count from 2015.


We will also use the condition above as a filter, but no special measure is needed for that chart.

Then create the charts. I am going to create the same visualization twice, using different ways of filtering the data.

The first is the Sum If version of the chart: a line chart with 2015 Closed Locates by Month of Year.


Next, duplicate that sheet and name it Filter. Replace the 2015-only measure with a simple sum of Closed Locate Count. For the moment, we are returning data from all years, not just 2015.

Finally, drag Is 2015 to the Filters pane and set the value to true. The results should match across sheets now.


I also created a blank sheet to use as my landing page, where I saved my Workbook before closing it to conduct the tests. All tests start on the blank page, where I can start the performance recording before moving onto the sheets with visualizations.

Results, and the underlying VizQL

Now, we test performance. Dust off the Tableau Performance Recorder, which will log the calculation performance as well as the generated VizQL, Tableau's query language. 

My result was that the Filter query was on average about 40% faster than the Sum If. (I did tests activating the sheets in different orders to account for any additional time waking up the extract on the first sheet with visuals.)

This is the result from one of the iterations only.


Average Calculation Time Generated VizQL
Sum If .25 seconds SELECT
TABLEAU.TO_DATETIME(DATE_TRUNC('MONTH', TABLEAU.NORMALIZE_DATETIME(""Fact.csv"".""Month Start Date"")), ""Fact.csv"".""Month Start Date"") AS ""tmn:Month Start Date:ok"",
SUM((CASE WHEN (CAST(EXTRACT(YEAR FROM ""Fact.csv"".""Month Start Date"") AS BIGINT OR NULL) = 2015) THEN ""Fact.csv"".""Closed Locate Cnt"" ELSE null::int END)) AS ""usr:Calculation_588282743316205569:ok""
FROM ""Extract"".""Fact#csv"" ""Fact.csv""
GROUP BY 1

Filter .17 seconds SELECT
SUM(""Fact.csv"".""Closed Locate Cnt"") AS ""sum:Closed Locate Cnt:ok"",
TABLEAU.TO_DATETIME(DATE_TRUNC('MONTH', TABLEAU.NORMALIZE_DATETIME(""Fact.csv"".""Month Start Date"")), ""Fact.csv"".""Month Start Date"") AS ""tmn:Month Start Date:ok""
FROM ""Extract"".""Fact#csv"" ""Fact.csv""
WHERE (CAST(EXTRACT(YEAR FROM ""Fact.csv"".""Month Start Date"") AS BIGINT OR NULL) = 2015)
GROUP BY 2


Notice that in the VizQL, the condition ended up in different places -- the Sum If query has the condition in the SELECT, while the Filter query has it in the WHERE clause.

Note: For accurate performance testing in Tableau Desktop, clear your cache between test cases -- especially if retesting something you have already run. Here is a knowledge base article about how to clear your cache. Briefly: in C\:Users\<username>\AppData\Local\Tableau\Caching, delete the contents of the TemporaryExtracts and ExternalCacheV1 folders.

Tying the results back to how most databases work

The above queries were executed against a Tableau extract, but what might we expect running similar queries against a relational database? We would expect similar results -- the WHERE clause would likely be more efficient.

In relational databases, the WHERE clause is executed before the SELECT, narrowing the dataset before it gets to the SELECT, so the SELECT runs against less data. Things done in the SELECT tend to be less efficient. Here is an example order of operations, from Microsoft:

  1. FROM
  2. ON
  3. JOIN
  4. WHERE
  5. GROUP BY
  6. WITH CUBE or WITH ROLLUP
  7. HAVING
  8. SELECT
  9. DISTINCT
  10. ORDER BY
  11. TOP

It can get more complicated if your platform tries to combine queries, like Tableau and Power BI (through horizontal fusion), so it definitely warrants testing if you are working with large datasets.

Implementation guidance

  • If you have a choice, you can expect better performance (and potentially create fewer measure expressions) using filters rather than writing conditions directly in the measures.
  • There are some times when you still may wish to implement a Sum If-style calculation.
    • For self-service, creating precalculated, self-contained point-in-time measures makes drag-and-drop visualization creation easier.
    • Additionally, some visualizations contain multiple time periods, ex., MTD and Prior MTD, where a single filter may not easily generate the calculations you need, so the standalone measures are needed.
  • Creating binary calculations for the conditions, like Is 2015 in the example above, is a modular unit of logic that can be used in a Sum If or a Filter, for perfect consistency across visualizations and easier maintenance if you need to change the logic later. It's a good development practice.

Quickly swinging by other BI platforms

In Power BI, using CALCULATE and FILTER for the time period in a measure calculation is like applying the WHERE clause, before the Sum is calculated over the data set. You can also apply a filter just to the chart, like Tableau, but the ultimate query is the same, regardless of the approach.

In Qlik, Set Analysis in an expression measure is like applying a filter to the data model before the expression is calculated, so you can think of this as working like the WHERE clause as well, from a calculation standpoint.

I have noticed in ThoughtSpot's generated SQL that in the case of a Sum If measure, it also automatically generates the same condition in the WHERE clause. I guess they have determined that this approach is likely to be faster than the Sum Ifs alone.

Contact Form

Name

Email *

Message *