Premium Dataflow Gotchas (AKA things I struggled with) and a Hack

Below is a summary of my notes from the first time I used Dataflows with Enhanced Compute. It's mostly "gotchas" -- behavior I didn't expect or that I found frustrating -- but I also found an interesting performance hack.

I did later come across this post (datalineo: "Power BI dataflows: a Deep Dive into Incremental Refresh & the Enhanced Compute Engine"), which I wish I had found first, because it would have saved me some effort. It has tips for staging large data, including an explanation of linked vs. computed entities and how to make sure you are leveraging the Enhanced Compute Engine.

Important background: The Enhanced Compute Engine -- a Premium feature -- stores Dataflow entities to a SQL data store in addition to the default storage format. Loading from the Dataflow can then leverage the SQL engine for some transformations, just like some work can be pushed to a source database with query folding when you normally load data with Power Query. This can be faster than loading from a standard Dataflow, and even more so when it comes to operations like aggregation.
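To make that concrete, here is a minimal sketch in Power Query M of the kind of step the engine can push down. The entity and column names (AllHistory, ProductKey, SalesAmount) are hypothetical, not from my actual Dataflow.

    let
        // Reference a linked entity from another Dataflow (hypothetical name)
        Source = AllHistory,
        // An aggregation like this is the sort of work the Enhanced Compute
        // Engine can run in its SQL store instead of re-processing every row
        Aggregated = Table.Group(
            Source,
            {"ProductKey"},
            {{"TotalSales", each List.Sum([SalesAmount]), type number}}
        )
    in
        Aggregated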

My testing scenario specifically involved loading data into a Dataflow with Enhanced Compute, then loading from that into another Dataflow, trying to leverage the Enhanced Compute to speed up that second refresh.

Premium Dataflow gotchas

When loading from Dataflows with Enhanced Compute enabled, disabling load on the linked entity queries disables the Enhanced Compute Engine. In Power Query, I am in the habit of disabling load for queries I don't need to persist, so I did this to the linked queries (which I didn't need to persist as-is), and that broke folding for the queries loading from them. The lightning icon indicates that the Enhanced Compute Engine will be used.

L3M and APM (the last 2 queries) enabled: the All History query takes advantage of the Enhanced Compute Engine (lightning icon)

L3M and APM (the last 2 queries) disabled: the All History query loses the lightning icon

Even when a query step will fold to the Enhanced Compute Engine during a Dataflow refresh, Power Query Online will not report the step as foldable. As a bonus frustration, when Power Query Online doesn't detect that a step will fold, the preview for that step doesn't attempt to fold either, so developing and clicking through step previews can be very slow.


In combination, these made it difficult to tell what would fold, what actually folded, and why things didn't fold (if I could even tell that happened at all).

Dataflow Incremental Refresh requires datetime fields, specifically. Date alone will not work. It is very easy to set up, though -- easier than in a Semantic Model.
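If your source only has a date column, one workaround is to add a datetime copy for incremental refresh to filter on. A minimal sketch in Power Query M, with hypothetical names (SalesStaging, OrderDate):

    let
        Source = SalesStaging,
        // Add a datetime version of the date column so Dataflow incremental
        // refresh has a datetime field to filter and partition on
        WithDateTime = Table.AddColumn(
            Source,
            "OrderDateTime",
            each DateTime.From([OrderDate]),
            type datetime
        )
    in
        WithDateTime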

Speeding up a slow Premium Dataflow refresh

In testing the performance of loading CSVs compared to Parquet files into Dataflows, I was surprised by how quickly a filtered load from a source CSV finished. I had expected almost no difference from the full load, since the engine still has to scan every row, but the filtered load was significantly faster.
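For reference, the filtered test looked roughly like the sketch below (Power Query M). The file path, column name, and cutoff date are all hypothetical, and a real Dataflow would need a gateway or cloud file source rather than a local path.

    let
        // Load the raw CSV (hypothetical path)
        Source = Csv.Document(
            File.Contents("C:\data\sales_history.csv"),
            [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]
        ),
        Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        Typed = Table.TransformColumnTypes(Promoted, {{"OrderDate", type date}}),
        // The engine still scans every row of the file, but this filtered load
        // finished noticeably faster than pulling in the full file
        Filtered = Table.SelectRows(Typed, each [OrderDate] >= #date(2023, 1, 1))
    in
        Filtered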

This gave me the idea to see if I could speed up a full refresh CSV load by partitioning the Dataflow, resulting in several parallel loads. There aren't options for partitioning in Dataflows like there are for tables in Semantic Models, but I know partitioning happens automatically as a byproduct of configuring incremental refresh, a Premium Dataflow feature.

So, I set up incremental refresh to store the last 2 years and refresh the last 2 years, which effectively amounts to a full refresh. Because the policy is defined in years, though, the Dataflow is partitioned into yearly partitions that load in parallel. This scenario refreshed in about a third less time than a normal full refresh of the same file.

Note that this does use more concurrent threads, which are limited by your capacity, so it's not without consequence. I also confirmed this works against databases, but keep in mind the extra workload the parallel queries put on the source database (which wasn't a concern when loading from a CSV).

