Qlik developer primer for what's different in Power BI: Power Query (part 2/2)

Example of query organization in Power Query

In these posts I introduce the major concepts of Power Query and how it differs from the Qlik load script, for new Power BI developers who are familiar with Qlik, or for Qlik developers who just want to know how other platforms do things differently. See part 1 here for introduction and context.

Queries

  • A Qlik load script is one long, serially executed script, while a Dataset in Power BI is built from many distinct, individual queries. The order in which the queries are listed in Power Query does not convey anything about the order in which they will be executed, unlike the tabs of a Qlik script. Power Query will figure out its own sequence, including potentially starting multiple queries in parallel.
  • Each query is made up of many steps, run sequentially. A step could be anything from dropping fields to changing a data type to executing a join (AKA Merge). Steps may be combined into a single query when executed, but that’s generally outside of your control. This is different from Qlik, where you designate all the transformations to include in each discrete load, like you would in a SQL statement.

  • A query is a unit of logic, not a unit of work
    • A query is more like a database view than a table. You can reference a query many times as a data source, but each time it will rerun the query logic. It will not run the query once and reuse the results. You can't easily bank the results of a query to speed up subsequent steps. This is a big deal if you are used to building and then querying a table, which is a frequent occurrence when transforming data in Qlik, ex., generating inventory snapshots, or literally any resident load. The upside is that you do not need to manage a potentially tangled web of dependencies, which you can create in Qlik, and it's easier for queries to run in parallel because the tables you're building are not interdependent.
    • Reusing a query's result is a use case for Dataflow, which is a different process and interface than Power Query. It's a deeper topic for another time.
  • Individual queries can be run on-demand in the Power Query interface without refreshing the whole Dataset, kind of like running queries from a SQL client. If your query is dependent on upstream queries, they will automatically run as well. This is different from Qlik, where the main option is to stop a load script early, and it always starts from the beginning.
    • This is so handy for one-off questions when trying to understand source data and transformation results. For this reason I find Power Query well-suited for data profiling and exploration, especially with SQL data sources, because the data and most of the work remain in the source database via query folding. I wrote about this here and here.
    • A query can itself be queried to learn about its results. However, you can't, for example, query a query as of its 4th step out of 7 total steps. So, segmenting a many-step query into multiple smaller queries that reference one another (see the sketch after this list) is an easy way to organize different phases of transformations, ex., 1) source table and filter query, 2) join other tables and derive new fields query, 3) aggregate results query. Alternately you can duplicate a query, delete the steps you don't want to run, and query that.
  • If your query is dependent on other queries, Power Query automatically stitches them together at execution time, as though all of the steps were combined in a single query. You can preview these dependencies under the View menu, Query Dependencies.
  • You can use wizards to traverse semi-structured data (JSON, XML) and even databases natively in the Power Query interface. Each step in the navigation is just another step in the query.
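
To make that query-referencing pattern concrete, here is a minimal sketch in M of splitting one long transformation into two queries, where the second query references the first by name. The query name SalesSource and the server, table, and column names are hypothetical, not from any real source.

    // Query 1: "SalesSource" -- connect to the source table and filter it
    let
        Source = Sql.Database("myserver", "mydb"),
        Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
        ActiveRows = Table.SelectRows(Sales, each [IsDeleted] = false)
    in
        ActiveRows

    // Query 2: starts from "SalesSource" and carries on with derivation and aggregation
    let
        Source = SalesSource,
        AddedMonth = Table.AddColumn(Source, "OrderMonth", each Date.StartOfMonth([OrderDate])),
        Grouped = Table.Group(AddedMonth, {"OrderMonth"}, {{"TotalAmount", each List.Sum([Amount])}})
    in
        Grouped

Remember that Query 2 does not reuse a stored result from Query 1; every run of Query 2 re-executes Query 1's logic as its starting point.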

 

M language

  • M, the language of Power Query, is not SQL-like, the way that Qlik load script is. In fact, it is complicated as heck, in my humble opinion. Knowing SQL is barely relevant. Wizards will generate much of the M code you need, but once you need to do something outside of a wizard, it is exponentially more difficult and complex.
  • When completely wizard-driven, you just read the generated steps in the code from top to bottom, each step creating a variable that references the previous step's results, as in the sketch after this list. (You can divert from this once you get into the advanced stuff.)

  • The Advanced Editor (how you view and edit the raw M code) is not user-friendly. It doesn't even have find and replace, so be prepared to copy code out of Power Query into a text editor and then back in. By comparison, the Qlik load script editor is significantly better; I love the text searches there that show the match counts on each tab.
  • I think the biggest advantage of M is that Power Query will automatically convert what M it can to SQL (for compliant data sources) and push that work to the data source, through a process called query folding. You could even switch your connection from using a temporary file export for prototyping to a SQL database table with the same layout, not touch the M code, and Power Query will automatically start folding some of the code you originally wrote for the file.
  • M is case-sensitive, including all functions. For example, typing NULL throws an error because it is expecting lower case null.
  • Types in M are weird. I don't mean data types, but the fact that some functions and transformations can return data in the form of a table, record, list, or value, any of which may be shown in a single cell of a table, like semi-structured data.
  • There's a good, detailed primer on M here: Power Query M Primer (Ben Gribaudo)
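
As a small illustration of that top-to-bottom style, here is a hedged sketch of the kind of M the wizards generate; the server, database, table, and column names are made up.

    let
        // each step is a variable that feeds the next step
        Source = Sql.Database("myserver", "mydb"),
        Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
        ChangedType = Table.TransformColumnTypes(Orders, {{"OrderDate", type date}}),
        // note that null, type date, and the function names are all case-sensitive
        ReplacedNulls = Table.ReplaceValue(ChangedType, null, 0, Replacer.ReplaceValue, {"Freight"})
    in
        ReplacedNulls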

Query folding

  • The way to speed things up in Power BI is to do as much as possible before the data gets to Power BI, through query folding. Query folding is M code being translated to SQL and running natively against the source database. This is a major shift from Qlik, where the focus is generally to get the atomic data to QVDs in the Qlik environment, after which you can transform that data locally, efficiently.
  • Folding can only happen for data sources that can be queried with SQL. It happens automatically for as many steps at the beginning of a query as Power Query determines it can translate to SQL, after which the data is brought into Power BI and the remaining logic is processed in memory.
  • There is no Power Query equivalent of learning the nuances of optimized QVD loads to hack enormous throughput, but query folding can make a huge difference and happens automatically for well-written, thoroughly considered code. By comparison, you can get pretty amazing throughput from Qlik with large data volumes in QVDs, but it does add more dependencies, including the additional orchestration of getting the granular data to the Qlik environment in the first place.
  • How much is folded depends on which transformations you do and in what order you do them. Prioritize steps that reduce data volume, particularly filters, joins, and aggregations, ahead of steps that may break folding.
  • Not everything that you expect or want to fold does, ex., changing data types. Just because you know how you would write the SQL to do something doesn't mean that Power Query will figure out how to write the SQL. 
  • The only way to be certain whether a query step is folding is clunky: right-click each query step and check whether View Native Query is enabled in the context menu. For steps that do fold, selecting that option will show you the native SQL it will run against the source.
  • Folding can be idiosyncratic. For example, inner joining a query to reduce the data will show as a step that cannot be folded. Expanding at least one field that was joined will re-enable the folding, even if you subsequently remove the fields you added. Put another way: general BI knowledge does not usually transfer to tuning Power Query.
  • Nuances in a query, even when foldable, can generate more complex, nested SQL, especially if steps depend on previous steps. For example, if a line cost is derived in one query step, calculating the line margin in the next step from that derived line cost will create an additional nested SQL statement. Calculating the line margin from the building blocks of the line cost instead will add the business rule to the existing SQL statement without another nested layer, as sketched below.
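
A sketch of that last point, assuming a hypothetical dbo.OrderLines table with Quantity, UnitCost, and UnitPrice columns:

    let
        Source   = Sql.Database("myserver", "mydb"),
        Lines    = Source{[Schema = "dbo", Item = "OrderLines"]}[Data],
        WithCost = Table.AddColumn(Lines, "LineCost", each [Quantity] * [UnitCost]),

        // Option A: build the margin on top of the derived LineCost column;
        // when folded, this tends to produce an extra nested SELECT.
        // WithMargin = Table.AddColumn(WithCost, "LineMargin",
        //                  each ([Quantity] * [UnitPrice]) - [LineCost])

        // Option B: repeat the cost logic using the original columns, so both
        // derived columns can fold into the same SELECT without another layer.
        WithMargin = Table.AddColumn(WithCost, "LineMargin",
                         each ([Quantity] * [UnitPrice]) - ([Quantity] * [UnitCost]))
    in
        WithMargin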

Reloading data

  • During a Dataset refresh, Power Query figures out dependencies and attempts to run many queries in parallel, automatically. Qlik always runs a script serially. Other than creating more applications, there's no way to parallelize Qlik reload work.
  • Basic incremental refresh (insert only) is wizard-supported for rolling periods (see the sketch after this list). It's a little clunky because you must build it in the desktop, deploy it to the service, and then can never download it back again. This is handy but not always a great fit, especially if the data you are using goes through several transformations, like aggregation. In Qlik you can write any kind of incremental update supported by the source, including deletions, with any transformations, but it must be coded manually. (You can find examples where others have done this for you, like Rob Wunderlich's QlikView Components.)
  • Native Power BI scheduling options are limited to simply running on a schedule. More complex scenarios that you might hand-code in Qlik, ex., incremental loads each day and a full refresh every Sunday, would require the XMLA endpoint or some other workaround to accomplish. In Qlik, different triggers can be added to the same task, or the Qlik load script itself can execute different code on different days.
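
For reference, the wizard-driven incremental refresh works by filtering the source on the reserved RangeStart and RangeEnd datetime parameters, which the refresh policy substitutes at run time. A minimal sketch of that filter step, with a hypothetical Orders table and OrderDate column:

    let
        Source  = Sql.Database("myserver", "mydb"),
        Orders  = Source{[Schema = "dbo", Item = "Orders"]}[Data],
        // filter with >= and < so rows on the boundary are not loaded twice
        InRange = Table.SelectRows(Orders, each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
    in
        InRange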

And that's all you need to learn about Power Query, except how to use it.
