When creating the relationship between a fact table and a dimension in Power BI, a many-to-many relationship warning was triggered, to my surprise. I expected the key in DimProduct, ProductID, to be unique, but somewhere along the way my transformations had introduced multiple rows per ProductID value.
DimProduct was created by joining six tables together -- the source Product table load followed by five merges (joins) of other tables. Which step was inflating my row count? Which table had a granularity I misunderstood? This could be a lot to unravel.
The data preview in Power Query for each query step can be helpful, but it is based on a limited number of rows (1,000). The only alternative is to base the preview on all rows, which slows every query preview to a crawl.
Fortunately, two simple M functions employed in basic, reusable queries enabled me to quickly and systematically debug the transformations, without leaving the Power Query interface. Here I'll review those functions and the specifics of how I debugged this join scenario.
Table.Profile
Table.Profile returns field-level aggregates with summary metadata about a target query: minimum, maximum, row count, null count, distinct count, and so on. This is useful when trying to understand the general properties of a data source -- whether fields are fully distinct, whether they contain null values that should be accounted for or replaced, etc. The function evaluates the full source, so its results are comprehensive, whereas you can never be totally confident in the field profile of a 1,000-row preview.
To generate the output above, from Power Query, click New Source, then Blank Query. In the formula bar at the top, type (including the equals sign):
= Table.Profile(#"DimProduct")
Rename the query to something more descriptive, like Table Profile, then right-click it and uncheck Enable load, so the results do not persist in the data model.
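Table.Profile returns several columns you may not need for a quick uniqueness check. A variation I sometimes use (a sketch, built on the same DimProduct query from this example) keeps only the columns that matter when hunting duplicates:

```
= Table.SelectColumns(
    Table.Profile(#"DimProduct"),
    {"Column", "Count", "DistinctCount", "NullCount"}
)
```

Any field where Count and DistinctCount differ contains duplicate values, and a candidate key with a nonzero NullCount deserves a look too.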
Table.RowCount
Table.RowCount returns the total number of rows (including duplicates) resulting from a target query. This information is also available in the "Count" column of the Table Profile query results above, but that profile can take some time to run on large datasets, and a row count alone is sometimes sufficient to answer basic questions -- particularly in my accidental many-to-many situation.
To generate the output above, from Power Query, click New Source, then Blank Query. In the formula bar at the top, type (including the equals sign):
= Table.RowCount(#"DimProduct")
Follow the same instructions from above to rename as Table RowCount and disable the load.
A variation you may sometimes wish to use is to find the number of distinct rows in a target query. Here's the syntax:
= Table.RowCount(Table.Distinct(#"DimProduct"))
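When checking for accidental duplication, I often want both numbers side by side. A record expression in a blank query does this in one step (a sketch against the same DimProduct query; fields in an M record can reference each other):

```
= [
    TotalRows     = Table.RowCount(#"DimProduct"),
    DistinctRows  = Table.RowCount(Table.Distinct(#"DimProduct")),
    HasDuplicates = TotalRows <> DistinctRows
]
```

If HasDuplicates is true, some transformation is multiplying rows.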
How I used the functions to debug where extra rows originated
- Run Table Profile against DimProduct, just to confirm the many-to-many. Sure enough, there were more rows in DimProduct than there were distinct values in ProductID. Darn.
- Run Table RowCount against the source Product table -- the first table used in building DimProduct. ProductID was fully distinct in this table, and it had fewer rows than the DimProduct table I created. Now I confirmed it was not a source data issue and was instead something I introduced.
- Duplicate the DimProduct query (right-click the query name, then Duplicate), so I could freely delete steps from the copy while debugging, without worrying about losing my original work. Rename the copy to DimProductDebug.
- Starting from the end of the DimProductDebug query, delete all query steps from the last merge through the end (right-click final merge step, then Delete Until End). Run Table RowCount against DimProductDebug. Has the resulting row count changed -- decreased back to the row count of the source Product table?
- If the row count is still too high, the merge that was deleted was not the culprit, so repeat the process. Starting from the end of the DimProductDebug query, delete all query steps from the new final merge through the end, then recheck the DimProductDebug row count.
- Once the problem merge that created rows is identified, run the Table Profile query against the table that was joined in that step, to better understand its grain and inform how it should be joined.
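The same functions also give a quick pre-merge grain check: if the join key is unique in the table you are about to merge, a left outer merge cannot increase your row count. A sketch, where SupplierPrices is a hypothetical table about to be merged into DimProduct on ProductID:

```
// true when SupplierPrices has exactly one row per ProductID,
// i.e. merging it into DimProduct cannot multiply rows
= Table.RowCount(
      Table.Distinct(Table.SelectColumns(#"SupplierPrices", {"ProductID"}))
  ) = Table.RowCount(#"SupplierPrices")
```

Running this against each table before merging it would have caught my granularity misunderstanding up front.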
Other quality and profiling ideas in Power Query
- Field profile, a slimmed down version of Table Profile, run against only one field (with less overhead)
- Field distribution, because the quality/distribution data built into Power Query is based on just the first 1000 rows
- Relationship profile to determine the distinctness of keys and integrity of a relationship before you build it in the model view or start using it to drive visualizations
- Query compare, to ensure the results of two queries are the same -- for example, after rewriting a query to optimize performance, you would want to confirm you got the same results
- Regression queries specific to your Dataset to rerun each time you make changes, and ensure the results have not unexpectedly changed, or to compare data in the source to data in the Dataset
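For the query compare idea, anti-joins from the same function family make a workable sketch. OldQuery and NewQuery are hypothetical names for the original and rewritten queries, compared here only on their distinct ProductID sets:

```
let
    OldKeys   = Table.Distinct(Table.SelectColumns(#"OldQuery", {"ProductID"})),
    NewKeys   = Table.Distinct(Table.SelectColumns(#"NewQuery", {"ProductID"})),
    // rows present in one query's key set but not the other's
    OnlyInOld = Table.RowCount(Table.NestedJoin(OldKeys, {"ProductID"}, NewKeys, {"ProductID"}, "Match", JoinKind.LeftAnti)),
    OnlyInNew = Table.RowCount(Table.NestedJoin(NewKeys, {"ProductID"}, OldKeys, {"ProductID"}, "Match", JoinKind.LeftAnti))
in
    [OnlyInOld = OnlyInOld, OnlyInNew = OnlyInNew, KeysMatch = OnlyInOld = 0 and OnlyInNew = 0]
```

A full compare would extend the key list to every column you expect to match.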