Since it’s origins, Power Query / Power BI has had this feature called Group By and you can see it under the main menu and the Transform ribbon under the following icon:
is not a really descriptive icon. It doesn’t give you that much information other than something is dependent with something else (via that line).
What does Group by do? When should I use Group by?
In short, the Group By Operation inside Power BI / Power Query tries to do 2 things:
- Summarize your Data – you get your table summarized by only the columns that you select. This is amazing if you’re trying to get rid of duplicates or to check where you have duplicates.
- Provide Aggregations or Non aggregated Data – imagine these new columns that will provide aggregations such as the sum, max, min, average of a column and in some cases other columns that will not do any aggregation and will only the grouped rows as a table
You should use the Group By functionality any time that you need to do anything that has to do with grouping rows from a table based on the values that they have in their field/s.
Let’s go straight into real-world examples of when you might want to use this Group by feature and what it brings to the table
Be sure to click on the following button in order to download the sample file with also the solutions.Download sample file
1. Summarize Data
Original Dataset: We have data that looks more like a report with all of the fields rather than something that we would use inside a Power BI / PowerPivot Data Model.
Goal: Normalize our dataset and create a Customers Dimension Table for our Power BI Data Model. We would have a fact table with only the customer key and another table with all the fields for customers.
How to group rows with Power BI / Power Query for this ?
Here’s the step by step of what we need to do:
- Head over to the sheet 1 or, if using Power BI Desktop, connect to the table within the sheetname “1” from the sample workbook.
- Name this Query “Original”
- Reference the “Original” Query twice and name one of those references “Dim_Customers” and the other one “Fact_Sales”
Now that we have these 3 queries, the whole goal is to only load the “Dim_Customers” and the “Fact_Sales” to our Data Model.
In a more technical sense, we are dealing with what it’s called a denormalized table and we need to normalize it (reduce the redundancy of data) by basically moving most of those fields to a new table and only keeping 1 field that will act as the “key” for our customers. I just so happen to call that field “CustomerKey” to make it easier for this example, but in the real world it might be called something else.
Creating a Dimension table for Customers
Let’s work on that “Dim_Customers” query. In the original table you’ll see that I marked some columns with a yellow color. I did this because all of those fields are all referring to a single “object” or “element” and that is the customer.
Click on the Group By icon and then in the Group By window select the Advanced option. Then for the Group by fields select CustomerKey, Customer, Category, Group, Primary Contact as shown in the next picture:
The rest you can leave it as default.
The result will be a summarized table with no duplicates for our customer fields and a new column called “Count” which we can just remove. After removing that “Count” column, you’ll end up with your table exactly as you need it:
Normalizing our Fact Table
Our goal with this query is super simple. Let’s delete all of the fields that have anything to do with the newly created dimension table for customers, but keep the CustomerKey field so we can create the relationship between tables.
In a more visual way, let’s delete the fields highlighted in red in the picture below:
You simply select those fields in red (Customer, Category, Group, and Primary Contact) and then do a right click on either one of those columns and select the option that reads “Remove Columns”:
The result of that operation will give you a table that looks like this:
and with that you have your Fact_Sales table ready to be loaded to your Data Model.
Building our Data Model and creating the report
if you’re in Power BI Desktop, you can select your queries from the “Queries” pane and make sure that only the Fact_Sales and Dim_Customers load to your Data Model, but inside of Power Query for Excel you need to first load your queries as “connection only” and then load them to your Data Model.
The main key here is that you need both of those tables / queries that we just created in your Data Model and then inside of it you can create a relationship between those 2 tables using the CustomerKey field from both tables. You can simply drag one field from one table to the field of the other table using the Diagram view and the app will create the relationship for you. The end result will look like this:
With that out of the way, you can focus on just creating your report. In my case, I ended up creating this report inside of Excel which is basically a top 10 customers by order total from each Customer Group
The main takeaway here is that this principle can be used for any Dimension or any type of Normalization scenario that you can think of.
There is another valid way of doing this and that is by simply keeping the columns that you need and then remove the duplicates from those columns. Again, completely valid but its a matter of preference at that point.
Excel specialist turned into BI specialist using the latest tools from Microsoft for BI – Power BI. He is the co-author of ‘M is for Data Monkey’, blogger and also Youtuber of powerful Excel video Tricks.
He has been recognized as a Microsoft Most Valuable Professional (MVP), is a Microsoft Certified Professional (MCP – MCSA: BI Reporting), a Microsoft Certified Trainer (MCT), and is one of the international pioneers in Power Pivot, Power Query and Power BI.