A few years ago I posted my first impressions of Power BI Incremental refresh, which at the time was only available for Power BI Premium licenses.
The whole idea of incremental refresh is superb:
- You only refresh the data that you need – a full refresh never needs to take place again
- It makes your data source work less – you won’t be hitting your data source with expensive queries that take a long time to execute and load
- Highway to insights! – you don’t have to wait long hours just for your dataset to refresh
For the incremental refresh to work as intended, your data source has to be able, to a certain degree, to handle query folding. Most structured data sources, like databases and OData feeds, handle this with ease. But how about files from a folder, either hosted locally or on SharePoint?
Note: Before you continue reading, I just wanted to say that this is an advanced topic on not only Power BI, but also on Power Query. If you or your company requires help setting this up, you can contact us at firstname.lastname@example.org for a quote on any type of assistance setting you up with the incremental refresh that works best for you.
The Scenario: Files in a SharePoint Folder
Imagine that we have a folder with subfolders: one subfolder for every year, and inside each of those yearly folders, the data for that year. It looks like this:
What we want to do is connect to this folder and then set up incremental refresh logic so that we only refresh the data for the current year. In other words, we only get new data; the data from previous years stays intact and out of the refresh process, since including it adds way too much processing / refresh time.
That means that while we’re in January of 2020, or any month of 2020 for that matter, we’ll only be refreshing the data for 2020. Once we get to 2021, we’ll start refreshing only the data for 2021, and the data for 2020 will be left untouched.
In our case, we have Excel files in those folders, but we have so many that it takes too long to refresh all of them.
The solution: taking advantage of Query Folding and Lazy Evaluation
Note: I highly recommend that you read these few articles if you intend to implement this on your own so you can understand what I’m doing:
What I need to do is quite straightforward: use a Data Source function as a base, and then create a custom function with the logic that drives the incremental refresh through the RangeStart and RangeEnd parameters that the Power BI Incremental Refresh process requires.
We have 2 options in terms of Data Source functions. We can either use the “From Folder” (Folder.Files) data source function for local files or we can use the “SharePoint Files” (SharePoint.Files) for files hosted on SharePoint:
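Both functions return the same shape of table (one row per file, with Content, Name, and Folder Path columns), so the downstream logic is identical either way. A minimal sketch, with a hypothetical local path and SharePoint site URL:

```m
let
    // Hypothetical local folder path
    LocalSource = Folder.Files("C:\Data\YearlyFiles"),
    // Hypothetical SharePoint site URL; SharePoint.Files enumerates
    // every file on the site, with Name, Content and Folder Path columns
    SharePointSource = SharePoint.Files(
        "https://contoso.sharepoint.com/sites/Sales",
        [ApiVersion = 15]
    )
in
    SharePointSource
```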
For this example, I’ll be using the SharePoint Folder connector / data source, as it lets me track what is going on behind the scenes and whether the incremental refresh is working correctly or not.
Step 1: Create the Custom Function
This is the custom function that I’ve created, but the part that drives the magic is actually the Filtered Rows step:
```m
#"Filtered Rows" = Table.SelectRows(Source, each List.Contains(List.Transform(Years, (x) => Text.Contains([Folder Path], x)), true))
```
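To put that step in context, here is one way the whole custom function could look. The site URL is hypothetical, and deriving `Years` from the parameters is my own sketch of the approach; the key idea is that the list of year-folder names is computed from RangeStart and RangeEnd and then matched against the Folder Path column:

```m
// Sketch of a custom function; RangeStart and RangeEnd are the datetime
// parameters that Power BI's incremental refresh process supplies
(RangeStart as datetime, RangeEnd as datetime) =>
let
    // Build the list of year-folder names covered by the range,
    // e.g. {"2018", "2019"} for a range from 1/1/2018 to 2/1/2019
    Years = List.Transform(
        {Date.Year(RangeStart)..Date.Year(RangeEnd)},
        each Number.ToText(_)
    ),
    // Hypothetical SharePoint site holding the yearly folders
    Source = SharePoint.Files(
        "https://contoso.sharepoint.com/sites/Sales",
        [ApiVersion = 15]
    ),
    // Keep only files whose Folder Path mentions one of those years —
    // this is the step quoted above
    #"Filtered Rows" = Table.SelectRows(
        Source,
        each List.Contains(
            List.Transform(Years, (x) => Text.Contains([Folder Path], x)),
            true
        )
    )
in
    #"Filtered Rows"
```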
It is also super important that the names of the folders are entered correctly and that the naming convention we use for the folders is taken into consideration.
You could change this to your specific needs, and I’ve played with sub folders for the months and created other scenarios where there is an incremental refresh logic specifically for the months.
Again, you can customize this to your needs and all you need to know is a bit of the M language.
Step 2: Create the Parameters for Incremental Refresh
This part is quite straightforward. You just need to go through the Parameters window to add new parameters as I described in my first impressions article for the incremental refresh on Power BI.
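For reference, the incremental refresh feature requires two parameters of type Date/Time named exactly RangeStart and RangeEnd. When created through the Manage Parameters dialog, they end up as M queries along these lines (the initial values here are just placeholders of my own):

```m
// RangeStart — must be type DateTime and use this exact name
#datetime(2018, 1, 1, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]

// RangeEnd — same requirements
#datetime(2020, 1, 18, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]
```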
Step 3: Using your new Custom Function
Now you just need to invoke that custom function. In my case my Power Query window looks like this:
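Invoking the function is just a matter of passing the two parameters, so the final query can be as simple as this (assuming the custom function is named GetYearlyFiles; the name is mine):

```m
let
    // Pass the incremental refresh parameters straight into the function
    Source = GetYearlyFiles(RangeStart, RangeEnd)
in
    Source
```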
For the sake of testing, just to make sure that we’re only getting the data that we want, let’s change the RangeEnd parameter value to 2/1/2019 (February 1st, 2019). Now go into the Tools menu (enable it from the preview features if you don’t have it yet) and select the Diagnose Step option:
Once you click it, refresh the query; this will create a new set of queries that you’ll see in the queries pane on the left:
Go to the one with “Detailed” in its name. In there, go to the “Data Source Query” field and you’ll see all the calls that were made to the SharePoint API. In my case, only data for the years 2018 and 2019 was requested from the API, not the data for 2020, which is exactly the intended behavior I was looking for:
For the sake of cleanliness, let’s delete all of the new queries created by the diagnostics tool and also change RangeEnd back to today’s date. In my case, I’m writing this blog post on the 18th of January, so it looks like this:
And all you have to do now is just hit “Close and Load” to load your data into your Data Model. This initial refresh will take a while, but we’re just a few steps from having incremental refresh for our files.
Step 4: Set up the Incremental Refresh Policy
Now that our data is in our Data Model, we need to set up the incremental refresh for that specific table / query. Right click on the query and select the incremental refresh option:
Once inside the incremental refresh window, I need to set it up to store only the data for the last 2 years and then only refresh the rows in the last (current) year.
You can absolutely change this and customize it to your needs, but just make sure that you also customize your function and your setup on your folders.
At this point, you’ve finalized everything that has to happen on the Power BI Desktop and you need to now publish your file to the service.
You can follow the steps found on the official Power BI documentation here (url).
Testing our Incremental refresh
Once the file has been published, go ahead and do the initial refresh of the file. I used a pretty small dataset so the refresh took just a few seconds:
The way I can verify that everything is working as intended is by using the XMLA endpoint for this workspace/dataset and seeing how the partitions were created.
Checking from SQL Server Management Studio, I can see that all 3 partitions were created:
Don’t be fooled by the number of rows: I’m using the same files in each folder. The column we should be looking at is “Last Processed”. That’s how the partitions look after their initial creation.
Then I waited a few minutes, refreshed the dataset a few more times, and checked how the partitions look now:
Notice how the values for the 2018 and 2019 partitions are exactly the same, but the timestamp for the 2020 partition has changed.
This pattern can be customized for any repository of files, such as SharePoint or a local folder.
If you need professional help setting this up, you can contact us at email@example.com any time.
We’ll help you create an Incremental Refresh Policy that fits your needs to almost any data source.