Incremental refresh for files in a Folder or SharePoint – Power BI

A few years ago I posted my first impressions on Power BI Incremental refresh that was only made available for Power BI Premium licenses.

The whole idea of incremental refresh is superb:

  • You only refresh the data that you need – never again a FULL refresh needs to take place
  • It makes your data source work less – so you won’t be hitting your data source with expensive queries that take a long time to execute and load
  • Highway to insights! – you
    don’t have to wait those long hours just for your dataset to refresh

For the incremental to work as intended your data source has, to a certain degree, be able to handle query folding and most structured data sources like databases and OData are able to handle this with ease. How about files from a Folder either hosted locally or on SharePoint?

Note: Before you continue reading, I just wanted to say that this is an advanced topic on not only Power BI, but also on Power Query. If you or your company requires help setting this up, you can contact us at info@poweredsolutions.co for a quote on any type of assistance setting you up with the incremental refresh that works best for you.

The Scenario: Files in a SharePoint Folder

Imagine that we have Folder that has subfolders. One subfolder for every year and inside of those yearly folders we have the data for each of those years. It looks like this:

What we want to do is connect to this folder and then set up an incremental refresh logic, so we only refresh the data for the current year or, in other words, only get new data and keep the data from previous years to be intact and not be in the refresh process as it adds way too much processing / refresh time.

That means that while we’re in January of 2020, or any months in the year 2020 for that matter, we’ll only be refreshing the data for 2020. Once we get to the year 2021, we’ll start only refreshing the data for 2021 and the data for the year 2020 will be untouched.

In our case, we have Excel files in those folders, but we have SO many that it takes too much to refresh all of them.

The solution: taking advantage of Query Folding and Lazy Evaluation

Note: I highly recommend that you read these few articles if you intend to implement this on your own so you can understand what I’m doing:

What I need to do is quite straightforward. I need to use a Data Source function as a base and then create my custom function that will have the correct logic to drive the Incremental refresh with the RangeStart and RangeEnd parameters that the Power BI Incremental Refresh process requires.

We have 2 options in terms of Data Source functions. We can either use the “From Folder” (Folder.Files) data source function for local files or we can use the “SharePoint Files” (SharePoint.Files) for files hosted on SharePoint:

For this example, I’ll be using the SharePoint Folder connector / data source as I’m able to track what is going behind the scenes and if the incremental refresh is working correctly or not.

Step 1: Create the Custom Function

This is the custom function that I’ve created, but the part drives the magic is actually the Filtered Rows step:

#”Filtered Rows” = Table.SelectRows(Source, each List.Contains( List.Transform( Years, (x)=> Text.Contains( [Folder Path],x) ), true))

It is also super important that the names of the Folders are correctly entered and that the setup that we have for the naming convention of the folders is taken into consideration.

You could change this to your specific needs, and I’ve played with sub folders for the months and created other scenarios where there is an incremental refresh logic specifically for the months.

Again, you can customize this to your needs and all you need to know is a bit of the M language.

Step 2: Create the Parameters for Incremental Refresh

This part is quite straightforward. You just need to go through the Parameters window to add new parameters as I described in my first impressions article for the incremental refresh on Power BI.

Step 3: Using your new Custom Function

Now you just need to invoke that custom function. In my case my Power Query window looks like this:

For the sake of testing, just to make sure that we’re only getting the data that we want, let’s change the RangeEnd parameter value to be 2/1/2019 (February 1st, 2019). Now go into the Tools Menu (enable it from the preview features if you don’t have it yet) and select the option for Diagnose Step:

Once you click on it, refresh the query which will create a new set of queries that you’ll see in the queries pane on the left:

Go to the one that reads “Detailed” in its name. In there, go to the “Data Source Query” field and you’ll see all the calls that were made to the SharePoint API. In my case, I only see that data for the year 2018 and 2019 was requested from the API. Not the one from the year 2020 which is exactly the intended behavior that I was looking for:

For the sake of cleansing, let’s delete all of these new queries that were created by the diagnostics tool and also change the RangeEnd to be today’s date. In my case, I’m writing this blog post on the 18th of January. To me, it looks like this:

And all you have to do now is just hit “Close and Load” to load your data into your Data Model. This initial refresh will take a while, but we’re just a few steps from having incremental refresh for our files.

Step 4: Set up the Incremental Refresh Policy

Now that our data is in our Data Model, we need to set up the incremental refresh for that specific table / query. Right click on the query and select the incremental refresh option:

Once inside the incremental refresh window, I need to set it up to store only the data for the last 2 years and then only refresh the rows in the last (current) year.

You can absolutely change this and customize it to your needs, but just make sure that you also customize your function and your setup on your folders.

At this point, you’ve finalized everything that has to happen on the Power BI Desktop and you need to now publish your file to the service.

You can follow the steps found on the official Power BI documentation here (url).

Testing our Incremental refresh

Once the file has been published, go ahead and do the initial refresh of the file. I used a pretty small dataset so the refresh took just a few seconds:

The way that I can verify that everything is working as intended is by using the XMLA endpoint for this workspace/dataset and see how the partitions were created.

Checking from SQL Server Management Studio, I can see that all 3 partitions were created:

Don’t be fooled by the number of rows. I’m using the same files in each folder, but the column that we should be looking at is the  “Last Processed”.  That’s how the partitions look after their initial creation.

Now, I waited a few minutes and then refreshsed the dataset a few more times and check how the partitions look now:

IT WORKS!

Check how the values for the partitions on year 2018 and 2019 are exactly the same, but the timestamp for the 2020 partition has changed.

Final thoughts

This pattern can be customized either for a any repository of files such as SharePoint or a local folder.

If you need professional help setting this up, you can contact us at info@poweredsolutions.co any time.

We’ll help you create an Incremental Refresh Policy that fits your needs to almost any data source.

Posted in:

15 Comments

  1. Hi Miguel,

    Query folding is not available for folder data sources (local, SP etc.)
    Microsoft warns that incremental refresh without query folding can result in worse performance.
    Your example demonstrates how to use incremental refresh against a SP folder and confirms that only the “incremental” data is updated in the model. However without query folding it is still possible (likely ?) that _all_ data is being pulled from the source and then partitioned locally on the server, before updating only the recent partition.
    As all data is still being pulled the refresh itself _might_ still run slowly or consume a lot of resource.
    Have you actually confirmed that the refresh performance is improved, not just that only the recent data is updated ?

    This
    https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/15387612-refresh-only-new-data-or-modified-file-in-folder

    • Hey Andrew,

      I highly encourage you to give this scenario a try against SharePoint online so you can test things out. Lazy evaluations are key here, so as long as you know to take advantage of those, query folding is possible with Local Files and the SharePoint API.
      You can also use Query diagnostics to see how things are handled in Power BI Desktop and how Query folding is occurring.
      Furthermore, The SharePoint connector does have query folding enabled by default as it’s an Odata source – some row level filters are not fully supported, but you could create a custom connector to take advantage of the record metadata such as date modified, date accessed and date created to drive an incremental refresh solution from a SharePoint site. How cool is that?

      TLDR; query folding and lazy evaluation are present in the scenario described in this post 🙂 test it out

    • ah! In regards to the performance, it is quite a noticeable improvement when you’re dealing with dozens of large files as it only refreshes the files that need to be refreshed and not all of them.
      The only thing that we can’t take advantage of for Incremental Refresh with files is the usage of a “date modified” field, since we could have that field at the file level, but not at the actual data level.

  2. What are the limits for incremental refresh – for example, I assume incremental refresh only works when the data or transformations for each segment do not cross the file boundaries? If we are tracking lease payments and the files have monthly data, we need to know a start and end date for the lease. If we do a Group By or some other transformation, do we still have to do a full data load so that the proper dates are recorded?

  3. Miguel, What abt your forthcoming book. You said earlier that it be out by march 2020. Now march is just round the corner. Is the book finally getting out or still there will be some delay or the idea has been shelved completely. Please do reply.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.