Data Privacy and the Formula Firewall in Power BI / Power Query

Have you ever received any Errors similar to these?

Formula.Firewall: Query ‘Merge1’ (step ‘Source’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

Formula.Firewall: Query ‘Query1’ (step ‘Added Custom’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.

This is the blog post where I’ll cover this specific topic to answer 3 questions:

  • WHY does it even exist
  • What triggers this error
  • How to better spot the cause of the error

Before we start…let’s define one concept / feature:

What is Data Privacy in Power Query / Power BI?

Whenever you connect to any data source, you need to define 2 things for that connection:

  • Authentication Method – it can be implicit (when you connect to a file in your computer) or explicit (a new authentication window will appear for the options available)

  • Privacy Level – this is usually an optional window, but you can select from None (default state), Public, Organizational or Private. You can see this by going to ‘Data Source Settings’ –> ‘Edit Permissions’

The part that we want to focus on right now is the Privacy Levels and here is a simple table that will break down the options that we have available:

Setting Description
Private A Private data source is completely isolated from other data sources. Contains sensitive or confidential information.
Organizational An Organizational data source is isolated from all Public data sources, but it is visible to other Organizational data sources
Public A Public data source gives everyone visibility to the data contained in the data source

Here’s a brief summary that Ehren from the Power Query team sent me to better understand this:

  • Public data can be sent to Public/Organizational/Private sources.
  • Organizational data can be sent to Organizational/Private sources (but not to Public sources).
  • Private data can’t be sent anywhere (even to other Private sources).

Why does this Formula.Firewall error even exist?

Data Privacy levels act as a firewall or “gate”. It’s a protection mechanism for your Data Privacy.

The Firewall exists to prevent data from one source from being unintentionally sent to another source.

In an ideal world, you’d mark all of your data sources correctly so that this system can protect you from sending crucial and sensitive information to a data source that should not know about it.

Data Leakage is a really bad thing and we want to avoid it, hence why we have this “defense” system in place which is the Data Privacy or the Formula Firewall system.

How can my data from one data source end up in a completely different data source??? That, my friends, it’s because of something called Query Folding.

In short, query folding is the process where the M code gets translated into the native query language of a data source.

At times you can even check if Query Folding is happening by right clicking on a step of your query and selecting the option for “Native Query”:

This is done for performance purposes and to push as much work as possible to the data source instead of loading ALL of the data to your PC and working locally because of it.

Please note that even if the View Native Query button is greyed out, Query folding might still be happening. For example, if you connect to an OData source, you won’t be able to see the “View Native Query”, but most of the actions will be “folded”.

Query folding and Lazy evaluation will be a topic that I’ll cover in a future blog post. For now, Ehren’s explanation on how Query Folding impacts the Formula Firewall is perfect enough:

As part of folding, PQ sometimes may determine that the most efficient way to execute a given mashup is to take data from one source and pass it to another. For example, if you’re joining a small CSV file to a huge SQL table, you probably don’t want PQ to read the CSV file, read the entire SQL table, and then join them together on your local computer. You probably want PQ to inline the CSV data into a SQL statement and ask the SQL database to perform the join.

This is the culprit of it all and the main reason why there’s no “one size fits all” approach.

It’s because query folding might behave differently depending on your query steps and you don’t want to get rid of query folding because you want your queries to take advantage of it whereas possible.

Ehren from the Data Integrations team (the guys behind Power Query and the modern Get Data Experience) started a new discussion that goes in depth as to what happens inside the firewall process. If you’re an advanced user interested in learning the ins and outs of what goes behind the scenes with the formula firewall, then this is a discussion and article for you. Check it out here.

The post here, that you’re reading right now, is more of a summary so you can better understand what this Formula Firewall and Data Privacy Levels are.

What’s the benefit of the Data Privacy Levels?

Before we jump right into the errors, this feature (the Formula Firewall) doesn’t get that much credit because is not as exposed as other features.

Why would I want to go out of my way to set the privacy level of each of my sources?

Let me give you an example on the benefits of the Formula Firewall (without any errors):

  • We have 1 CSV file that is stored in SharePoint Online – this file contains a value that we would like to use as a Parameter. It’s in the header of the one of the sheets and it basically tells us what Territory that file belongs to and what data we need to get from the database. This is how that query looks like:

  • We have a SQL Server database with MILLIONS of records – we don’t want to load the millions of records, but rather just the data that we’re interested in using that parameter from the CSV file. This is how that query looks like:

THIS is where query folding is extremely important. You DO NOT want to load all of the data and only then filter it.

You only want to fetch the data already filtered from the database and that’s why query folding exists.

Here’s where we’re at. 2 Queries, but they don’t talk to each other yet:

The next thing that we need to do is simply do the filter on the Territory column of the “Sales SalesOrderHeader” query which comes from the database:

we use a placeholder like “Panama” and the modify the code to use the name of the query that has the argument that we want to use.

The result of that will look like this:

our final diagram view will look like this:

Let’s do some testing to see what’s going on

The first thing that we need to set up now is the Data Privacy Levels of each data source.

To do that, make sure that you’re in the Power Query Editor window and:

  1. Go to the Home Menu
  2. Click the Data Source Settings button
  3. Once the Data Source Settings window comes up, select the data source that you want to edit
  4. Click on Edit Permissions
  5. On the Edit Permissions window, select the privacy level.

In this case, I’m setting up both data sources to have an Organizational Privacy Level.

When I go back to the Sales SalesOrderHeader query, I can right click on the last step and see that the ‘View Native Query’ option is enabled and when I click on it I see this:

Check the text highlighted in yellow! This is essentially the proof from Power Query that we are not downloading the millions of records that this database has, but we’re only getting the data for France which is the value from the ‘SalesTerritory’ query that comes from a CSV. This is how Query Folding works between multiple data sources.

Now, what happens if we set up the database to have a private privacy level?

well, what happens is that the ‘View Native Query’ option will be greyed out and what’s going on behind the scenes is that query folding is being performed only based on the transformations that you’re doing for the steps prior to the one that involves the value from the CSV / SharePoint Query.

Query folding will only happen for what goes within that specific Data Source (SQL Server database) and when it’s time to mashup that data with another data source, query folding will stop.

You’re fundamentally working with a local copy of your database and only then filtering the data using the value in the CSV from SharePoint.

At the same time, this is a major benefit if you were trying to prevent sensitive data to be sent to this database instead of just a simple parameter.

What triggers the Formula.Firewall error?

In most cases, it’s the fact that a single query has 2 or more data sources in it that are trying to work together, but they either have incompatible privacy levels OR undefined privacy levels.

In a more technical sense, Query Folding is the culprit of it all, because it tries to mash up different data sources for better performance, but, in order to better understand the situations and how to resolve it, we need to know what’s causing the issue. What is triggering this Formula.Firewall.

Error on ‘Query is accessing data sources that have privacy levels which cannot be used together’

Formula.Firewall: Query ‘Merge1’ (step ‘Source’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

The error message is directly referring to the table previously shown in this post.

In short, what’s happening is that within the same query, you’re trying to access multiple data sources and they’ve been set up with incompatible privacy levels, meaning that you need to head over to the Data Source Settings window and set up your ALL your data sources in that query to the same privacy level; either ALL Public or ALL Organizational.

Once this is done you’ll notice that the issue will be completely gone.

Error on ‘Query references other queries or steps, so it may not directly access a data source’

This is probably the most common one, especially for the ones that use a Table from their current workbook inside of Power Query for Excel.

The legend himself (my friend) Ken Puls has written about this before here.

In his blog post, Ken explains how you can lay down your queries through stages such as:

  1. Import your Data and do most of the steps that could result in query folding in those “Staging” Queries
  2. Reference your staging queries and do the final transformations / shaping and load those to either Excel or your Data Model

That approach seems to work quite nicely but, why do we get that error?

Some people still experience the error even after reading that post and, like anyone else, they’d like to know why they keep hitting that error.

The answer: your query is what I’d like to call a “Chained” query.

What is a Chained query? well, this is term that I completely made up, but it’s my way of calling a query that has references that are “out of the scope” of the current query.

By “out of the scope”, I mean that you have multiple data sources in 1 query and one of these data sources has not been properly defined / determined / evaluated.

When does this usually happen? The common scenario where I’ve seen this happening is with queries that try to connect to either a web service or a REST API and where you need to evaluate one step (or query) and then use the result of that one for the next and you end up in this cycle / iteration / pagination. If you’re getting this error, please tell me more about your scenario in the comments section

What options do I have to fix this? You have 3 options:

  1. Ignore Privacy Levels – You’ve probably read this one before, but you can just tick the box to ignore the privacy levels. This will work locally, but not on the Power BI Web Service.
  2. Create a Power BI Custom Connector – this is BY FAR the best option as you can make sure that your queries can run in an optimal way and work seamlessly in the web service as well. Not to mention that you also have some features that are unique to custom connectors like being able to read the response headers of your calls and using setting your OAuth 2.0 flow. Sadly, this is only available for Power BI, but if you’re only going to work within Excel then the first option should be sufficient for your case
  3. Embedding or masking your Data Sources inside Functions – this method will make your data sources not visible to the Data Privacy Levels at first, but you can tune your queries to define the data source at first and then apply a function.

In regards to that last option (or method #3)

Back in the day (2015-2016), you would’ve to pull a MacGyver move in order to “add a new data source” inside of Power Query.

For example, if your SaaS had a REST API, but it didn’t have the connector in Power BI, you could use the Web.Contents function to grab the data using GET requests.

Of course, you’d need to know M code and read the API documentation to make it happen, but things would rapidly get complicated depending on things like the Authentication method, pagination and throttling.

Then after ALL your hard work creating those queries to get your data, you’d publish your report to the Power BI Service just to find out that your queries can’t be refreshed on the service:

This happened to me many times and while there are some workarounds, they just make me feel like I’m in the middle of a minefieldand I’m terrible at Minesweeper.

There are quite a bit of technicalities on why you would and should create a Custom Connector instead of trying to write the queries directly with Web.Contents and they’re mostly related to the Privacy Levels and the Formula Firewall.

If you’re ever considering connecting to a source that is not listed in the ‘Get Data’ window, I highly recommend that you create a Custom Connector for it and if you need help creating one for your company, then you can always contact me.

2 Comments

  1. Suffer from the second error. Exactly the same scenario as described “try to connect to either a web service or a REST API and where you need to evaluate one step (or query) and then use the result of that one for the next and you end up in this cycle / iteration / pagination. ”
    I need one column’s value of previous query whose source is gotten by REST request as parameter to do REST request again as source of the later query.

    why customer connector can detour the pain? And would you please share how to create that customer connector?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.