New Web Scraping experience in Power BI / Power Query (Using CSS Selectors)

The latest version of Power BI Desktop came out with a new Web Connector specifically designed for Web Scraping Scenarios. in this blog post I’ll try to go deep into how this new experience works and how you can take advantage of it.

Before we move forward, you’re gonna need the latest version of Power BI Desktop (May 2018 for me) and also enable the Preview feature in the Options window:

image

My Scenario: Get Data From Amazon

I want to find out how many books are out there that have anything to do with Power Query. I want the reviews that go with them, the name of the authors, when they were released and, of course, the names of these books.

The best place to find this information is probably Amazon. So I went on Amazon and did a quick search using the keywords “power query”.

image

I then grabbed the url that it created for me, head over to Power BI Desktop, chose the “Web Connector”:

image

and went through the usual experience…except for one new button called “Extract Table Using Example”:

image

Clicking on it should take you to the new connector experience which looks like this:

image

On the upper side you get a preview of your webpage and underneath you can provide a sample of the data that you want to extract from that specific webpage. This is similar to the “Column From Example” experience that we know, but it works a little bit different under the hood.

I highly recommend that you expand this window to full screen mode so you can take the most advantage out of it.

Again, What I’d like to do is extract the following data from each book:

  • Title
  • Author(s)
  • Total Reviews
  • Average Rating
  • Release Date

To do that, all I need to do is simply go through the webpage and provide examples for each of those columns:

image

The values in the darker font are the ones that I entered, while the ones in a lighter font are the ones that Power Query was able to come up with following that specific Pattern.

All we need to do now is hit the OK button and then we’ll be taken to the Query Editor to further play with our data.

image

Here’s one thing that caught my eye. We have some “sponsored” data which are basically ads. I don’t want any ads in my dataset, so how can I get rid of these ads? Well, you can do a simple “filter” to get rid of it inside of Power Query, or we can go under the hood and optimize this query to our needs Smile

Figuring out what’s going on under hood

This new experience relies on 2 new functions and 1 new window (to input the examples). These 2 new functions are:

  • Web.BrowserContents – returns a text snapshot of the html DOM for the specific url. It has an optional parameter that accepts a record with a field “WaitFor” so you can set up a timer or a trigger on when to exactly take the DOM snapshot. (Incredibly useful for pages that heavily rely on javascript)
  • Html.Table – returns a table containing the results of running the specific CSS Selectors against the provided html. It has an optional parameter that accepts a record with the field “RowSelector” so you can specify how to find new rows in the html document. (can’t tell you how useful and convenient this is)

Let’s look at the code that Power Query created for me:

let
  Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > .a-size-small”}, {“Average Rating”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > SPAN”}, {“Release Date”, “.a-col-right > .a-spacing-small > :nth-child(1) > .a-size-small”}}, [RowSelector=”[id*=””result_””]”]),
     #”Changed Type” = Table.TransformColumnTypes(#”Extracted Table From Html”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}})
in
#”Changed Type”

Note: You can find a list of all the available css selectors here.

Let’s modify our code to support a more complex scenario. Let’s imagine that we need to wait 15 seconds before taking the DOM snapshot (for the Web.BrowserContents function) and we also need to make sure that we’re only extracting data from the book titles that are not sponsored (ads).

Adding a WaitFor optional Parameter

In theory, we should be able to simply change our first step to be like this:

= Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”,[WaitFor=#duration(0,0,0,15)])

and that should make Internet Explorer Chromium (through Power Query) wait 15 seconds before taking the snapshot. Unfortunately, when I tested this I couldn’t make it work (not even with the samples provided in the actual function), so I sent an email about this to the Power BI folks that handle this.

Since this is a preview feature, is understandable that some things might not work, but I wasn’t expecting to have problems with even the samples that they provided.

image

Either way, if this ends up working some day, this will be a CRUCIAL part of any web scrapping scenario inside of Power Query. You can also define other rules like “Wait for xyz element to load” before taking the snapshot.

UPDATE 14-May-2018 12:34pm : Just received some feedback from one of the PMs at Microsoft. The documentation is incorrect and the correct usage for the WaitFor record should be like the folowing 

[WaitFor = [Selector = “div.foo”, Timeout = #duration(0,0,0,10)]]

Modifying the CSS Selectors created by Power Query

OK – so what is CSS and what are CSS Selectors? For the first question, you can find a full article about it here and for the second one, imagine that it’s a syntax so you can target or select a specific element.

Usually web pages follow a set of patterns to style and name things. For example, in a grid we can “color” each line or row using a specific “id” or “class” pattern, thus saving time so we don’t have to set up a specific color scheme to each individual line or row inside of a grid.

In most web browsers you can enable a developer tab or dock (inspect element) so you can see the elements of your html page. In my case, I’m using Chrome and I can go through the code to see what’s going on and I noticed that Amazon uses a list to display their products and within each of those lines there’s a pattern for the id. Notice how each of them start with the string “result_”, but here’s the thing – even the “Sponsored” products have that pattern for their id.

What sets them apart? Well,each line also has a class and they all have almost the same text string except that the sponsored products have “AdHolder” near the end of them:

image

Power Query uses something called a “Row Selector” which is basically a record or field of a record that tells Power Query that every time that it finds a certain pattern, to put that data inside a new row.

Now let’s go ahead and take a look at the “RowSelector” record that Power Query automatically created for us:

let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > .a-size-small”}, {“Average Rating”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > SPAN”}, {“Release Date”, “.a-col-right > .a-spacing-small > :nth-child(1) > .a-size-small”}}, [RowSelector=“[id*=””result_””]”]),
#”Changed Type” = Table.TransformColumnTypes(#”Extracted Table From Html”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}})
in
#”Changed Type”

The specific CSS Selector that is using is this one and the definition states that:

“The [attribute*=value] selector matches every element whose attribute value containing a specified value.”

What’s happening is that every time that Power Query finds an element that satisfies that selector,  Power Query will create a new row with the corresponding data. In simple terms, there should be as many rows as there are ids with that specific pattern of “result_”…BUT! we need to add another logical condition to that. We want those elements that have the “result_” pattern for their id, but we also want to not target or extract data from the classes that contain “AdHolder” and for that we’re going to replace that RowSelector with:

[RowSelector=“:not([class*=””AdHolder””])[id*=””result_””]”]

which explicitly does what we want. The result of that shows me this:

image

After modifying the CSS Selectors that Power Query initially created for me, I ended up with the following code:

let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “div > div > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span5.a-span-last > div > a”}, {“Average Rating”, “[class*=””a-icon a-icon-star””]”}, {“Release Date”, “span.a-size-small.a-color-secondary”}}, [RowSelector=”:not([class*=””AdHolder””])[id*=””result_””]”]),
#”Added Index” = Table.AddIndexColumn(#”Extracted Table From Html”, “Index”, 1, 1),
#”Changed Type” = Table.TransformColumnTypes(#”Added Index”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}}),
#”Split Column by Delimiter” = Table.SplitColumn(#”Changed Type”, “Average Rating”, Splitter.SplitTextByDelimiter(” “, QuoteStyle.Csv), {“Average Rating”}),
#”Changed Type1″ = Table.TransformColumnTypes(#”Split Column by Delimiter”,{{“Average Rating”, Currency.Type}}),
#”Renamed Columns” = Table.RenameColumns(#”Changed Type1″,{{“Index”, “Position”}})
in
#”Renamed Columns”

Caveats: whenever you want to click the gear icon, be sure to hit the refresh button first or you might be working against cached data. The “window” experience to enter data from example might work the first time, but if your webpage changes then this window might crash. The Html.Table function only exposes span attributes for now – this limitation is by-design, but I can imagine that at one point we could look at specific attributes like “src” and simple “href” so we can get urls from the webpage (for images and links).

Conclusion: Web Scraping is now CRAZY simple with this feature, but this also offers a lot of flexibility for people that want to go hardcore with Web Scraping. There are still some things that you can’t do with Power Query like going through a page that requires a log in or “sign in” page. There’s no automatic way of doing a pagination yet in Power Query, but I believe that they might be working on exposing a native Power Query function that could get the job done – right now you need to create your pagination function as shown here.

Check out the report that I created from this dataset:

Don’t forget to leave your feedback in the comments section below!

UPDATE 24-July-2018: Check the improved Web Scraping experience here.