Improved Web Scraping Experience in Power BI

A lot of things have happened in the last couple of days. I’m for the first time in my life in Seattle spending some quality time with people that I’ve never met in my life other than through the power of the internet. It’s a nice feeling to finally put a face to a name that you always see on your inbox.

I’ll be speaking later today, but I had to get this post out to the world as soon as possible.

I’ve covered the new Web Scraping experience before in here, and talking with the product team today at the Business Applications Summit they (Ehren specifically) told me that they just updated and improved the experience, so he showed me a demo and I took home my homework so I could implement what he showed me on my own dataset

What’s new

If you want to follow along, just use the following url to get data from the web:

https://www.amazon.com/s?keywords=power+query

Now when you use ‘Add table from examples’ and you input your examples, you get a list of possibilities:

which helps you pin down exactly the css selector that you need.

The second point is that once you actually define your CSS selectors and what columns/data you want, you can now click on the OK button and a new “custom table” will appear on the Navigator window which is super convenient.

The last and most important new feature is that you can finally access attributes inside of your html. Something that I couldn’t do in my previous post was to grab the url of each specific book, but as of yesterday I can do that inside of Power BI Desktop. Here’s the query that I ended up creating:

let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “div > div > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span5.a-span-last > div > a”}, {“Average Rating”, “[class*=””a-icon a-icon-star””]”}, {“Release Date”, “span.a-size-small.a-color-secondary”},{“url”, “.a-col-right > .a-spacing-small:nth-child(1) > :nth-child(1) > .a-link-normal “, each [Attributes][href]}, {“imgurl”, “img.s-access-image.cfMarker”, each [Attributes][src]}}, [RowSelector=”:not([class*=””AdHolder””])[id*=””result_””]”]),
#”Added Index” = Table.AddIndexColumn(#”Extracted Table From Html”, “Index”, 1, 1),
#”Changed Type” = Table.TransformColumnTypes(#”Added Index”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}, {“imgurl”, type text}, {“url”, type text}})
in
#”Changed Type”

Notice that every element from the Html.Table’s first parameter (a list of lists), the most inner lists now accepts 3 arguments where the first one is the name for the new column, the 2nd the actual element that we’ll be grabbing and the last one a function that tells how to extract the attribute from that specific html element. Here’s the current documentation that was added to the latest version of Power BI Desktop specifically for this functionality:

Finally, thanks to this new feature I’m able to grab the urls of things like the book images (src) and also any links to the direct page of the book (href) so my final report can have a look more like this:

Special Thanks  to Ehren from the Power Query team for showing me this. Without him I would not have noticed that there was an updated experience since it was listed on the release notes of the latest Power BI Desktop.

2 Comments

  1. HI Miguel,

    I’m not following at a fundamental level here. Is this example Power BI Desktop? Which release, perhaps August 2018? Also, I don’t think I’ve ever seen ‘Add table from example’ but maybe that is because I usually use Excel version of Power Query.

    thanks

    • This is in PBI Desktop. I don’t think the Excel version of PQ has been updated in almost a year, and PQ features in Power BI were already well ahead of the Excel releases. Excel PQ used to get updates to the add-in module at least twice a year, so it’s possible any future PQ updates to Excel could only come through O365 now. A bit annoying as PQ is insanely powerful in Excel and has replaced 90% of work I used to have to do with VBA (having the .pdf connector for Excel would be great).

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.