Improved Web Scraping Experience

A lot of things have happened in the last couple of days. I’m for the first time in my life in Seattle spending some quality time with people that I’ve never met in my life other than through the power of the internet. It’s a nice feeling to finally put a face to a name that you always see on your inbox.

I’ll be speaking later today, but I had to get this post out to the world as soon as possible.

I’ve covered the new Web Scraping experience before in here, and talking with the product team today at the Business Applications Summit they (Ehren specifically) told me that they just updated and improved the experience, so he showed me a demo and I took home my homework so I could implement what he showed me on my own dataset

What’s new

If you want to follow along, just use the following url to get data from the web:

https://www.amazon.com/s?keywords=power+query

Now when you use ‘Add table from examples’ and you input your examples, you get a list of possibilities:

which helps you pin down exactly the css selector that you need.

The second point is that once you actually define your CSS selectors and what columns/data you want, you can now click on the OK button and a new “custom table” will appear on the Navigator window which is super convenient.

The last and most important new feature is that you can finally access attributes inside of your html. Something that I couldn’t do in my previous post was to grab the url of each specific book, but as of yesterday I can do that inside of Power BI Desktop. Here’s the query that I ended up creating:

let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “div > div > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span5.a-span-last > div > a”}, {“Average Rating”, “[class*=””a-icon a-icon-star””]”}, {“Release Date”, “span.a-size-small.a-color-secondary”},{“url”, “.a-col-right > .a-spacing-small:nth-child(1) > :nth-child(1) > .a-link-normal “, each [Attributes][href]}, {“imgurl”, “img.s-access-image.cfMarker”, each [Attributes][src]}}, [RowSelector=”:not([class*=””AdHolder””])[id*=””result_””]”]),
#”Added Index” = Table.AddIndexColumn(#”Extracted Table From Html”, “Index”, 1, 1),
#”Changed Type” = Table.TransformColumnTypes(#”Added Index”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}, {“imgurl”, type text}, {“url”, type text}})
in
#”Changed Type”

Notice that every element from the Html.Table’s first parameter (a list of lists), the most inner lists now accepts 3 arguments where the first one is the name for the new column, the 2nd the actual element that we’ll be grabbing and the last one a function that tells how to extract the attribute from that specific html element. Here’s the current documentation that was added to the latest version of Power BI Desktop specifically for this functionality:

Finally, thanks to this new feature I’m able to grab the urls of things like the book images (src) and also any links to the direct page of the book (href) so my final report can have a look more like this:

Special Thanks  to Ehren from the Power Query team for showing me this. Without him I would not have noticed that there was an updated experience since it was listed on the release notes of the latest Power BI Desktop.