Quick Hit: Scraping javascript-“enabled” Sites with {htmlunit}

I’ve mentioned {htmlunit} in passing before, but did not include any code in that post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you’d have to use Selenium/{RSelenium}, Splash/{splashr} or Chrome/{decapitated} to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven’t had any issues getting Java working with R on Windows, Ubuntu/Debian, or macOS in a very long while, even on freshly minted systems, so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).
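If you want to sanity-check that Java and {rJava} are talking to each other before installing anything else, a quick check along these lines (a minimal sketch, not part of the scraping workflow) will show you which JVM R found:

library(rJava)

.jinit() # start the JVM from R

# ask the JVM for the version string of the Java runtime R is using
.jcall("java/lang/System", "S", "getProperty", "java.version")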

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping (toscrape.com) and, specifically, its javascript-generated sandbox site. It looks like this:

Now bring up both the “view source” version of the page in your browser and the developer tools “Elements” panel and you’ll see that the content is in javascript right there on the site, but the source has no div.quote elements because they’re generated dynamically after the page loads.

View Source & DevTools Elements Views

The critical difference between those two views is one reason I consider the use of tools like “Selector Gadget” to be more harmful than helpful. You’re really better off learning the basics of HTML and how dynamic pages work than relying on that crutch (for scraping), as it’ll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit}, which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason, so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>%
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags
##    <chr>                                                            <chr>          <list>
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr [4]>
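The tags column comes back as a list column. If you’d rather have one row per quote/tag pair it unnests cleanly; here’s a small sketch using {tidyr} (not loaded above), assuming the tibble we just built was assigned to a hypothetical quotes_df:

library(tidyr)

quotes_df %>%          # hypothetical name for the tibble built above
  unnest(cols = tags)  # one row per quote/tag combination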

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form, so we could also use {V8} to pull it straight out of the <script> tag:

library(V8)
library(stringi)

ctx <- v8() # fire up a V8 javascript engine

html_node(pg, xpath=".//script[contains(., 'data')]") %>% # target the <script> tag with the data
  html_text() %>%                                         # grab the javascript source inside it
  stri_extract_first_regex("(?s)var data =.*?\\];") %>%   # keep just the "var data = [...]" assignment (one way to isolate it; tweak if the page changes)
  ctx$eval()                                              # evaluate that assignment in the V8 context

ctx$get("data") # pull the quotes back into R