The urlscan package (an interface to the urlscan.io API) is now at version 0.2.0 and supports urlscan.io’s authentication requirement when submitting a link for analysis. The service is handy if you want to learn all the gory technical details about a website.
For instance, say you wanted to check on r-project.org. You could go to urlscan.io manually, enter that domain into the request bar, and wait for the result:
Or, you can use R! First pick your preferred social coding site, read through the source (this is going to be standing advice in every post starting with this one: don’t blindly trust code from any social coding site), and then install the package:
devtools::install_git("https://git.sr.ht/~hrbrmstr/urlscan")
# or
devtools::install_gitlab("hrbrmstr/urlscan")
# or
devtools::install_github("hrbrmstr/urlscan")
Next, head back over to urlscan.io and grab an API key (it’s free). Stick that in your ~/.Renviron as URLSCAN_API_KEY and then run readRenviron("~/.Renviron") in your R console.
Now, let’s check out r-project.org:
library(urlscan)
library(tidyverse)

rproj <- urlscan_search("r-project.org")

rproj
## URL Submitted: https://r-project.org/
## Submission ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Submission Type: public
## Submission Note: Submission successful
There is more data in that rproj object, but we have enough to get more detailed results back. Note that the site will return an error when you use urlscan_result() if it hasn’t finished the analysis yet.
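If you’d rather not re-run the call by hand until the scan is done, a minimal polling sketch looks like this (the helper name, retry count, and wait time are my own, not part of the package, and the tryCatch() swallows any error, not just the “not finished yet” one):

# hypothetical helper: keep asking for the result until the analysis is ready
urlscan_result_patiently <- function(submission_id, tries = 10, wait = 5, ...) {
  for (i in seq_len(tries)) {
    res <- tryCatch(urlscan_result(submission_id, ...), error = function(e) NULL)
    if (!is.null(res)) return(res)   # got a result back; we're done
    Sys.sleep(wait)                  # analysis not finished yet; wait and retry
  }
  stop("Scan results were not ready after ", tries, " attempts.")
}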
rproj_res <- urlscan_result("eb2a5da1-dc0d-43e9-8236-dbc340b53772", include_shot = TRUE)

rproj_res
## URL: https://www.r-project.org/
## Scan ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Malicious: FALSE
## Ad Blocked: FALSE
## Total Links: 8
## Secure Requests: 19
## Secure Req %: 100%
The rproj_res object holds quite a bit of data and makes no assumptions about how you want to use it, so you will need to do some wrangling to pull out what you’re after. The rproj_res$scan_result entry contains the following components (a quick way to poke around is shown after the list):
task: Information about the submission: Time, method, options, links to screenshot/DOM
page: High-level information about the page: Geolocation, IP, PTR
lists: Lists of domains, IPs, URLs, ASNs, servers, hashes
data: All of the requests/responses, links, cookies, messages
meta: Processor output: ASN, GeoIP, AdBlock, Google Safe Browsing
stats: Computed stats (by type, protocol, IP, etc.)
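One way to get your bearings before wrangling is to peek at the structure with base R (nothing package-specific here beyond the scan_result entry itself):

names(rproj_res$scan_result)                    # task, page, lists, data, meta, stats
str(rproj_res$scan_result$lists, max.level = 1) # what each of the lists looks like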
Let’s see how many domains the R Core folks are allowing to track you:
curlparse::domain(rproj_res$scan_result$lists$urls) %>% # you can use urltools::domain() instead of curlparse
  table(dnn = "domain") %>%
  broom::tidy() %>%
  arrange(desc(n))
## # A tibble: 7 x 2
##   domain                        n
##
## 1 platform.twitter.com          7
## 2 www.r-project.org             5
## 3 pbs.twimg.com                 3
## 4 syndication.twitter.com       2
## 5 ajax.googleapis.com           1
## 6 cdn.syndication.twimg.com     1
## 7 r-project.org                 1
Since I added the include_shot = TRUE option, we also get a page screenshot back (as an image object).
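I won’t embed the image here, but if the screenshot comes back as a magick image (an assumption on my part, as is the rproj_res$screenshot slot name — check str(rproj_res) for where it actually lives), writing it to disk is a one-liner:

library(magick)

# hypothetical slot name; adjust to wherever the screenshot is actually stored
image_write(rproj_res$screenshot, path = "r-project-screenshot.png")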
There’s tons of metadata to explore about web sites with this package, so jump in, kick the tyres, have fun, and file issues/PRs as needed.