There are essentially two ways to access SPARQL data from R: the generic curl and httr packages, or the dedicated SPARQL package.

I recommend the curl+httr solution. The SPARQL package is a nice wrapper, but it dates back to 2013 and lacks important parameters for HTTP(S) communication, such as proxy settings (needed to reach SPARQL endpoints from behind corporate firewalls) and headers (which let you specify the format of the returned data, e.g. XML or CSV). The SPARQL package also relies on RCurl, which has problems with some network settings.

In this post, I show you how to write an .Rmd file that will allow you to execute SPARQL code chunks in RStudio.

Setting Up a Custom Chunk Type

One of the nice things about the knitr package is that it allows you to define custom chunk types. Let us load all necessary libraries and set up a {sparql} type.

library(httr)
library(magrittr)
library(xml2)
library(data.table)

knitr::knit_engines$set(sparql = function(options) {
  # Collapse the chunk body into a single query string
  code <- paste(options$code, collapse = '\n')
  ep <- options$endpoint
  # Build the request URL; percent-encode the query and explicitly encode "+" as %2B
  qm <- paste0(ep, "?query=", gsub("\\+", "%2B", URLencode(code, reserved = TRUE)))
  # Fetch the system proxy settings (relevant behind corporate firewalls)
  proxy_url <- curl::ie_get_proxy_for_url(ep)
  proxy_config <- use_proxy(url = proxy_url)
  varname <- options$output.var
  if (is.null(options$output.type)) {
    output_type <- "csv"
  } else {
    output_type <- options$output.type
  }
  if (output_type == "list") {
    # Request the default XML and convert it to a nested list
    out <- GET(qm, proxy_config, timeout(60)) %>% read_xml() %>% as_list()
    nresults <- length(out$sparql$results)
  } else {
    # Request CSV and parse it into a data.frame
    queryres_csv <- GET(qm, proxy_config, timeout(60), add_headers(c(Accept = "text/csv")))
    out <- queryres_csv$content %>% rawToChar %>% textConnection %>% read.csv
    nresults <- nrow(out)
  }
  # ifelse() would mangle non-scalar values such as a data.frame, so use a plain if/else
  chunkout <- if (!is.null(varname)) qm else out
  text <- paste("The SPARQL query returned", nresults, "results")
  if (!is.null(varname)) assign(varname, out, envir = knitr::knit_global())
  knitr::engine_output(options, options$code, chunkout, extra = text)
})

That’s it. Just execute the chunk above and your {sparql} chunks are ready to run. A few details of the engine function are worth noting:

The two lines proxy_url <- curl::ie_get_proxy_for_url(ep) and proxy_config <- use_proxy(url=proxy_url) use the curl package to read your machine’s proxy settings. This is probably unnecessary on a home computer, but it plays a vital role when you access the web with R from behind a corporate firewall on a Windows machine: it lets R fetch the proxy configuration straight from your Internet Explorer settings.
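On machines without Internet Explorer proxy settings (e.g. Linux or macOS), ie_get_proxy_for_url() returns NULL, and passing NULL to use_proxy() may not do what you want. A slightly defensive variant could look like this (a sketch; the fallback to httr’s empty config() is my addition, not part of the engine above):

```r
library(httr)

ep <- "https://lindas.admin.ch/query"
# ie_get_proxy_for_url() returns NULL when no system proxy is configured
proxy_url <- curl::ie_get_proxy_for_url(ep)
# Fall back to an empty request config when there is no proxy
# (assumption: a direct connection works in that case)
proxy_config <- if (!is.null(proxy_url)) use_proxy(url = proxy_url) else config()
```

Either branch yields a request configuration object that can be passed to GET() unchanged.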

The line queryres_csv <- GET(qm, proxy_config, timeout(60), add_headers(c(Accept = "text/csv"))) asks the SPARQL endpoint to return CSV instead of XML. Most modern endpoints honor the Accept = "text/csv" header. CSVs are easy to convert to data.frames in R; XML is easier to convert to a list.
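To see how little work the CSV route needs, here is the same parsing step applied to a hand-written sample response body (the payload below is made up for illustration):

```r
# A hypothetical CSV body, as a SPARQL endpoint would return it for
# a SELECT with the variables ?sub and ?subtype
csv_text <- "sub,subtype\nhttps://example.org/catalog,https://example.org/type"
df <- read.csv(textConnection(csv_text), stringsAsFactors = FALSE)
nrow(df)    # one result row
names(df)   # the variables from the SELECT clause
```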

The lines varname <- options$output.var and, later, if (!is.null(varname)) assign(varname, out, envir = knitr::knit_global()) let you write the result of your SPARQL query directly to a variable holding a data.frame or, if you prefer, a list object.
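The assign() mechanism can be tried outside of knitting, too: when no document is being knit, knitr::knit_global() simply returns the global environment. A toy illustration (the data and variable names below are made up):

```r
# Simulate what the engine does with output.var (toy data, illustrative names)
out <- data.frame(sub = "https://example.org/catalog")
varname <- "queryres_csv"
assign(varname, out, envir = knitr::knit_global())
exists("queryres_csv", envir = knitr::knit_global())
```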

SPARQL Chunks

Get a data.frame

Thanks to the above definition of a SPARQL engine for the R Markdown chunks, we can now get a data.frame directly by running a chunk. The result is stored in the object named in the output.var parameter.

```{sparql output.var="queryres_csv", endpoint="https://lindas.admin.ch/query"}
PREFIX schema: <http://schema.org/>
SELECT * WHERE {
  ?sub a schema:DataCatalog .
  ?subtype a schema:DataType .
}
```

Get a List

If you prefer a list object to a data.frame, you can request one by setting output.type="list":

```{sparql output.var="queryres_list", endpoint="https://lindas.admin.ch/query", output.type="list"}
PREFIX schema: <http://schema.org/>
SELECT * WHERE {
  ?sub a schema:DataCatalog .
  ?subtype a schema:DataType .
}
```

A Chunkless Approach

Nothing hinders you, of course, from using inline R script instead of chunks. First, set up your endpoint and query variables:

endpoint <- "https://lindas.admin.ch/query"
proxy_url <- curl::ie_get_proxy_for_url(endpoint)
proxy_config <- use_proxy(url = proxy_url)
query <- "PREFIX schema: <http://schema.org/>
SELECT * WHERE {
  ?sub a schema:DataCatalog .
  ?subtype a schema:DataType .
}"
querymanual <- paste0(endpoint, "?query=", gsub("\\+", "%2B", URLencode(query, reserved = TRUE)))

Then choose one of the following options.

Get a data.frame in a Simple R Script

queryres_csv <- GET(querymanual, proxy_config, timeout(60), add_headers(c(Accept = "text/csv")))
queryres_content_csv <- queryres_csv$content %>% rawToChar %>% textConnection %>% read.csv

Get a List in a Simple R Script

queryres <- GET(querymanual, proxy_config, timeout(60))
queryres_content <- queryres %>% read_xml() %>% as_list()
queryres_content$sparql$results[[1]]

If your endpoint does not accept the Accept = "text/csv" header, you can also retrieve a list first and then parse it into a data.frame. Beware, though, this parsing might need adaptation for the data returned by your specific endpoint:

# Extract the variable names from the query result's header
tableheaders <- sapply(queryres_content$sparql$head, function(x) {attributes(x) %>% .$name})
# Convert each result row into a one-row data.table and bind them together
table <- lapply(queryres_content$sparql$results, function(x) {
  d <- sapply(x, function(r) {
    r$uri[[1]] %>% unlist
  }) %>% t %>% data.table
  colnames(d) <- tableheaders
  d
}) %>% rbindlist()
rm(tableheaders)
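If the as_list() round trip feels heavy, the same information can also be pulled straight from the XML with xml2 and XPath. This is a sketch on a minimal hand-written SPARQL results document; the exact structure of your endpoint’s response may differ:

```r
library(xml2)

# A minimal SPARQL XML results document (hypothetical sample)
xml_text <- '<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="sub"/></head>
  <results>
    <result><binding name="sub"><uri>https://example.org/catalog</uri></binding></result>
  </results>
</sparql>'
doc <- read_xml(xml_text)
# xml2 assigns the default namespace the prefix d1; rename it to s for readability
ns <- xml_ns_rename(xml_ns(doc), d1 = "s")
vars <- xml_attr(xml_find_all(doc, ".//s:variable", ns), "name")
uris <- xml_text(xml_find_all(doc, ".//s:uri", ns))
```

With the variable names and the URI values in plain character vectors, assembling a data.frame is then a one-liner.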

PS: Reactivate the Chunk Play Button if Needed

There is a bug in the current version of RStudio (2021.09.1) that might deactivate the “play” button for non-R chunks. To make the play button reappear for sparql chunks (as well as for sql or python chunks), select the “Chunk Output Inline” option and resave your file.
