Looking up words in dictionaries is the alpha and omega of text mining. I am, for instance, interested in knowing whether a given word from a large dictionary (>100k words) occurs in a phrase, for a list of over 1M phrases.

R can be very slow or much, much faster at this task, depending on how you code. Let us first load the required libraries and create fictional data, consisting of 1M lines of data and a dictionary of 200k words:


library(data.table)
library(magrittr)
library(stringr)
library(matrixStats)

randomString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), simplify = FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

text.data <- data.table(id = 1:1000000)
text.data[, text := randomString(nrow(text.data))]
text.data[, textlong := sapply(id, function(x) randomString(round(runif(1, 1, 10))) %>% paste(collapse = " "))]
reference <- data.table(ref = sample(unique(text.data$text), 200000))

Looking up single words

The column of our table text.data$text consists of single words only. The fastest way to look up a single word, i.e. a string of characters, in a vector of words in R is the %chin% operator provided by data.table. Even then, a significant gain can be achieved depending on how you use it: prefer the := operator within data.table to the <- assignment.

  text.data$inlist <- text.data$text %chin% reference$ref
# user  system elapsed 
# 0.616   0.008   0.625 

  text.data[,inlist  := text %chin% reference$ref]
# user  system elapsed 
# 0.050   0.001   0.052 
# This is very fast. 
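As a quick sanity check, %chin% behaves like %in% restricted to character vectors; the words below are made up purely for illustration:

```r
library(data.table)

words <- c("apple", "banana", "cherry")
dict  <- c("banana", "cherry", "date")

words %chin% dict  # same result as words %in% dict, but faster on large character vectors
# [1] FALSE  TRUE  TRUE
```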

Looking up the presence of words within phrases

The column text.data$textlong consists of whole phrases of words. To look up words within a phrase, you first need to split it. For that purpose, you can use for example stringr::str_split(x,"[ ]") (in my example, words are separated by spaces only; you can of course also split on commas, periods, parentheses and other elements, for instance: stringr::str_split(x,"[ ,;:./\\(\\)]")).
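For illustration, here is what str_split returns on made-up phrases: a list with one character vector per input phrase.

```r
library(stringr)

phrases <- c("one two three", "four five")
str_split(phrases, "[ ]")
# [[1]]
# [1] "one"   "two"   "three"
#
# [[2]]
# [1] "four" "five"
```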

A parallelizing attempt

My first reflex was to try to parallelize with foreach and doParallel.


detectnameinstringv <- function(x) {
  splitedv <- str_split(x, "[ ]")
  cl <- parallel::makeCluster(4) # detectCores() gives me 4 cores, so I use them
  doParallel::registerDoParallel(cl)
  out <- foreach(i = 1:length(splitedv), .export = "reference", .packages = c("magrittr", "data.table")) %dopar% {
    unlist(splitedv[i]) %>% .[. != ""] %>% chmatch(., reference$ref, nomatch = 0) %>% any(.)
  }
  parallel::stopCluster(cl)
  unlist(out)
}

The wonder of data.table::tstrsplit and matrixStats::rowAnys

The above function turned out to be very slow (hours). To achieve a much faster result, I turned to data.table::tstrsplit, which not only splits a phrase but stores all its words in a table, inserting NAs at the end wherever needed if, as is most likely, your phrases do not all have the same number of words; only your longest phrase(s) will have no NAs in the last column(s). This allows me to avoid loops (for, sapply, lapply, etc.) altogether, including parallelized foreach loops, and to run %chin% on whole vectors, column by column.
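A small sketch of that NA padding, again on made-up phrases:

```r
library(data.table)

phrases <- c("one two three", "four five")
as.data.table(tstrsplit(phrases, "[ ]"))
# a data.table with columns V1 = c("one", "four"), V2 = c("two", "five"),
# V3 = c("three", NA): the shorter phrase is padded with NA
```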

Since I am interested in knowing whether a given word occurs in a phrase or not, I combine all lookup results from each row with rowAnys (essentially the any function applied to each row of a data.table converted to a matrix). This approach makes a great difference:

detectnameinstring <- function(x) {
  splitedv <- tstrsplit(x, "[ ]") %>% as.data.table
  cols <- colnames(splitedv)
  for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% reference$ref)
  return( as.matrix(splitedv) %>% rowAnys )
}

  text.data[ , dicmatchinphrase := detectnameinstring(textlong)]
# user  system elapsed 
# 9.855   0.004   9.905 
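To see what rowAnys contributes on its own, here is a tiny standalone example on a logical matrix (one row per phrase, one column per word position):

```r
library(matrixStats)

# each cell: did the word at this position match the dictionary?
m <- matrix(c(TRUE,  FALSE, FALSE,
              FALSE, FALSE, FALSE), nrow = 2, byrow = TRUE)
rowAnys(m)  # TRUE for any row containing at least one match
# [1]  TRUE FALSE
```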

This can of course be parallelized as well, but using explicit parallelization might prevent data.table from using its own internal, already-implemented parallelism.
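If you want to inspect or control that internal parallelism, data.table exposes it through getDTthreads() and setDTthreads(); a quick sketch (the optimal thread count depends on your machine):

```r
library(data.table)

getDTthreads()   # number of threads data.table currently uses
setDTthreads(0)  # 0 means: use all the logical CPUs data.table deems usable
getDTthreads()
```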

The last function is the fastest I’ve found so far. Any suggestions to further improve speed are welcome.

PS: In case of memory overload

In some cases, if your computer lacks RAM and/or you have many phrases and/or your phrases are long, constructing the data.table might hit the memory limit. In that case, you can split the analysis and send your data to the detectnameinstring function in chunks of n rows.

# Iterative call to detectnameinstring. Slower, but with a lower RAM footprint
detectnameinstringlarge <- function(x, rowsplit) {
  if (length(x) < rowsplit) {
    detectnameinstring(x)
  } else {
    starts <- seq(1, length(x), by = rowsplit)
    foreach(i = starts) %do% {
      detectnameinstring(x[i:min(i + rowsplit - 1, length(x))])
    } %>% unlist
  }
}

text.data[ , dicmatchinphrase := detectnameinstringlarge(textlong, 10000)]