Looking up words in dictionaries is the alpha and omega of text mining. I am, for instance, interested in whether a given word from a large dictionary (>100k words) occurs in a sentence or not, for a list of over 1M sentences.
The best tool for this task is arguably the Julia language, but R can do a decently fast job too (since many of its packages run efficient C++ code under the hood). As in Julia, a lot depends on how you write your code. Let us first load the required libraries and create fictional data, consisting of 1M rows and a dictionary of 200k words:
library("data.table")
library("magrittr")
library("matrixStats")
randomString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  a <- paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
  return(a)
}
text.data <- data.table(id = 1:1000000)
text.data[, text := {randomString(nrow(text.data))} ]
text.data[, textlong := sapply(id, function(x) randomString(round(runif(1,1,10))) %>% paste(.,collapse=" ")) ]
reference <- data.table(ref = sample(unique(text.data$text),200000))
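Before moving on, a quick sanity check on the generated data (output not shown here, since no seed is set and the strings differ from run to run):
head(text.data, 3)   # one random word in 'text', a random phrase in 'textlong'
str(reference)       # 200,000 reference words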
Looking up Single Words
The column text.data$text of our table consists of single words only. The fastest way to look up a single word, i.e. a string of characters, in a vector of words in R is the %chin% operator provided by data.table. Even then, a significant gain can be achieved depending on how you call it: prefer the := operator within data.table to the <- assignment.
system.time({
  text.data$inlist <- text.data$text %chin% reference$ref
})
# user  system elapsed
# 0.616  0.008   0.625
system.time({
  text.data[, inlist := text %chin% reference$ref]
})
# user  system elapsed
# 0.050  0.001   0.052
# This is very fast.
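For comparison, the same lookup can be timed with base R's %in% operator (the column name inlist_base is just for illustration; %chin% is data.table's faster, character-only counterpart of %in%, so expect this to be slower):
system.time({
  text.data[, inlist_base := text %in% reference$ref]
})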
Looking up the Presence of Words within Sentences
The column text.data$textlong consists of whole phrases. To look up words within a phrase, you first need to split it. For that purpose you can use, for example, stringr::str_split(x, "[ ]") (in my example, words are separated by spaces only; you can of course also split on commas, periods, parentheses and other characters, for instance stringr::str_split(x, "[ ,;:./\\(\\)]")).
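A quick illustration of the broader splitting pattern on a made-up sentence (note the empty strings it can produce; the parallel function below filters them out with .[. != ""]):
library("stringr")
str_split("One word, another (word); and more.", "[ ,;:./\\(\\)]")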
A Parallelizing Attempt
My first reflex was to try to parallelize with foreach and doParallel. The function below turned out to be horribly slow (hours); don’t even bother trying.
library("foreach")
library("doParallel")
library("stringr")
detectnameinstringp <- function(x, dict) {
  splitedv <- str_split(x, "[ ]")
  cl <- parallel::makeCluster(4)
  doParallel::registerDoParallel(cl)
  out <- foreach(i = 1:length(splitedv), .export = "reference", .packages = c("magrittr", "data.table")) %dopar% {
    unlist(splitedv[i]) %>% tolower %>% .[. != ""] %>% chmatch(., dict, nomatch = FALSE) %>% any(.)
  }
  stopCluster(cl)
  return(unlist(out))
}
# You can try this out, but your R session will be busy for hours....
system.time({
text.data[ , dicmatchinphrase := detectnameinstringp(textlong,reference$ref)]
})
An Attempt with Quanteda: OK but not Excellent
Quanteda is a very efficient library for text mining. Using either quanteda::tokens_lookup or quanteda::dfm_lookup speeds up the process considerably, bringing it from hours down to less than a minute. Nevertheless, it remains about 36 times slower than my solution presented in the last section below.
tokens_lookup
To be fair, Quanteda’s tokens_lookup function was not designed for the purpose I use it for here: it actually constructs a vast object listing all dictionary entries, each grouping a set of words. As Kohei Watanabe states, “tokens_lookup() is a very complex function designed to handle overlapping and nested phrases correctly.” In my case, I use only one entry in my dictionary, encompassing all the words from reference$ref; this works, but it is of course not the way Quanteda conceived the use of its dictionaries. From the beautiful object constructed by Quanteda, I only retain the number of occurrences of the single entry of my dictionary, which can be either 1 or 0.
library("quanteda")
detectnameinstringquanteda <- function(x,dict){
mytoks <- tokens(x, what="fastestword")
mydict <- dictionary(list(a=dict))
tokslookup <- tokens_lookup(mytoks,mydict,valuetype = "fixed")
ntoken(tokslookup) == 1
}
system.time({
text.data[ , dicmatchinphrase := detectnameinstringquanteda(textlong,reference$ref)]
})
# user system elapsed
# 52.816 3.279 50.173
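For intuition, here is the same single-entry-dictionary trick on a toy example with made-up words (not part of the benchmark above):
toytoks <- tokens(c("alpha beta gamma", "delta epsilon"), what = "fastestword")
toydict <- dictionary(list(a = c("beta", "zeta")))
ntoken(tokens_lookup(toytoks, toydict, valuetype = "fixed"))
# the first document contains a dictionary word ("beta"), the second does not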
dfm_lookup
A similar approach consists in using Quanteda’s dfm_lookup function. The bottleneck here is that dfm_lookup can only be applied to a document-feature matrix, whose construction takes a long time:
detectnameinstringquanteda <- function(x, dict) {
  mydfm <- tokens(x, what = "fastestword") %>% dfm
  mydict <- dictionary(list(a = dict))
  lookedup <- dfm_lookup(mydfm, mydict, valuetype = "fixed")
  ntoken(lookedup) == 1
}
system.time({
text.data[ , dicmatchinphrase := detectnameinstringquanteda(textlong,reference$ref)]
})
# user system elapsed
# 59.103 2.802 59.982
The Wonder of data.table::tstrsplit and matrixStats::rowAnys
To achieve a much faster result, I’ve turned to data.table::tstrsplit, which not only splits the phrases but stores all their words in a data.table, padding rows with NAs at the end wherever needed if, as is most probable, your phrases do not all have the same number of words (only your longest phrase will have no NAs in the last column(s)). This allows me to avoid loops (for, sapply, lapply, etc.) altogether, including parallelized foreach loops, and to run %chin% on whole vectors, column by column.
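A quick look at what tstrsplit returns, on two made-up phrases:
tstrsplit(c("one two three", "four five"), "[ ]") %>% as.data.table
# a 2-row, 3-column data.table; the shorter phrase is padded with NA in column V3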
Since I am interested in whether a given word occurs in a phrase or not, I combine the lookup results of each row with rowAnys (basically the any function applied over each row of the data.table converted to a matrix). The approach makes a great difference:
detectnameinstring <- function(x, dict) {
  splitedv <- tstrsplit(x, "[ ]") %>% as.data.table
  cols <- colnames(splitedv)
  for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
  return(as.matrix(splitedv) %>% rowAnys)
}
system.time({
text.data[ , dicmatchinphrase2 := detectnameinstring(textlong,reference$ref)]
})
# user system elapsed
# 9.855 0.004 9.905
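As a side note, the same column-by-column idea can be written without the matrix conversion, combining the per-column %chin% results with Reduce. A minimal sketch (the function name is mine, and I have not benchmarked it against the rowAnys version):
detectnameinstringreduce <- function(x, dict) {
  splitedv <- tstrsplit(x, "[ ]")
  # each list element is one "column" of words; %chin% returns FALSE for the NA padding
  Reduce(`|`, lapply(splitedv, function(words) words %chin% dict))
}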
And cut the time down again by parallel processing!
Last, the call to detectnameinstring can itself be parallelized. To do so, split the data into chunks of n rows, send each chunk to detectnameinstring(), and wrap the calls in a foreach() %dopar% {} construct. Here is a test with a cluster of 4 workers:
detectnameinstringparallel <- function(x, dict, rowsplit) {
  if (length(x) < rowsplit) {
    return(detectnameinstring(x, dict))
  } else {
    myseq <- seq(0, length(x), rowsplit)
    if (length(x) %% rowsplit > 0) myseq <- c(myseq, length(x))
    cl <- parallel::makeCluster(4)
    doParallel::registerDoParallel(cl)
    out <- foreach(
      i = 1:(length(myseq) - 1),
      .packages = c("stringr", "data.table", "matrixStats", "magrittr"),
      .export = c("dict", "detectnameinstring")
    ) %dopar% {
      detectnameinstring(x[(myseq[i] + 1):(myseq[i + 1])], dict)
    }
    stopCluster(cl)
    return(unlist(out))
  }
}
system.time({
  text.data[ , dicmatchinphrase := detectnameinstringparallel(textlong, reference$ref, 10000)]
})
# user system elapsed
# 1.459 0.334 17.517