Looking up words in dictionaries is the alpha and omega of text mining. I am, for instance, interested in whether a given word from a large dictionary (>100k words) occurs in a sentence or not, for a list of over 1M sentences.
The best tool for this task is arguably the Julia language, but R can do a decently fast job too (since many of its packages run efficient C++ code under the hood). As in Julia, a lot depends on how you write your code. Let us first load the required libraries and create fictional data, consisting of 1M rows and a dictionary of 200k words:
library("data.table")
library("magrittr")
library("matrixStats")
randomString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  a <- paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
  return(a)
}
text.data <- data.table(id = 1:1000000)
text.data[, text := {randomString(nrow(text.data))} ]
text.data[, textlong := sapply(id, function(x) randomString(round(runif(1,1,10))) %>% paste(.,collapse=" ")) ]
reference <- data.table(ref = sample(unique(text.data$text),200000))
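Before moving on, a quick sanity check on the generated data (output not shown here, since no seed is set and the strings differ from run to run):
head(text.data, 3)   # one random word in 'text', a random phrase in 'textlong'
str(reference)       # 200,000 reference words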
Looking up Single Words
The column text.data$text of our table consists of single words only. The fastest way to look up a single word, i.e. a string of characters, in a vector of words in R is the %chin% operator provided by data.table. Even then, a significant gain can be achieved depending on how you call it: prefer the := operator within data.table to the <- assignment.
system.time({
  text.data$inlist <- text.data$text %chin% reference$ref
})
# user  system elapsed
# 0.616  0.008   0.625
system.time({
  text.data[, inlist := text %chin% reference$ref]
})
# user  system elapsed
# 0.050  0.001   0.052
# This is very fast.
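For comparison, the same lookup can be timed with base R's %in% operator (the column name inlist_base is just for illustration; %chin% is data.table's faster, character-only counterpart of %in%, so expect this to be slower):
system.time({
  text.data[, inlist_base := text %in% reference$ref]
})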
Looking up the Presence of Words within Sentences
The column text.data$textlong consists of whole phrases. To look up words within a phrase, you first need to split it. For that purpose you can use, for example, stringr::str_split(x, "[ ]") (in my example, words are separated by spaces only; you can of course also split on commas, periods, parentheses and other characters, for instance stringr::str_split(x, "[ ,;:./\\(\\)]")).
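A quick illustration of the broader splitting pattern on a made-up sentence (note the empty strings it can produce; the parallel function below filters them out with .[. != ""]):
library("stringr")
str_split("One word, another (word); and more.", "[ ,;:./\\(\\)]")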
A Parallelizing Attempt
My first reflex was to try to parallelize with foreach and doParallel. The function below turned out to be horribly slow (hours); don’t even bother trying.
library("foreach")
library("doParallel")
library("stringr")
detectnameinstringp <- function(x, dict) {
  splitedv <- str_split(x, "[ ]")
  cl <- parallel::makeCluster(4)
  doParallel::registerDoParallel(cl)
  out <- foreach(i = 1:length(splitedv), .export = "reference", .packages = c("magrittr", "data.table")) %dopar% {
    unlist(splitedv[i]) %>% tolower %>% .[. != ""] %>% chmatch(., dict, nomatch = FALSE) %>% any(.)
  }
  stopCluster(cl)
  return(unlist(out))
}
# You can try this out, but your R session will be busy for hours....
system.time({
text.data[ , dicmatchinphrase := detectnameinstringp(textlong,reference$ref)]
})
An Attempt with Quanteda: OK but not Excellent
Quanteda is a very efficient library for text mining. Using either quanteda::tokens_lookup or quanteda::dfm_lookup speeds up the process considerably, bringing it from hours down to less than a minute. Nevertheless, it remains about 36 times slower than my solution presented in the last section below.
tokens_lookup
To be fair, Quanteda’s tokens_lookup function was not designed for the purpose I use it for here: it actually constructs a vast object listing all dictionary entries, each grouping a set of words. As Kohei Watanabe states, “tokens_lookup() is a very complex function designed to handle overlapping and nested phrases correctly.” In my case, I use only one entry in my dictionary, encompassing all the words from reference$ref; this works, but it is of course not the way Quanteda conceived the use of its dictionaries. From the beautiful object constructed by Quanteda, I only retain the number of occurrences of the single entry of my dictionary, which can be either 1 or 0.
library("quanteda")
detectnameinstringquanteda <- function(x,dict){
mytoks <- tokens(x, what="fastestword")
mydict <- dictionary(list(a=dict))
tokslookup <- tokens_lookup(mytoks,mydict,valuetype = "fixed")
ntoken(tokslookup) == 1
}
system.time({
text.data[ , dicmatchinphrase := detectnameinstringquanteda(textlong,reference$ref)]
})
# user system elapsed
# 52.816 3.279 50.173
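For intuition, here is the same single-entry-dictionary trick on a toy example with made-up words (not part of the benchmark above):
toytoks <- tokens(c("alpha beta gamma", "delta epsilon"), what = "fastestword")
toydict <- dictionary(list(a = c("beta", "zeta")))
ntoken(tokens_lookup(toytoks, toydict, valuetype = "fixed"))
# the first document contains a dictionary word ("beta"), the second does not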
dfm_lookup
A similar approach consists in using Quanteda’s dfm_lookup function. The bottleneck here is that dfm_lookup can only be applied to a document-feature matrix, whose construction takes a long time:
detectnameinstringquanteda <- function(x, dict) {
  mydfm <- tokens(x, what = "fastestword") %>% dfm
  mydict <- dictionary(list(a = dict))
  lookedup <- dfm_lookup(mydfm, mydict, valuetype = "fixed")
  ntoken(lookedup) == 1
}
system.time({
text.data[ , dicmatchinphrase := detectnameinstringquanteda(textlong,reference$ref)]
})
# user system elapsed
# 59.103 2.802 59.982
The Wonder of data.table::tstrsplit and matrixStats::rowAnys
To achieve a much faster result, I’ve turned to data.table::tstrsplit, which not only splits the phrases but stores all their words in a data.table, padding rows with NAs at the end wherever needed if, as is most probable, your phrases do not all have the same number of words (only your longest phrase will have no NAs in the last column(s)). This allows me to avoid loops (for, sapply, lapply, etc.) altogether, including parallelized foreach loops, and to run %chin% on whole vectors, column by column.
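A quick look at what tstrsplit returns, on two made-up phrases:
tstrsplit(c("one two three", "four five"), "[ ]") %>% as.data.table
# a 2-row, 3-column data.table; the shorter phrase is padded with NA in column V3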
Since I am interested in whether a given word occurs in a phrase or not, I combine the lookup results of each row with rowAnys (basically the any function applied over each row of the data.table converted to a matrix). The approach makes a great difference:
detectnameinstring <- function(x, dict) {
  splitedv <- tstrsplit(x, "[ ]") %>% as.data.table
  cols <- colnames(splitedv)
  for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
  return(as.matrix(splitedv) %>% rowAnys)
}
system.time({
text.data[ , dicmatchinphrase2 := detectnameinstring(textlong,reference$ref)]
})
# user system elapsed
# 9.855 0.004 9.905
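As a side note, the same column-by-column idea can be written without the matrix conversion, combining the per-column %chin% results with Reduce. A minimal sketch (the function name is mine, and I have not benchmarked it against the rowAnys version):
detectnameinstringreduce <- function(x, dict) {
  splitedv <- tstrsplit(x, "[ ]")
  # each list element is one "column" of words; %chin% returns FALSE for the NA padding
  Reduce(`|`, lapply(splitedv, function(words) words %chin% dict))
}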
And cut the time down again by parallel processing!
Last, the call to detectnameinstring can itself be parallelized. To do so, split the data into chunks of n rows, send each chunk to detectnameinstring(), and wrap the calls in a foreach() %dopar% {} construct. Here is a test with a cluster of 4 workers:
detectnameinstringparallel <- function(x, dict, rowsplit) {
  if (length(x) < rowsplit) {
    return(detectnameinstring(x, dict))
  } else {
    myseq <- seq(0, length(x), rowsplit)
    if (length(x) %% rowsplit > 0) myseq <- c(myseq, length(x))
    cl <- parallel::makeCluster(4)
    doParallel::registerDoParallel(cl)
    out <- foreach(
      i = 1:(length(myseq) - 1),
      .packages = c("stringr", "data.table", "matrixStats", "magrittr"),
      .export = c("dict", "detectnameinstring")
    ) %dopar% {
      detectnameinstring(x[(myseq[i] + 1):(myseq[i + 1])], dict)
    }
    stopCluster(cl)
    return(unlist(out))
  }
}
system.time({
  text.data[ , dicmatchinphrase := detectnameinstringparallel(textlong, reference$ref, 10000)]
})
# user system elapsed
# 1.459 0.334 17.517