Cleaning up PDFs of pre-1990s scanned texts for text mining in R with Quanteda

André Ourednik, Friday, February 16, 2018

Text sources are often PDFs. If optical character recognition (OCR) has been applied, the pdftools R package allows you to extract the text of each PDF to a text file stored in a folder. The readtext package then converts the set of text files into something useful for Quanteda. Nevertheless, some cleaning is necessary before transforming your text into a useful corpus. Here’s how.

First, you will need the abovementioned packages:
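The snippets in this post rely on the following packages; install them once with install.packages() before loading:

```r
# install.packages(c("pdftools", "readtext", "quanteda", "stringr"))
library(pdftools)  # pdf_text(): extract text from PDFs
library(readtext)  # readtext(): import text files as a data frame
library(quanteda)  # corpus(): build the corpus
library(stringr)   # str_replace_all(): regex-based cleaning
```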

Extracting a txt from a PDF

The extraction needs a single function, pdf_text("filepath_and_filename_of_your_PDF"), provided by the pdftools package.
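A minimal sketch, assuming your PDFs sit in a pdfs/ folder and the extracted text goes to txts/ (both folder and file names are hypothetical):

```r
library(pdftools)

# pdf_text() returns a character vector with one element per page;
# writeLines() stores the pages in a .txt file.
pages <- pdf_text("pdfs/mydocument.pdf")
writeLines(pages, "txts/mydocument.txt")
```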

Identifying the encoding of the txt files

Next comes converting your extracted text files to a text data frame. For this to work correctly, you need to identify the encoding of your source texts. Anything besides UTF-8 is basically useless, unless you work on English texts only. If you are lucky, the texts will already be in UTF-8. Then you can import them like this:
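A sketch of the import, assuming the .txt files are in a hypothetical txts/ folder; the result is a data frame with a doc_id and a text column:

```r
library(readtext)

# Import every .txt file in the folder; the files are already UTF-8,
# so no conversion is needed.
texts <- readtext("txts/*.txt", encoding = "UTF-8")
```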

On Windows machines, chances are that pdf_text() will extract your texts in Latin-1 encoding or worse, like Windows-1252. Luckily, readtext() converts any of these to UTF-8 when it imports, but you need to specify the encoding of the original texts, like this:
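For Latin-1 sources (the txts/ folder name is hypothetical):

```r
library(readtext)

# Tell readtext() the source encoding; the text column comes out in UTF-8.
texts <- readtext("txts/*.txt", encoding = "latin1")
```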

or like this:
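For Windows-1252 sources:

```r
library(readtext)

# Same import, but declaring Windows-1252 as the source encoding.
texts <- readtext("txts/*.txt", encoding = "windows-1252")
```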

The actual cleaning

The main part is accomplished with regular expressions (RegEx); there are excellent sites online for testing your expressions before applying them. The replacements below use str_replace_all, a function from the stringr package.

To remove hard-coded hyphenation:
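A sketch of the de-hyphenation step (the example string is made up). One caveat: the pattern also merges genuinely hyphenated compounds that happen to be wrapped at the hyphen.

```r
library(stringr)

# Words wrapped across lines end in "-" followed by a line break;
# rejoin the two letter groups around the hyphen.
scanned <- "the transfor-\nmation of the digi-\ntal text"
scanned <- str_replace_all(scanned, "(\\p{L})-\\s*\\n\\s*(\\p{L})", "\\1\\2")
```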

To remove rubbish characters that are neither letters, numbers, nor punctuation symbols:
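A sketch using Unicode character classes: anything that is not a letter (\p{L}), a digit (\p{N}), punctuation (\p{P}) or whitespace is replaced by a space (the example string is made up):

```r
library(stringr)

# "\u00a4" is the currency sign and "\u00a9" the copyright sign,
# two typical OCR leftovers; both are symbols, not letters/digits/punctuation.
scanned <- "un sujet\u00a4 de\u00a9 choix"
scanned <- str_replace_all(scanned, "[^\\p{L}\\p{N}\\p{P}\\s]", " ")
```

The extra spaces this leaves behind are collapsed in the next step.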

To remove multiple whitespaces:
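Runs of spaces, tabs and line breaks collapse into a single space:

```r
library(stringr)

# \s+ matches any run of whitespace, including newlines.
scanned <- "un sujet  de\n choix"
scanned <- str_replace_all(scanned, "\\s+", " ")
```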

The most difficult task is to join words that have been split into single characters, like “sans i l l u s i o n s quoique”, which should read “sans illusions quoique”. Their occurrence is due to what seems to be a bug in the current version of pdf_text(), but it can also occur naturally in older scanned documents: before computer-based word processing offered bold and italic type, i.e. up to the 1990s, typewriter users relied heavily on letter spacing to emphasise parts of the text. To turn these back into complete words, RegEx’s look-behind (?<= … ) and negative look-ahead (?! … ) operators save the day:
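One way to write this pattern (a sketch; the exact expression is an assumption): remove a space when the look-behind finds a single letter before it and the look-ahead finds a letter that is itself not followed by another letter, i.e. the next single letter of a spaced-out word. Caveat: genuine one-letter words (“a”, “à”, “y”…) standing next to each other would be joined too, so inspect the result.

```r
library(stringr)

# (?<=\b\p{L})  : the character before the space is a lone letter
# (?=\p{L}(?!\p{L})) : the character after it is a letter NOT followed
#                      by another letter, so also a lone letter
scanned <- "sans i l l u s i o n s quoique"
scanned <- str_replace_all(scanned, "(?<=\\b\\p{L})\\s(?=\\p{L}(?!\\p{L}))", "")
```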

And here goes the complete code:
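The whole pipeline as a sketch; the folder names (pdfs/, txts/) and the latin1 source encoding are assumptions to adapt to your own setting:

```r
library(pdftools)
library(readtext)
library(quanteda)
library(stringr)

# 1. Extract each PDF in pdfs/ to a .txt file in txts/
dir.create("txts", showWarnings = FALSE)
for (f in list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)) {
  writeLines(pdf_text(f),
             file.path("txts", sub("\\.pdf$", ".txt", basename(f))))
}

# 2. Import the .txt files, converting them to UTF-8
texts <- readtext("txts/*.txt", encoding = "latin1")

# 3. Clean: de-hyphenate, drop rubbish characters,
#    normalise whitespace, rejoin letter-spaced words
texts$text <- str_replace_all(texts$text, "(\\p{L})-\\s*\\n\\s*(\\p{L})", "\\1\\2")
texts$text <- str_replace_all(texts$text, "[^\\p{L}\\p{N}\\p{P}\\s]", " ")
texts$text <- str_replace_all(texts$text, "\\s+", " ")
texts$text <- str_replace_all(texts$text, "(?<=\\b\\p{L})\\s(?=\\p{L}(?!\\p{L}))", "")

# 4. Build the quanteda corpus
my_corpus <- corpus(texts)
summary(my_corpus)
```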

Cite as: André Ourednik (2018) « Cleaning up PDFs of pre-1990s scanned texts for text mining in R with Quanteda » in Maps and Spaces from [Last-seen December 10th 2018].
