Unknown column? Force encoding of an entire table from “unknown” to “UTF-8” in R on Windows

André Ourednik, Thursday, April 19, 2018

A common knitr issue on Windows

Running R scripts on a Windows machine is equivalent to a dive into encoding hell.

In effect, your non-English data most likely contains characters like Ä, ü, è or š, or even 语言.

In all cases, the only serious way of dealing with these, in fact with any data in an international context, is adopting UTF-8 encoding.

This is why recent R packages like knitr or quanteda work with UTF-8 internally. The problem is that Windows doesn’t. UTF-8 has been around since 1996, yet your Windows 10 operating system, unlike Linux and OS X, most likely runs a Latin-1 or other Windows code page locale behind the scenes. I gave up programming R on Windows for that very reason and have happily written scripts on Ubuntu ever since, whenever I can. Chances are, though, that your corporate environment runs on a fleet of Dell machines with Windows installed and you cannot change your OS.
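To see which situation applies to a given machine, the session locale can be inspected from R itself. This is only a minimal sketch using base R; the variable name `info` is arbitrary. On R 4.2 or later with a recent Windows 10 build, the session may well report a UTF-8 locale, in which case the problem described here is less likely to appear.

```r
# Show the character-type locale of the current R session
Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the session locale is UTF-8 or Latin-1
info <- l10n_info()
info[["UTF-8"]]    # TRUE when the session runs in a UTF-8 locale
info[["Latin-1"]]  # TRUE when the session runs in a Latin-1 locale
```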

Bottom line: if you try to knit an Rmd script to HTML or Shiny, you are very likely to see the following error:

Error in eval(expr, envir, enclos) : unknown column ‘x’

By now, you have already tried everything, like setting the encoding of your data with iconv() or Encoding(), or with a stringr wrapper around such functions. You examine your character vector and see that some strings have been converted to UTF-8 while others remain “unknown”.
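That inspection can be done for a whole table at once. The following is a rough sketch in base R, not something from the original workflow: the sample data and the helper name `count_encodings` are made up for illustration.

```r
# Hypothetical sample data: an ASCII string, a UTF-8 string and a Latin-1 string
x_latin1 <- "fa\xE7ile"
Encoding(x_latin1) <- "latin1"
df <- data.frame(
  id = 1:3,
  label = c("euro", "\u20ac euro", x_latin1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the declared encodings in every character column
count_encodings <- function(d) {
  is_chr <- vapply(d, is.character, logical(1))
  lapply(d[is_chr], function(col) table(Encoding(col)))
}

count_encodings(df)
```

With this sample, the label column should show one string for each of the three declared encodings, which is the mixed state described above.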

This is normal, if buggy-looking, behavior. In effect, as mentioned here, Encoding() “cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it”:

> txt <- "€ euro" 
> Encoding(txt) 
[1] "UTF-8" 
> txt2 <- "euro" 
> Encoding(txt2) 
[1] "unknown" 
> Encoding(txt2) <- "UTF-8" 
> Encoding(txt2) 
[1] "unknown"
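The same effect shows up when converting with iconv(). In the sketch below (sample strings invented for illustration), the pure-ASCII element should still report "unknown" after conversion, even though both results are valid UTF-8. An "unknown" mark on an ASCII-only string is therefore not by itself a sign of broken data.

```r
# Hypothetical sample: one ASCII string and one Latin-1 string
x <- c("euro", "fa\xE7ile")
Encoding(x) <- "latin1"
Encoding(x)    # the mark sticks only on the non-ASCII element

# Convert to UTF-8; ASCII-only strings are never marked
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)

# Both elements are nevertheless valid UTF-8
validUTF8(y)
```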

The workaround

This is really terrible, but there is a workaround. It is very ugly, but it works: export your data.frame to a temporary CSV file and reimport it with data.table::fread(), specifying Latin-1 as the source encoding.

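library(data.table)  # provides fwrite() and fread()
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
# Write the table to a temporary CSV, then read it back declaring Latin-1
fwrite(df, "temp.csv")
your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")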

Tested with satisfaction on a Windows 10 machine.
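
If the round trip is needed in several places, it can be wrapped in a small function that cleans up after itself. This is only a sketch under assumptions: the helper name `reimport_as_latin1` and the demo data are hypothetical, and the demo uses ASCII-only strings so that it behaves the same on any platform.

```r
library(data.table)

# Hypothetical helper: round-trip a data frame through a temporary CSV
# and delete the file once the function returns
reimport_as_latin1 <- function(df) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  fwrite(df, tmp)
  fread(tmp, encoding = "Latin-1")
}

# ASCII-only demo data (hypothetical)
demo <- data.frame(id = 1:2, label = c("euro", "dollar"),
                   stringsAsFactors = FALSE)
clean <- reimport_as_latin1(demo)

# Inspect the declared encoding of the reimported strings
Encoding(clean$label)
```

Keep in mind that fread() guesses column types from the CSV text, so factor columns come back as plain character columns and the result is a data.table rather than a base data.frame.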

Cite as: André Ourednik (2018) « Unknown column? Force encoding of an entire table from “unknown” to “UTF-8” in R on Windows » in Maps and Spaces from https://ourednik.info/maps/2018/04/19/unknown-column-force-encoding-of-an-entire-table-from-unknown-to-utf-8-in-r-on-windows/ [Last-seen March 26th 2019].
Category: Tools