Bash/Linux Find non-ASCII character in a .txt file and replace it with an ASCII character

Question

I have a list of files offloaded from oceanographic instruments. For some reason, there is occasionally a non-ASCII character inserted where an ASCII character should be. I have found grave-E (È) where there should be a W to denote the western hemisphere in longitude records.

Here's what the data looks like:

CUMSECS Date UTC    Time UTC    Date Local  Time local  Z (m)   Target Z    Z Bot   Temp    PAR Salin   Ang VelX    Ang VelY    Ang VelZ    Pump +  Pump -  Gctr    Fix secs    Date UTC    Time UTC    Date Local  Time Local  Lat LatD    Latm        Lon LonD    Lonm        DOP Temp    PAR Salin   Batt V      CMD secs    Date Local  Time Local  No. Cmds
526068034   09/01/16    18:00:34    09/01/16    11:00:34     3.75    2.69    
3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068039   09/01/16    18:00:39    09/01/16    11:00:39     3.75    2.69    
3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068044   09/01/16    18:00:44    09/01/16    11:00:44     3.74    2.69    
3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068049   09/01/16    18:00:49    09/01/16    11:00:49     3.73    2.69    
3.75     0.29    0.000000    0.00   -30732  13588   31909   60399   7538    -82
543622771   03/23/17    22:19:31    03/23/17    15:19:31    38.31877    38  
19.1262 N   123.07136   123  4.2812 È   23.6    115.06     0.0000   96.00   
121.718 
547764151   05/10/17    20:42:31    05/10/17    13:42:31     0.03   16.00   
127.00  13.68   1074.904320 33.56   -4908   -3976   261 1   0   0
547764152   05/10/17    20:42:32    05/10/17    13:42:32     0.00   16.00   
127.00  13.68   1074.904320 33.56   -4908   -3976   261 1   0   0

I can find the non-ASCII characters with the following Bash command:

pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt

I would like to loop through a series of files, find these characters, and replace them with a 'W' so that I can subsequently read them into R and process them en masse. Alternatively, a workaround to the error returned by R in trying to read these files ("multibyte string in location...") would be equally effective for my purposes. Any help much appreciated.

I tried pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt | sed 's/[^\x00-\x7F]/W/g', but the sed call returns an "illegal byte sequence" error. — Connor Dibble
Have you tried to change the fileEncoding argument of read.table? — Scarabee
I have tried the fileEncoding and Encoding routes in R (explicitly setting latin1 or utf8), but to no avail. My understanding of encoding issues may be limited, but as far as I can tell it's not really an encoding problem. Perhaps I'm wrong; any ideas? — Connor Dibble
I never could get the tr method to work; it always returns an "illegal byte sequence" error. However, using iconv as suggested by Kind Stranger was successful. In the end I did not replace the characters, but I was able to get the files into an encoding R recognizes, so I can batch process files where those little multibyte characters are hidden. If anyone has ideas on how to actually replace the characters (or why I am getting this error in a Mac OS X bash terminal session), that would help me make my code more robust. For now, my research remains in one hemisphere. — Connor Dibble

Kind Stranger · Accepted Answer · 2017-06-21T00:00:30

I think the problem is that È in UTF-8 is a multibyte character consisting of the bytes \xc3 and \x88, and sed can't seem to deal with that for whatever reason. As @Jack suggested, tr might be a better tool for the job (tested in Bash for Windows, which doesn't have pcregrep):

user@PC:~$ grep -P '[^\x00-\x7f]' 170510_ocean_Copepod.txt | tr 'È' 'W'
19.1262 N   123.07136   123  4.2812 WW   23.6    115.06     0.0000   96.00

Notice that tr converts each of the two bytes to W separately, which is why the output shows WW instead of a single W.
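If a single W is wanted, one possible approach (a sketch, not something tested in this thread) is to run sed in the C locale, where input is treated as raw bytes, and replace each run of non-ASCII bytes with one W. Forcing LC_ALL=C is also a commonly suggested way around the "illegal byte sequence" error from sed and tr on Mac OS X. The helper name ascii_w is made up for illustration, and the sketch assumes two non-ASCII characters never sit directly next to each other, since an adjacent pair would collapse into one W.

```shell
# Replace each run of non-ASCII bytes with a single 'W'.
# ascii_w is a hypothetical helper name, not part of any tool.
ascii_w() {
    # In the C locale sed works byte by byte. The bracket expression
    # [^ -~[:cntrl:]] matches any byte that is neither printable ASCII
    # (space through ~) nor an ASCII control character, i.e. 0x80-0xFF.
    # \{1,\} makes a multibyte character one match, so 'È' (\xc3 \x88)
    # becomes 'W' rather than 'WW'.
    LC_ALL=C sed 's/[^ -~[:cntrl:]]\{1,\}/W/g' "$@"
}

# Demo on a line shaped like the bad record (octal \303\210 is UTF-8 'È').
printf '19.1262 N   123.07136   123  4.2812 \303\210   23.6\n' | ascii_w
```

To clean a real file, something like ascii_w 170510_ocean_Copepod.txt > cleaned.txt writes the result to a new file; the in-place flag of sed behaves differently between GNU and BSD versions, so redirecting to a new file is the simpler choice.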

Another method could be to convert the whole file using iconv. ISO-8859-15 (Latin-9) is one example of a single-byte character encoding, in which È is stored as one byte. The command to convert the file using iconv would be:

iconv -f utf-8 -t iso-8859-15 -o <converted-file> <input-file>
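Since the question asks about a series of files, the same conversion could be wrapped in a loop. The following is a minimal sketch under some assumptions: the inputs are .txt files in one directory, they are valid UTF-8, and the function name convert_dir and both directory arguments are hypothetical. It uses output redirection instead of -o, because -o is not accepted by every iconv build.

```shell
# Convert every .txt file in a directory from UTF-8 to ISO-8859-15,
# writing the results to a separate directory so originals stay intact.
# convert_dir and its two arguments are hypothetical names.
convert_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for f in "$src_dir"/*.txt; do
        # Skip the unexpanded pattern when the directory has no .txt files.
        [ -e "$f" ] || continue
        # iconv exits non-zero if the input is not valid UTF-8 or contains
        # a character that ISO-8859-15 cannot represent.
        if ! iconv -f utf-8 -t iso-8859-15 "$f" > "$out_dir/$(basename "$f")"; then
            echo "iconv failed on $f" >&2
        fi
    done
}

# Example call (paths are placeholders):
# convert_dir ./raw ./converted
```

Writing to a second directory keeps the raw instrument files unchanged, so a failed conversion can be rerun without losing data.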
