I've been trying to import a long text file generated from a PDF reader application (SODA-PDF). Source document is a script in PDF format.
The convertged text files look ok in note pad, but I get a variety of errors when trying to read the file into a string and manipulate it.
None of the following methods which I've seen in various threads seem to work:
clean1=Iconv.conv('ASCII//IGNORE', 'UTF8', s)
or
clean1=s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '', UNIVERSAL_NEWLINE_DECORATOR: true)
or
clean1=s.gsub(/[\u0080-\u00ff]/,"")
The first method, using Iconv gives
Iconv::InvalidEncoding: invalid encoding ("ASCII", "UTF8")
when invoked.
The second method appears to work, but fails on various string manipulations like
lines= s.split("\n") unless s.blank?
with
ArgumentError: invalid byte sequence in UTF-8
(Either split or blank? will throw the exception.)
The 3rd method also fails with the 'invalid byte sequence in UTF-8' error.
I am quite hazy on the whole character encoding thing, so excuse any obvious stupidity here.
I'm going to try a character by character filtering, but that's kind of pain since the docs I am working with can be 100+ pages, and I'm hoping there's an easier solve.
Env: Win7 64/ ruby 1.9.3p484 (2013-11-22) [i386-mingw32] / Rails 4.0.3