
I've been trying to import a long text file generated by a PDF reader application (SODA-PDF). The source document is a script in PDF format.

The converted text files look OK in Notepad, but I get a variety of errors when I try to read the file into a string and manipulate it.

None of the following methods, which I've seen in various threads, seems to work:

  clean1=Iconv.conv('ASCII//IGNORE', 'UTF8', s)

or

  clean1=s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '', UNIVERSAL_NEWLINE_DECORATOR: true)

or

  clean1=s.gsub(/[\u0080-\u00ff]/,"")

The first method, using Iconv, gives

Iconv::InvalidEncoding: invalid encoding ("ASCII", "UTF8")

when invoked.

The second method appears to work, but fails on various string manipulations like

lines= s.split("\n") unless s.blank?

with

 ArgumentError: invalid byte sequence in UTF-8

(Either split or blank? will throw the exception.)

The third method also fails with the 'invalid byte sequence in UTF-8' error.
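A quick check that may help narrow this down: the error usually means Ruby has tagged the string as UTF-8 while the bytes in it are really in some other encoding. The sketch below uses a made-up sample string and a hypothetical helper named `diagnose` (neither is from the original code) to report the encoding tag, whether the bytes are valid for that tag, and which non-ASCII byte values occur.

```ruby
# Hypothetical helper: summarize how Ruby currently sees a string.
def diagnose(str)
  {
    encoding: str.encoding.name,        # the encoding Ruby assumes
    valid: str.valid_encoding?,         # false means the bytes do not fit that encoding
    high_bytes: str.bytes.to_a.select { |b| b > 0x7F }.uniq  # non-ASCII byte values
  }
end

# Made-up sample: the single byte 0xE9 is "e with acute accent" in ISO-8859-1,
# but on its own it is not a valid UTF-8 sequence.
bad = "caf\xE9\n".dup.force_encoding(Encoding::UTF_8)

p diagnose(bad)
```

If `valid` comes back false, that would also explain why the second method only appears to work: as far as I know, on Ruby 1.9.3 calling `encode('UTF-8', ...)` on a string already tagged as UTF-8 does nothing, so the bad bytes survive. Note too that the `split` call operates on `s`, not on `clean1`.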

I am quite hazy on the whole character encoding thing, so excuse any obvious stupidity here.

I'm going to try character-by-character filtering, but that's kind of a pain since the docs I'm working with can be 100+ pages, and I'm hoping there's an easier solution.

Env: Win7 64/ ruby 1.9.3p484 (2013-11-22) [i386-mingw32] / Rails 4.0.3


1 Answer


I discovered that my source file was encoded in ISO-8859-1, not UTF-8. Once I converted it to UTF-8, everything worked fine.
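For anyone hitting the same thing, here is a minimal sketch of one way to do that conversion in Ruby itself. It is not necessarily how the file was converted here. The sample bytes and the helper name `latin1_to_utf8` are made up for illustration.

```ruby
# Made-up stand-in for the raw file contents: ISO-8859-1 text in which
# the byte 0xE9 represents "e with acute accent".
raw = "caf\xE9 script\nline two".dup.force_encoding(Encoding::BINARY)

# Hypothetical helper: tag the bytes with their real encoding, then
# transcode to UTF-8. dup keeps the caller's string untouched.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

clean = latin1_to_utf8(raw)

# Ordinary string operations no longer raise once the string is valid UTF-8.
lines = clean.split("\n")
puts lines.length
puts clean.valid_encoding?
```

When reading straight from disk, `File.read(path, encoding: "ISO-8859-1:UTF-8")` should do the same transcoding in one step (`path` being whatever your file is called). This only helps if the file really is ISO-8859-1; Notepad output on Windows is sometimes Windows-1252 instead, so it is worth checking.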