1
votes

I have uncompressed a PDF file with pdftk and I am trying to edit it in Emacs with regexps.

The problem is that this file has accented characters and Emacs displays them as octal sequences: e.g. \340 for à. To edit this file I have two possibilities (at least I think so).

a) Apply an encoding such that Emacs displays the actual accented characters and not their octal equivalents. Vim already displays accented characters properly;

b) Search for the octal sequences with regexps.

As for a), I have tried (set-buffer-file-coding-system 'utf-8-dos), (set-buffer-file-coding-system 'utf-8-unix), (set-buffer-file-coding-system 'raw-text) without success.

As for b), after applying set-buffer-file-coding-system, I can search incrementally for the octal sequences with C-q ... RET, but I cannot do what I really need: replacing strings. C-q ... RET does not match the octal sequences when using M-% or C-M-%, and C-x 8 `... doesn't work either.

Thanks in advance. Antonio

2
can you upload a sample PDF somewhere? - user4815162342
Newbie here, hope it is possible to post links. Anyway I just created a one line test file: filedropper.com/test_16 . In Emacs have a look at line 47 and note how you can manually replace \340 with à, save and reopen it in your PDF viewer. - antonio
A single high-bit octal character is most certainly not UTF-8. Try with CP1252 or perhaps CP850. - tripleee

2 Answers

1
votes

Try the following key-sequence in the buffer visiting the PDF file:

C-x RET r character-coding RET

This will revisit the file using the character-encoding you specify.

Alternatively, if you want to specify the character encoding to use before visiting a file, you can do

C-x RET c character-coding RET

immediately before typing C-x C-f.

See the documentation for more details.
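
If you are unsure which name to give as the coding system, a quick check outside Emacs can narrow it down. The following is only a sketch in Python, not part of Emacs: it takes the single byte that Emacs shows as \340 and tries to decode it with a few candidate encodings (the candidate list and the helper name `try_decode` are my own choices for illustration).

```python
# The byte Emacs displays as \340 (octal) is 0xE0 in hexadecimal.
raw = bytes([0o340])

def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are invalid in this codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# Candidate encodings, taken from the ones mentioned in this thread.
for codec in ("utf-8", "cp1252", "latin-1", "cp850"):
    print(codec, repr(try_decode(raw, codec)))
```

A lone 0xE0 byte is not valid UTF-8, so that line prints None, while cp1252 and latin-1 both give à. That suggests trying windows-1252 or latin-1 with the C-x RET r sequence above.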

0
votes

@Stefan

Actually I was not speaking about a difference in saving, but in displaying.

In both cases, closing and reopening the file leaves it as is, with no apparent changes. As for displaying, with (set-buffer-file-coding-system 'windows-1252-unix) the mode line changes from (Unix) --- to (Unix) **-, signaling that no change of coding system occurred, and in fact the characters in the buffer are the same (the octal sequences are still there).

When using (revert-buffer-with-coding-system 'windows-1252-unix), the mode line changes from (Unix) --- to * (Unix) ---, signaling that the coding system has changed to windows-12**, according to the mnemonic shown by M-x list-coding-systems, and in fact the octal sequences are now displayed as their equivalent accented characters.

If I apply (set-buffer-file-coding-system 'windows-1252-unix) to other buffers, for example the scratch buffer, the mode line changes from 1\-- to * (Unix) **. So for this buffer there is an actual and advertised change from latin-1-dos to windows-1252-unix.

There might well be a coherent design in this, of which I am not aware.

Antonio