
I read a registry file with file.readline() in order to filter some substrings out. I make a copy of it with shutil.copyfile() (just to preserve the original), process the copy with foo(), and see nothing filtered out. While debugging, I found that the contents of the lines look very binary:

'˙ţW\x00i\x00n\x00d\x00o\x00w\x00s\x00 \x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00y\x00 \x00E\x00d\x00i\x00t\x00o\x00r\x00 \x00V\x00e\x00r\x00s\x00i\x00o\x00n\x00 \x005\x00.\x000\x000\x00\n'

which is rather obvious in hindsight, but I was not aware of it (Notepad++ presents the text neatly). My question is: how can I filter my strings out? I see two options: a reg->txt->reg approach (what I meant by the title), or converting these strings to bytes and then comparing them with the contents.

When I create the files by hand (copying and pasting the contents of the input file) and give them a .txt extension, everything works fine, but I would like this to be automated.

from shutil import copyfile

inputfile = "filename_in.reg"
outputfile = "filename_out.reg"
copyfile(inputfile, outputfile)

with open(outputfile, 'r+') as fd:
    contents = fd.readlines()
    for d in data:
        foo(fd, d, contents)
This is totally it. You might want to add it as a response, so I may accept it. Thank you so much - small thing, but much appreciated:) - Radoslaw Dubiel
Done - only made it a comment because I wasn't sure that would fix it - but I hoped it would at least get you started. - Martin Bonner supports Monica

1 Answer


Reg files are usually UTF-16 (which MS documentation usually calls "Unicode"). It looks like your debugger is treating the data as 8-bit characters, so every other character is a \x00: the high-order byte of each 16-bit character. The two odd characters at the very start are the byte-order mark. Notepad++ can display UTF-16 as ordinary text, which is presumably why the file looks fine there.
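To see the effect in isolation, here is a small sketch that decodes the same bytes twice. The cp1250 codec is only my guess at the 8-bit code page your system used (it happens to reproduce the two leading characters from your debug output), and the sample header string is copied from the question.

```python
import codecs

# Bytes as they would sit on disk: a little-endian byte-order mark
# followed by the header line encoded as UTF-16.
text = "Windows Registry Editor Version 5.00\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoded with an 8-bit code page (cp1250 is an assumption), every
# character is followed by a NUL and the BOM shows up as two odd glyphs.
as_8bit = raw.decode("cp1250")

# Decoded as UTF-16, the BOM is consumed and the text comes back clean.
as_utf16 = raw.decode("utf-16")

print(repr(as_8bit))
print(repr(as_utf16))
```

A substring search such as "Windows" in as_8bit fails because of the interleaved NULs, which would explain why nothing was filtered out.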

The fix is to tell Python that the text you are reading is in UTF-16 format:

open(outputfile, 'r+', encoding='utf16')
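Putting it together, a minimal sketch of the whole copy-and-filter step might look like the following. Since foo() and data are not shown in the question, filter_reg_lines() and the sample registry contents below are hypothetical stand-ins, and I am assuming you want to drop every line that contains one of the unwanted substrings.

```python
import os
import shutil
import tempfile


def filter_reg_lines(path, unwanted):
    """Drop every line of the UTF-16 file at `path` that contains any
    string from `unwanted`; return the number of lines removed."""
    # newline='' keeps the original line endings untouched.
    with open(path, 'r', encoding='utf-16', newline='') as fd:
        lines = fd.readlines()

    kept = [line for line in lines if not any(s in line for s in unwanted)]

    # Writing with 'utf-16' puts a byte-order mark at the start again.
    with open(path, 'w', encoding='utf-16', newline='') as fd:
        fd.writelines(kept)
    return len(lines) - len(kept)


if __name__ == "__main__":
    # Build a throwaway sample file so the sketch runs anywhere.
    workdir = tempfile.mkdtemp()
    inputfile = os.path.join(workdir, "filename_in.reg")
    outputfile = os.path.join(workdir, "filename_out.reg")
    with open(inputfile, 'w', encoding='utf-16', newline='') as fd:
        fd.write('Windows Registry Editor Version 5.00\r\n\r\n'
                 '[HKEY_CURRENT_USER\\Software\\Example]\r\n'
                 '"Keep"="1"\r\n'
                 '"Drop"="2"\r\n')

    shutil.copyfile(inputfile, outputfile)
    print(filter_reg_lines(outputfile, ['"Drop"']), "line(s) removed")
```

Reading everything first and then reopening the file for writing avoids the seek/truncate bookkeeping that 'r+' needs, and writing back as UTF-16 keeps the result in the same encoding as the original .reg file.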