1
votes

on my site I allow for direct text file uploads. These files are then stored on the server, and displayed on the website. I use UTF-8 on the site.

Now I run into trouble when people upload non-UTF-8 files which contain special chars, such as é.

I've been doing some testing. Made 2 text files, both containing the same word fiancée. One encoded UTF-8 and one encoded ISO 8859-2.

The UTF-8 one uploads fine, and shows the text correct, but the ISO 8859-2 shows as fianc�e.

Now I've tried to detect the uploaded file content with mb_detect_encoding, but whatever file I throw at it, it always detect UTF-8.

I noticed that I can use utf8_encode to convert the ISO 8859-2 files to valid UTF-8, but this only works on non-UTF files. And as I currently cannot detect non-UTF files, I cannot use the utf8_encode function, as it messes up valid UTF-8 files.

Hope this makes sense :)

So my question is, how can I detect files that are for sure not UTF-8 encoded to start with, so that I can use the utf8_encode function on them.

1

1 Answers

3
votes

You cannot. Welcome to encodings.

Seriously though, files are just binary blobs. The bits and bytes in the file could mean anything at all; it could be images, CAD data or, perhaps, text. It depends on how you interpret the bytes. For text files that specifically means with which encoding you interpret them. There's nothing in the files themselves that tells you the correct encoding, you have to know it. Typically you want to know it from metadata accompanying the file. In the case of random user uploads though, there is no metadata, and/or it wouldn't be reliable. So you cannot "know".

The next step would be to guess, but that is obviously not foolproof. You can rule out certain encodings, for example if a file does not validate as UTF-8 (mb_check_encoding($data, 'UTF-8') == false), then it cannot be UTF-8. However, any single byte encoding will validate as any other single byte encoding. It's impossible to distinguish ISO-8859-1 from ISO-8859-2 this way, the bytes are equally valid in both. It's just that the characters that show up may not be the ones you want. To detect that automatically you need a statistical language analyser which can tell you that this character probably shouldn't show up in that word for it to be grammatical. Obviously for that to work you need to know the language used in the file, or you need to detect that first… And even then this is hardly foolproof.

The sanest way is to ask the user. Accept the upload, perhaps do some upfront testing on which encodings can be ruled out, then ask the user which of a bunch of possible encodings the file is in. Present them the result, what the file looks like when interpreted as the chosen encoding, let the user confirm that it looks alright. Many decent text editors do this when you open a file with an ambiguous encoding.