0
votes

I would like to test some file character encoding detection functionality, where I input files of type UTF-8, windows-1252, ISO-8859-1, etc.

I also want to input files with unknown character encoding so that the user can be alerted.

I haven't found a good way to create files with an unknown or undetectable character encoding.

2
What is an unknown encoding? I presume you don't tell the detection utility what the real encoding is, so any encoding is unknown before detection. - lenz

2 Answers

1
votes
head -c1024 /dev/random > /tmp/badencoding

This is almost certainly what you want in practice (1 kB of random data), but there isn't really a good definition of "undetectable character encoding." This random file is a legal byte sequence in any 8-bit encoding that assigns a character to every byte value, such as ISO-8859-1. That it certainly is not meant to be ISO-8859-1 text is only a heuristic judgment. So all you're going to wind up doing is testing that your algorithm behaves the way your users probably want it to; there is no ultimate "correct" here without reading the mind of the person who created the file.
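As a rough illustration of that point, here is a minimal Python sketch. The helper names `random_bytes` and `decodes_as` are made up for this example, and the seed exists only to make the result reproducible. It shows that random bytes always decode as Latin-1, while the same bytes are very unlikely to be valid UTF-8.

```python
import random

def random_bytes(n, seed=None):
    """Return n pseudo-random bytes (the seed is only for reproducible tests)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

def decodes_as(data, encoding):
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

data = random_bytes(1024, seed=0)
# Latin-1 assigns a character to every byte value, so this never fails.
print(decodes_as(data, "latin-1"))
# Random bytes almost never form well-formed UTF-8 sequences.
print(decodes_as(data, "utf-8"))
```

A detector that reports "ISO-8859-1" for this file is not wrong in any strict sense, which is why the test can only check for the behaviour you have decided you want.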

0
votes

An empty text file has an undetectable character encoding (except if it has a Unicode BOM).
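For the BOM case, a small Python sketch of how a test might check for one. The function name `sniff_bom` and the returned labels are choices made for this example, not part of any particular detection library; the BOM constants come from the standard `codecs` module.

```python
import codecs

# Check the longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b""))               # truly empty file: nothing to go on
print(sniff_bom(codecs.BOM_UTF8))   # "empty" text that still carries a BOM
```

Note that FF FE 00 00 could in principle also be UTF-16 LE text starting with U+0000, so even a BOM check involves a judgment call.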

But basically, you either have to require the user to tell you which character encoding their file uses, or tell them which one to use (or both, if you specify a default but allow it to be overridden, which is what many compilers do).

You can then test the contents for validity against the agreed character encoding. This will catch some errors, but note that many character encodings accept any sequence of bytes, so any content is always valid (even if the character encoding is not the one that was used to write the file).
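A minimal sketch of such a validity check in Python, assuming the agreed encoding is one Python's codec registry knows. The helper name `first_invalid_offset` is hypothetical.

```python
def first_invalid_offset(data, encoding):
    """Return the offset of the first undecodable byte, or None if data is valid."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

cp1252_bytes = "caf\u00e9".encode("cp1252")  # b"caf\xe9"
utf8_bytes = "caf\u00e9".encode("utf-8")     # b"caf\xc3\xa9"

# Windows-1252 bytes checked against UTF-8: the error is caught.
print(first_invalid_offset(cp1252_bytes, "utf-8"))
# UTF-8 bytes checked against Latin-1: no error, even though the
# encoding is wrong and the text decodes to mojibake.
print(first_invalid_offset(utf8_bytes, "latin-1"))
print(utf8_bytes.decode("latin-1"))
```

The second case is the limitation described above: Latin-1 accepts every byte, so validation alone cannot reject the file.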

You can then test for consistency with expected values, such as some syntax or a set of allowable characters or words, to catch more errors (but you wouldn't necessarily be able to say the character encoding didn't match; it could be that the content itself is incorrect).
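One possible consistency check, sketched in Python: flag control characters that ordinary text should not contain. The rule (only tab, newline, and carriage return are allowed) and the name `suspicious_chars` are assumptions chosen for this example; a real application would use whatever its own content rules are.

```python
import unicodedata

def suspicious_chars(text):
    """Return control characters in text other than tab, LF, and CR."""
    allowed_controls = {"\t", "\n", "\r"}
    return [
        ch for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in allowed_controls
    ]

# The euro sign is byte 0x80 in windows-1252. Decoded as ISO-8859-1,
# that byte becomes the C1 control character U+0080.
misread = "\u20ac5".encode("cp1252").decode("latin-1")
print(suspicious_chars(misread))
print(suspicious_chars("plain text\n"))
```

A hit here suggests a mismatch but does not prove one, since the file could simply contain stray control characters.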

To create files with different character encodings, you could write a program or use a 3rd-party program such as iconv or PowerShell.
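If you go the "write a program" route, a short Python sketch might look like this. The function name `write_samples`, the file naming scheme, and the sample text are all invented for the example.

```python
import os
import tempfile

def write_samples(directory, text, encodings):
    """Write text once per encoding; return a dict of encoding -> file path."""
    paths = {}
    for enc in encodings:
        path = os.path.join(directory, f"sample-{enc}.txt")
        # Encode explicitly and write bytes so no platform default interferes.
        with open(path, "wb") as fh:
            fh.write(text.encode(enc))
        paths[enc] = path
    return paths

with tempfile.TemporaryDirectory() as tmp:
    sample_text = "na\u00efve caf\u00e9\n"
    paths = write_samples(tmp, sample_text, ["utf-8", "cp1252", "iso-8859-1"])
    for enc, path in paths.items():
        print(enc, os.path.getsize(path))
```

For this sample text the windows-1252 and ISO-8859-1 files are byte-for-byte identical, which is a useful reminder that a detector cannot always tell the two apart.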

If you want an unknown character encoding, just generate a random integer map, convert a file, discard the map, and then not even you will know it.
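The random-map idea could be sketched in Python like this, interpreting the "integer map" as a random permutation of the 256 byte values (that interpretation, and the names `random_byte_map` and `scramble`, are assumptions of this example).

```python
import random

def random_byte_map(rng):
    """Return a 256-byte translation table that is a random permutation."""
    values = list(range(256))
    rng.shuffle(values)
    return bytes(values)

def scramble(data, rng=None):
    """Re-map every byte of data through a throwaway random table."""
    rng = rng or random.SystemRandom()
    table = random_byte_map(rng)
    # The table goes out of scope on return, so the mapping is discarded.
    return data.translate(table)

original = "Some ordinary text.\n".encode("utf-8")
print(scramble(original))
```

Because the map is a one-to-one substitution, byte frequencies and repetitions survive, so the output is an unknown single-byte "encoding" rather than noise.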

Ultimately, text files are too technical for users to deal with. Give them some other option such as an open document or spreadsheet format such as .odt, .docx, .ods, or .xlsx. These are very easy to read by programs.