
The source encoding of the .java files in our Maven project, which is stored in Subversion, is mostly ASCII, and some files are UTF-8.

I think the intention was that these files would be UTF-8. In the pom file the source encoding is specified as UTF-8.

Now our build fails. Specifically, our SonarQube analysis fails on a .java file which is ISO-8859 and which has a variable with a special character in its name. Using a special character is not a good idea, I think, but that aside, shouldn't the Java files have a consistent (UTF-8) encoding?

Or does it not matter that most are ASCII and only some are UTF-8? Is it the thought that counts?

By the way, I don't understand how these files end up with ASCII encoding. When I use an IDE or an editor like SublimeText, files end up as UTF-8.

I only get ASCII when I use Notepad on MS Windows, and Java developers do not typically use that for programming.

Should we change the source files to use UTF-8? Or maybe it doesn't matter and we can leave this as it is?

As an example: using MS Windows, I create one file using SublimeText and one file using Notepad.exe. I put the text 1234Ï in those files. The text contains a special character, an I with two dots.

When I look at these files on Linux using the file command

ostraaten@io:/tmp/iconv$ file sublimtext.txt 
sublimtext.txt: UTF-8 Unicode (with BOM) text, with no line terminators
ostraaten@io:/tmp/iconv$ file notepad.txt 
notepad.txt: ISO-8859 text, with no line terminators
ostraaten@io:/tmp/iconv$ 

So this shows Notepad saved the file as ISO-8859 regardless of the contents. When I check the files using iconv

ostraaten@io:/tmp/iconv$ iconv -f UTF-8 notepad.txt -o /dev/null 
iconv: incomplete character or shift sequence at end of buffer
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 sublimtext.txt -o /dev/null 
ostraaten@io:/tmp/iconv$ 
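
For comparison, the same strict check can be sketched in Java. This is only an illustration, not something from our build: the class name Utf8Check and the method isValidUtf8 are made up for the example, and the sample bytes are built in code (with a \u escape, so the example itself does not depend on the source encoding) instead of being read from notepad.txt.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: Utf8Check and isValidUtf8 are illustrative names.
class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 with no malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "1234\u00CF"; // 1234 followed by I with two dots
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(asLatin1)); // false
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(asUtf8));        // true
    }
}
```

The ISO-8859-1 version ends in the single byte 0xCF, which in UTF-8 would start a two-byte sequence that never completes. That matches the "incomplete character" message from iconv.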

I can open and save the file notepad.txt using SublimeText, but the encoding still shows up as ISO-8859.

The character does display correctly in both files. So this supports the idea that the editor tries to determine the encoding from the contents of the file, while elsewhere the file is still recognized as ISO-8859.

I can change the encoding using iconv

ostraaten@io:/tmp/iconv$ iconv -f ISO-8859-15 -t UTF-8 notepad.txt > notepad-utf8.txt
ostraaten@io:/tmp/iconv$ file notepad-utf8.txt 
notepad-utf8.txt: UTF-8 Unicode text, with no line terminators
ostraaten@io:/tmp/iconv$ 
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 notepad-utf8.txt -o /dev/null

The conversion was successful: the "incomplete character" message is gone.
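
A rough Java equivalent of that conversion step, as a sketch under some assumptions: the class Recode and the method toUtf8 are names invented for the example, and it decodes as ISO-8859-1 (always available in Java) where the iconv call above used ISO-8859-15. The two differ only in a handful of code points such as the euro sign, and not for the character used here.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: Recode and toUtf8 are illustrative names.
class Recode {
    // Interprets the bytes in the given source charset and re-encodes the text as UTF-8 (no BOM).
    static byte[] toUtf8(byte[] bytes, Charset from) {
        return new String(bytes, from).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for notepad.txt: "1234" followed by the single byte 0xCF.
        Path in = Files.createTempFile("notepad", ".txt");
        Path out = Files.createTempFile("notepad-utf8", ".txt");
        Files.write(in, new byte[] {'1', '2', '3', '4', (byte) 0xCF});

        Files.write(out, toUtf8(Files.readAllBytes(in), StandardCharsets.ISO_8859_1));

        // Print the converted bytes in hex.
        for (byte b : Files.readAllBytes(out)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // 31 32 33 34 C3 8F
    }
}
```

The source charset has to be known (or guessed) up front. Decoding with the wrong one still produces output, just with the wrong characters.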

UTF-8 is compatible with ASCII. Any file that contains only ASCII characters is also a valid UTF-8 file. (That's one of the reasons why UTF-8 is great for almost everything.) Also, the character encoding isn't a property of the file itself. It's detected from the content of the file. - Jesper
When I create a file with SublimeText with a few ordinary characters it does show up as UTF-8. A file created with Notepad on MS Windows with the same characters shows up as ISO-8859. - onknows
That's because the editors are just guessing what the encoding is, and they choose one that seems to fit. "Ordinary text" can equally validly be interpreted as ASCII, UTF-8 or ISO-8859-1. It depends on the editor's guess of what the appropriate encoding is; different editors may have different rules for guessing the encoding from the content. - Jesper

1 Answer


Seven-bit ASCII is a subset of UTF-8. ISO-8859-1 is Latin-1; its 8-bit bytes (those above 0x7F) are the problematic ones, because on their own they are generally not valid UTF-8.

So someone bypassed UTF-8 in an editor or IDE. Some version control check-ins substitute text back into the source, but that does not seem to be the case here.

UTF-8 is a solid choice, though it needs some care.
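
To make the byte-level difference concrete, here is a small sketch. The class name EncodingBytes and the helper hex are made up for the example. It encodes the digits and the Ï from the question in each charset and prints the resulting bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: EncodingBytes and hex are illustrative names.
class EncodingBytes {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String digits = "1234";
        String accented = "\u00CF"; // I with two dots, written as an escape

        // Pure ASCII text yields identical bytes in all three encodings.
        System.out.println(hex(digits.getBytes(StandardCharsets.US_ASCII)));   // 31 32 33 34
        System.out.println(hex(digits.getBytes(StandardCharsets.ISO_8859_1))); // 31 32 33 34
        System.out.println(hex(digits.getBytes(StandardCharsets.UTF_8)));      // 31 32 33 34

        // A non-ASCII character is where the encodings diverge.
        System.out.println(hex(accented.getBytes(StandardCharsets.ISO_8859_1))); // CF
        System.out.println(hex(accented.getBytes(StandardCharsets.UTF_8)));      // C3 8F
    }
}
```

A file containing only the first kind of text cannot be told apart by encoding, which is why tools report it as ASCII, UTF-8 or ISO-8859 depending on their own guess.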