85
votes

I'm writing a web application in Google app Engine. It allows people to basically edit html code that gets stored as an .html file in the blobstore.

I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print to an html in order for the user to edit the html code. Everything works great!

Here's my only problem now:

The byte array is having some issues when converting back to a string. Smart quotes and a couple of characters are coming out looking funky. (?'s or japanese symbols etc.) Specifically it's several bytes I'm seeing that have negative values which are causing the problem.

The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?

7
Hi, I know it is a really old post but I am facing similar problems. I am making a man-in-the-middle proxy for ssl. The problem that I am facing is same as yours. I listen to the socket and get the data into InputStream and then into byte[]. Now when I am trying to convert the byte[] into String (I need to use the response body for attacks), I get really funny characters full of smart quotes and question marks and what not. I believe yours problem is same as mine as we both are dealing with html in byte[]. Can you please advice?Parul S
By the way, I went to the extent to find the encoding of my system using Sytem.properties and found it to be "Cp1252". Now, I used String str=new String(buffer, "Cp1252"); but no help.Parul S

7 Answers

141
votes

The byte array contains characters in a special encoding (that you should know). The way to convert it to a String is:

String decoded = new String(bytes, "UTF-8");  // example for one encoding type

By The Way - the raw bytes appear may appear as negative decimals just because the java datatype byte is signed, it covers the range from -128 to 127.


-109 = 0x93: Control Code "Set Transmit State"

The value (-109) is a non-printable control character in UNICODE. So UTF-8 is not the correct encoding for that character stream.

0x93 in "Windows-1252" is the "smart quote" that you're looking for, so the Java name of that encoding is "Cp1252". The next line provides a test code:

System.out.println(new String(new byte[]{-109}, "Cp1252")); 
25
votes

Java 7 and above

You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.

For example, for UTF-8 encoding

String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
11
votes

You can try this.

String s = new String(bytearray);
5
votes
public class Main {

    /**
     * Example method for converting a byte to a String.
     */
    public void convertByteToString() {

        byte b = 65;

        //Using the static toString method of the Byte class
        System.out.println(Byte.toString(b));

        //Using simple concatenation with an empty String
        System.out.println(b + "");

        //Creating a byte array and passing it to the String constructor
        System.out.println(new String(new byte[] {b}));

    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        new Main().convertByteToString();
    }
}

Output

65
65
A
5
votes
public static String readFile(String fn)   throws IOException 
{
    File f = new File(fn);

    byte[] buffer = new byte[(int)f.length()];
    FileInputStream is = new FileInputStream(fn);
    is.read(buffer);
    is.close();

    return  new String(buffer, "UTF-8"); // use desired encoding
}
4
votes

I suggest Arrays.toString(byte_array);

It depends on your purpose. For example, I wanted to save a byte array exactly like the format you can see at time of debug that is something like this : [1, 2, 3] If you want to save exactly same value without converting the bytes to character format, Arrays.toString (byte_array) does this,. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In this case, s is equal to equivalent of [1, 2, 3] in format of character.

3
votes

The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.

To work out whether it is Java or your display that is a problem, do this:

    for(int i=0;i<str.length();i++) {
        char ch = str.charAt(i);
        System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
    }

Java will have mapped any characters it cannot understand to 0xfffd the official character for unknown characters. If you see a '?' in the output, but it is not mapped to 0xfffd, it is your display font or encoding that is the problem, not Java.