Note - I'm posting this as a 'Q & A' because I haven't found an existing question on SO that matches the specific scenario of deserializing JSON from a Google Cloud Pub/Sub stream while preserving the UTF-8 character set. I have a solution for this and I want to post an answer to make it available to the community (see https://stackoverflow.com/help/self-answer):
"If you have a question that you already know the answer to, and you would like to document that knowledge in public so that others (including yourself) can find it later, it's perfectly okay to ask and answer your own question on a Stack Exchange site."
I'm receiving JSON from a Google Cloud Pub/Sub URL, and I know that it's UTF-8 encoded. I can see this by examining the response I get when I make a request directly to the Pub/Sub URL using Fiddler.
I can deserialize the JSON like this (using the Google Gson library):
URL myUrl = new URL("myUrl");
HttpURLConnection connection = (HttpURLConnection) myUrl.openConnection();
MyResponseObject myResponseObject;
try {
    // Read the response body and let Gson map it onto MyResponseObject
    myResponseObject = new Gson()
            .fromJson(new BufferedReader(new InputStreamReader(connection.getInputStream())), MyResponseObject.class);
} catch (IOException e) {
    throw new UncheckedIOException(e);
}
When I inspect myResponseObject in Eclipse, some of the non-ASCII characters in the JSON aren't displayed correctly.
Then, after I load the resulting dataset into BigQuery, I see characters like this in the BigQuery data in place of some of the non-ASCII characters:
��
The '�' is the Unicode replacement character (U+FFFD). It indicates that some bytes couldn't be decoded with the charset in use, so the original text has been lost. How do I preserve the encoding?
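To illustrate how the replacement character can appear, here is a minimal, self-contained sketch that needs neither Gson nor a network connection. It decodes the same UTF-8 bytes twice, once with US-ASCII and once with UTF-8. The class name CharsetDemo, the helper readAll, and the sample payload are hypothetical and only stand in for the Pub/Sub response. Whether a charset mismatch is what happens in the code above depends on the platform default charset, which is what the single-argument InputStreamReader constructor uses.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {

    // Reads the whole stream as text, decoding its bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical payload: a small JSON document containing one non-ASCII character.
        String json = "{\"name\":\"caf\u00e9\"}";
        byte[] utf8Bytes = json.getBytes(StandardCharsets.UTF_8);

        // Decoding with a charset that cannot represent the bytes loses the character.
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.US_ASCII);
        // Decoding with the charset the bytes were written in round-trips the text.
        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);

        System.out.println("US-ASCII contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("UTF-8 round-trips: " + right.equals(json));
    }
}
```

The accented character occupies two bytes in UTF-8, and the US-ASCII decoder substitutes each undecodable byte separately. That is consistent with seeing two '�' characters where a single character was expected.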