2 votes

I'm trying to use Data.Aeson (https://hackage.haskell.org/package/aeson-0.6.1.0/docs/Data-Aeson.html) to decode some JSON strings, but it fails to parse strings that contain non-ASCII characters.

As an example, take the following file:

import Data.Aeson
import Data.ByteString.Lazy.Char8 (pack)

test1 :: Maybe Value
test1 = decode $ pack "{ \"foo\": \"bar\"}"

test2 :: Maybe Value
test2 = decode $ pack "{ \"foo\": \"bòz\"}"

When loaded in GHCi, this gives the following results:

*Main> :l ~/test.hs
[1 of 1] Compiling Main             ( /Users/ltomlin/test.hs, interpreted )
Ok, modules loaded: Main.
*Main> test1
Just (Object fromList [("foo",String "bar")])
*Main> test2
Nothing

Is there a reason it doesn't parse the string with the Unicode character? I was under the impression that Haskell was pretty good with Unicode. Any suggestions would be greatly appreciated!

Thanks,

tetigi

EDIT

Upon further investigation using eitherDecode, I get the following error message:

 *Main> test2
 Left "Failed reading: Cannot decode byte '\\x61': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream"

0x61 is actually the character code for 'a' rather than 'z', so I'm not sure why decoding fails there. It seems to be choking on the characters right after the special character!

Changing test2 to be test2 = decode $ pack "{ \"foo\": \"bòz\"}" instead gives the error:

Left "Failed reading: Cannot decode byte '\\xf2': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream"

0xF2 is the code for "ò", which makes a bit more sense.
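
For reference, here is a small sketch (assuming the bytestring and text packages, and a source file saved as UTF-8) that shows where that byte comes from: pack from Char8 just truncates each Char to its low 8 bits, so 'ò' (U+00F2) becomes the lone byte 0xF2, whereas a real UTF-8 encoder produces the two bytes 0xC3 0xB2:

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as C8
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Compare the raw bytes produced by Char8.pack and by a real UTF-8 encoder.
main :: IO ()
main = do
  print (BL.unpack (C8.pack "bòz"))                  -- [98,242,122]      ('ò' truncated to 0xF2)
  print (BL.unpack (TLE.encodeUtf8 (TL.pack "bòz"))) -- [98,195,178,122]  ('ò' as UTF-8: 0xC3 0xB2)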

Do you really use aeson-0.6.1.0? That being said, this seems like some bad encoding between Text and ByteString. Have you tried to encode the text with encodeUtf8 :: Text -> ByteString (from Data.Text.Encoding) instead? – Zeta
@Zeta I just linked the tab I had open at the time - the installed version is 0.7.0.4. I'll give it a shot! – Tetigi
@Zeta That worked! I now get Right (Object fromList [("foo",String "b\242r")]) :) – Tetigi
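
(Note that encodeUtf8 from Data.Text.Encoding returns a strict ByteString while decode takes a lazy one, so the working version presumably looks something like the following sketch; alternatively, decodeStrict from Data.Aeson can be applied to the strict bytes directly.)

import Data.Aeson (Value, decode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode the String to UTF-8 via strict Text, then convert to the lazy
-- ByteString that aeson's decode expects.
test2 :: Maybe Value
test2 = decode (BL.fromStrict (TE.encodeUtf8 (T.pack "{ \"foo\": \"bòz\"}")))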

1 Answer

7 votes

The problem is your use of pack from the Data.ByteString.Lazy.Char8 module: it truncates each character to its low 8 bits, so any non-ASCII character ends up as bytes that are not valid UTF-8 (and anything outside Latin-1 is silently corrupted outright), while aeson expects its input ByteString to be UTF-8 encoded. Instead, encode the string with encodeUtf8 from the text package.
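
For example, here is a sketch of the fixed file using the lazy-Text encoder, so the result is already the lazy ByteString that decode expects (the source file itself must be saved as UTF-8):

import Data.Aeson (Value, decode)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

test1 :: Maybe Value
test1 = decode (TLE.encodeUtf8 (TL.pack "{ \"foo\": \"bar\"}"))

-- encodeUtf8 turns 'ò' into the two bytes 0xC3 0xB2, which aeson's
-- UTF-8 decoder accepts, so test2 now yields Just a Value instead of Nothing.
test2 :: Maybe Value
test2 = decode (TLE.encodeUtf8 (TL.pack "{ \"foo\": \"bòz\"}"))

Note that an OverloadedStrings ByteString literal would not help here, because the IsString instance for ByteString truncates characters to 8 bits just like Char8.pack does.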