2
votes

Is there a simple way to check whether a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

Regex to detect invalid UTF-8 string

P.S.: I am receiving data from an external API, and sometimes (very rarely, but it happens) it returns data with invalid UTF-8 sequences. Trying to insert that data into PostgreSQL results in an appropriate error.

1
I don't think that really makes any sense. A string is a list of characters. UTF-8 is a way of representing characters in a binary format. A string in itself does not have an encoding. - njzk2
Unless you are trying to determine whether a string can be represented completely in UTF-8? - njzk2
The only way to check for valid UTF-8 is to check whether the data contains invalid UTF-8 sequences. The regex you linked is an effective, concise, and efficient way to perform the check. You can, of course, check against your own dictionary in a custom-tuned way. - PA.
I don't know of any built-in method so last time I needed this, I used text.match(/[\x80-\xFF]+/) to gather potential problems, and checked each match against the UTF-8 specification -- 52 lines of code. Using that regexp is actually a pretty neat, fast, and simple way. - Jongware
Or are you trying to figure out whether a sequence of bytes can be interpreted as a UTF-8 encoded string? - njzk2

1 Answer

4
votes

UTF-8 is in fact a simple encoding, but still what you are asking can't be done with a one-liner. You have to:

  1. Override the Content-Type of the response so that your script receives a byte array, which prevents the browser/library from interpreting the response itself
  2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, which is why some byte sequences are invalid.
  3. If an invalid octet is found, skip it
  4. If needed, deserialize the JSON/XML/whatever string into a JavaScript object, handling failures where they occur
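
The steps above can be sketched roughly as follows. This is not the answerer's code, just one possible reading of the list. It assumes an environment that provides the standard TextDecoder and fetch APIs (modern browsers, recent Node.js); the function names and the url parameter are made up for illustration. Reading the body with arrayBuffer() covers step 1, since the bytes arrive undecoded.

```javascript
// Returns true when `bytes` (a Uint8Array) is well-formed UTF-8.
// With { fatal: true } the decoder throws a TypeError on malformed input.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// Steps 2 and 3: decode leniently, then drop whatever could not be decoded.
// The non-fatal decoder replaces each malformed sequence with U+FFFD.
// Caveat: a U+FFFD that was legitimately present in the input is dropped too.
function decodeSkippingInvalid(bytes) {
  const text = new TextDecoder("utf-8").decode(bytes);
  return text.replace(/\uFFFD/g, "");
}

// Steps 1 and 4 (hypothetical helper): read raw bytes, clean them, parse JSON.
// JSON.parse can still throw if the cleaned text is not valid JSON.
async function fetchJsonLeniently(url) {
  const response = await fetch(url);
  const bytes = new Uint8Array(await response.arrayBuffer());
  return JSON.parse(decodeSkippingInvalid(bytes));
}
```

If rejecting bad input is preferable to repairing it, calling isValidUtf8 on the bytes before they go anywhere near the database may be all that is needed.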

Deciding whether a given byte array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again, it's not a one-liner.
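
For illustration, a hand-rolled check along those lines might look like the sketch below. It follows the well-formedness rules of UTF-8 as I understand them (lead byte determines length, continuation bytes are 10xxxxxx, no overlong forms, no surrogates, nothing above U+10FFFF). The function name is hypothetical and the input is assumed to be a Uint8Array or a plain array of byte values.

```javascript
// Returns true when `bytes` holds only well-formed UTF-8 sequences.
function isWellFormedUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if (b0 < 0x80) { i += 1; continue; }   // single-byte (ASCII) character

    let extra; // number of continuation bytes that must follow
    let min;   // smallest code point allowed for this sequence length
    let cp;    // code point assembled from the payload bits
    if ((b0 & 0xE0) === 0xC0) { extra = 1; min = 0x80; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) === 0xE0) { extra = 2; min = 0x800; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) === 0xF0) { extra = 3; min = 0x10000; cp = b0 & 0x07; }
    else return false;                      // stray continuation byte or 0xF8..0xFF

    if (i + extra >= bytes.length) return false; // sequence is truncated

    for (let k = 1; k <= extra; k++) {
      const b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) return false;     // not a 10xxxxxx byte
      cp = (cp << 6) | (b & 0x3F);
    }

    if (cp < min) return false;                      // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    i += extra + 1;
  }
  return true;
}
```

To skip invalid octets instead of rejecting the whole input, the same loop could advance by one byte wherever it currently returns false and copy only the accepted sequences to an output array.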