Ruby: Limiting a UTF-8 string by byte-length

15

votes

Queue names may be up to 255 bytes of UTF-8 characters.

In Ruby (1.9.3), how would I truncate a UTF-8 string by byte count without breaking in the middle of a character? The resulting string should be the longest possible valid UTF-8 string that fits within the byte limit.

ruby string utf-8 byte rabbitmq

20

votes

For Rails >= 3.0 you can use the limit method of ActiveSupport::Multibyte::Chars.

From API docs:

- (Object) limit(limit)

Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example:

'こんにちは'.mb_chars.limit(7).to_s # => "こん"

10

votes

bytesize will give you the length of the string in bytes while (as long as the string's encoding is set properly) operations such as slice won't mangle the string.

A simple approach is to iterate through the string one character at a time:

s.each_char.each_with_object('') do |char, result|
  if result.bytesize + char.bytesize > 255
    break result
  else
    result << char
  end
end

If you were being crafty, you'd copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 and 255 / 4 rounds down to 63.
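The head-start idea could be sketched roughly as follows. The method name limit_with_head_start and the default limit of 255 are assumptions made for illustration, and the sketch assumes the input is valid UTF-8.

```ruby
# Hypothetical helper: limit_with_head_start is an illustrative name,
# not something defined in the answer above.
def limit_with_head_start(str, max_bytes = 255)
  # Every UTF-8 character is at most 4 bytes, so the first max_bytes / 4
  # characters always fit and can be copied without any size checks.
  safe_chars = max_bytes / 4
  result = str[0, safe_chars].dup

  # Check the remaining characters one at a time.
  # str[safe_chars..-1] is nil when the string is shorter than safe_chars.
  (str[safe_chars..-1] || '').each_char do |char|
    break if result.bytesize + char.bytesize > max_bytes
    result << char
  end
  result
end

limit_with_head_start('こんにちは', 7) # => "こん"
```

The loop is the same character-by-character check as above; only the starting point moves.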

Note that this is still not perfect. For example, imagine that the last 3 bytes of your string are the character 'e' (1 byte) followed by a combining acute accent (U+0301, 2 bytes). Slicing off the last 2 bytes produces a string that is still valid UTF-8, but in terms of what the user sees it changes the output from 'é' to 'e', which could change the meaning of the text. This is probably not a huge deal when you're just naming RabbitMQ queues, but it could be important in other circumstances. For example, in French a newsletter headline reading 'Un policier tué' means 'A policeman was killed', whereas 'Un policier tue' means 'A policeman kills'.
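One way to avoid separating a base letter from its combining marks is to iterate over grapheme clusters instead of code points. This is a minimal sketch, assuming Ruby 2.5 or later (where String#each_grapheme_cluster exists); limit_by_grapheme is a hypothetical name.

```ruby
# Hypothetical helper; requires Ruby 2.5+ for String#each_grapheme_cluster.
def limit_by_grapheme(str, max_bytes)
  result = +''
  str.each_grapheme_cluster do |cluster|
    # A cluster such as "e" + U+0301 is kept or dropped as a whole.
    break if result.bytesize + cluster.bytesize > max_bytes
    result << cluster
  end
  result
end

limit_by_grapheme("tue\u0301", 4) # => "tu"
```

With a 4-byte limit the result is "tu" rather than "tue": the accented letter does not fit as a whole, so it is left out entirely.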

5

votes

I think I found something that works.

def limit_bytesize(str, size)
  str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

  # Change to canonical unicode form (compose any decomposed characters).
  # Works only if you're using active_support
  str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

  # Start with a string of the correct byte size, but
  # with a possibly incomplete char at the end.
  new_str = str.byteslice(0, size)

  # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
  # (idea from halfelf).
  # The empty? check avoids calling force_encoding on nil when nothing fits.
  until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?
    # remove the invalid char
    new_str = new_str.slice(0..-2)
  end
  new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4)
=> "abc"
>> limit_bytesize("abc\u2014d", 5)
=> "abc"
>> limit_bytesize("abc\u2014d", 6)
=> "abc—"
>> limit_bytesize("abc\u2014d", 7)
=> "abc—d"

Update...

Decomposed behavior without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 7)
=> "abcéd"

Decomposed behavior with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abc"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcéd"
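On Ruby 2.1 and later, the slice-then-repair idea can probably be written more compactly with String#scrub. This is only a sketch under that assumption; limit_bytesize_scrub is a hypothetical name, it expects valid UTF-8 input, and it does not compose decomposed characters.

```ruby
# Hypothetical helper; assumes Ruby 2.1+ (String#scrub) and valid UTF-8 input.
def limit_bytesize_scrub(str, size)
  # byteslice may cut the final character in half; scrub('') then deletes
  # the leftover invalid bytes at the end.
  str.byteslice(0, size).scrub('')
end

limit_bytesize_scrub("abc\u2014d", 5) # => "abc"
limit_bytesize_scrub("abc\u2014d", 6) # => "abc\u2014"
```

Because the input is valid and the slice starts at byte 0, the only invalid bytes scrub can find are the fragment of the last character.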

1

votes

How about this:

s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog"
count = 0
while true
  more_truncate = "a" + (255-count).to_s
  s2 = s.unpack(more_truncate)[0]
  s2.force_encoding 'utf-8'

  if s2[-1].valid_encoding?
    break
  else
    count += 1
  end
end

s2.force_encoding 'utf-8'
puts s2

1

votes

Rails 6 provides a String#truncate_bytes method that behaves like truncate but takes a byte count instead of a character count. It always returns a valid string: it never cuts in the middle of a multibyte character. The omission marker ("…" by default) counts toward the byte limit.

Taken from the doc:

>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".size
=> 20
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".bytesize
=> 80
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".truncate_bytes(20)
=> "🔪🔪🔪🔪…"

0

votes

Without Rails

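Fredrick Cheung's answer is an excellent O(n) starting point that inspired this solution, which needs only O(log n) probes (each probe still slices a prefix by character index, which is generally not constant-time for multibyte strings):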

def limit_bytesize(str, max_bytesize)
  return str unless str.bytesize > max_bytesize

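  # find the smallest index whose prefix exceeds the byte limit;
  # that index equals the number of characters that fit
  just_over = (0...str.size).bsearch { |l| str[0..l].bytesize > max_bytesize }
  # str[0, just_over] also handles just_over == 0 (str[0..-1] would return the whole string)
  str[0, just_over]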
end

I believe this also achieves the automatic max_bytesize / 4 speedup mentioned in that answer, since bsearch starts in the middle.
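A different approach avoids searching altogether by looking at the bytes around the cut point. In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so the cut can be moved back at most three bytes until it sits on a character boundary. This is a sketch, not a tested replacement for the method above; limit_bytesize_backoff is a hypothetical name and the input is assumed to be valid UTF-8.

```ruby
# Hypothetical helper; assumes valid UTF-8 input and max_bytesize >= 0.
def limit_bytesize_backoff(str, max_bytesize)
  return str if str.bytesize <= max_bytesize

  cut = max_bytesize
  # If the byte just past the cut is a continuation byte (0b10xxxxxx),
  # the cut falls inside a character, so move it back to the lead byte.
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
  str.byteslice(0, cut)
end

limit_bytesize_backoff("abc\u2014d", 5) # => "abc"
limit_bytesize_backoff("\u20ACuro", 2)  # => ""
```

Like the other code-point-based solutions, this can still separate a base letter from a following combining mark.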
