0
votes

I decided to build my own CSV parser in Elixir as a practice project and managed to get something working without too much hassle.

I know that this problem has already been solved by some of the "top" Elixir devs, so I decided to take a look at how they went about it.

I started by looking at the source code for the Elixir library NimbleCSV. It was written by José Valim, the creator of the language, with contributions from a few notable Elixir devs, so I thought it was a good choice.

In the parse_string function they check the string's length with byte_size(string). I think I understand how this function works. For example:

iex()> byte_size(<<104, 101, 108, 108, 111>>)
5
iex()> byte_size(<<104, 101, 108, 108, 111::9>>)
6

The first binary is 40 bits, which is 5 bytes (each integer segment in a binary defaults to 8 bits in Elixir unless told otherwise).

In the second I am giving one of the segments a size of 9 bits, so the total is 41 bits. byte_size reports this as 6 bytes because it rounds up to the next whole byte.
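To double-check the arithmetic, here is a minimal sketch using Kernel.bit_size/1 next to byte_size/1. The variable names `bin` and `bits` are just for illustration.

```elixir
# Five 8-bit segments: a proper binary (bit count divisible by 8).
bin = <<104, 101, 108, 108, 111>>
# The last segment is 9 bits wide, so this is a bitstring, not a binary.
bits = <<104, 101, 108, 108, 111::9>>

IO.inspect(bit_size(bin))    # 40
IO.inspect(byte_size(bin))   # 5
IO.inspect(bit_size(bits))   # 41
IO.inspect(byte_size(bits))  # 6, rounded up from 41 bits

IO.inspect(is_binary(bin))   # true
IO.inspect(is_binary(bits))  # false
```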

Sorry if some of the terminology is not exactly right.

That makes sense to me. My question is: why would they choose this function over String.length in this case? If they are just getting the length of a string, wouldn't both return the same result?

1 Answer

3
votes

String.length/1 returns the number of graphemes (each of which may take one or more bytes), while byte_size/1 counts the raw bytes of the data.

iex> byte_size "👩‍👩‍👧"
18
iex> "👩‍👩‍👧" <> <<0>>
<<240, 159, 145, 169, 226, 128, 141, 240, 159, 145, 169, 226, 128, 141, 240, 159, 145, 167, 0>>

iex> String.length "👩‍👩‍👧"
1

iex> String.length "a"
1
iex> byte_size "a"
1
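The difference also shows up with ordinary accented text, not only emoji. A small sketch, assuming two hypothetical strings that render identically ("héllo") but are encoded differently, one with a precomposed é and one with e followed by a combining accent:

```elixir
# "é" as a single code point (U+00E9, 2 bytes in UTF-8).
precomposed = "h\u00E9llo"
# "e" followed by a combining acute accent (U+0301, 2 bytes in UTF-8).
decomposed = "he\u0301llo"

IO.inspect(byte_size(precomposed))                 # 6
IO.inspect(String.length(precomposed))             # 5

IO.inspect(byte_size(decomposed))                  # 7
IO.inspect(length(String.codepoints(decomposed)))  # 6
IO.inspect(String.length(decomposed))              # 5
```

So the two functions only agree when every grapheme happens to be a single byte, as in plain ASCII.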

From the docs:

String and binary operations

To act according to the Unicode Standard, many functions in this module run in linear time, as they need to traverse the whole string considering the proper Unicode codepoints.

For example, String.length/1 will take longer as the input grows. On the other hand, Kernel.byte_size/1 always runs in constant time (i.e. regardless of the input size).
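A rough way to see this for yourself is to time both calls on a large string with :timer.tc/1. This is only a sketch and not a proper benchmark; the 5,000,000 character size and the variable names are arbitrary, and the actual timings depend on your machine.

```elixir
# Build a large ASCII string so both functions return the same count.
big = String.duplicate("a", 5_000_000)

# :timer.tc/1 returns {elapsed_microseconds, result}.
{t_bytes, n_bytes} = :timer.tc(fn -> byte_size(big) end)
{t_len, n_len} = :timer.tc(fn -> String.length(big) end)

IO.puts("byte_size:     #{n_bytes} in #{t_bytes} µs")
IO.puts("String.length: #{n_len} in #{t_len} µs")
```

Both report the same number here because the string is pure ASCII, but String.length/1 has to walk the whole string to get there, which is likely why a parser that only needs byte offsets would prefer byte_size/1.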

Not directly relevant, but if you want to know more about Unicode and character encoding, you can read this article and watch this video.