Rust split vector of bytes by specific bytes

Question

I have a file containing information that I want to load into my application. The file has some header information as strings, followed by multiple entries, each terminated by ';'. Some entries are used for different types, so their length is variable, but all values are separated by ','.

Example:

\Some heading
\Some other heading

I,003f,3f3d00ed,"Some string",00ef,
0032,20f3
;

Y,02d1,0000,0000,"Name of element",
00000007,0,

00000000,0,
;

Y,02d1,0000,0000,"Name of element",30f0,2d0f,02sd,
00000007,0,

00000000,0,
;

I is one type of element; Y is another type of element.

What I want to achieve is to load the elements into different structs to work with. Most of the values are numbers, but some are strings.

What I was able to achieve is:

Import the file as a Vec<u8>
Put it in a string (can't do that directly, because there may be UTF-8 problems in elements I'm not interested in)
Split it into a Vec<&str> by ';'
Pass the strings to functions depending on their type
Split each one into a Vec by '\n'
Split each line into a Vec by ','
Read out the data I need and interpret it from the strings (u32::from_str_radix, for example)
Build the struct and return it
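For reference, the steps above might look roughly like the sketch below. The IElement struct, its field names, and the choice of which columns to read are invented for illustration and are not part of the real format. It uses String::from_utf8_lossy, which only copies the input when it actually contains invalid UTF-8, and it drops empty fields for simplicity.

```rust
/// Hypothetical struct for an "I" entry; the field names are invented for illustration.
#[derive(Debug, PartialEq)]
struct IElement {
    id: u32,
    value: u32,
    name: String,
}

/// Follows the listed steps: bytes -> string -> entries -> lines -> fields.
fn parse_i_elements(bytes: &[u8]) -> Vec<IElement> {
    // Invalid UTF-8 becomes U+FFFD; the input is only copied if that happens.
    let text = String::from_utf8_lossy(bytes);
    text.split(';')
        .filter_map(|entry| {
            let fields: Vec<&str> = entry
                .split('\n')
                .filter(|line| !line.starts_with('\\')) // skip header lines
                .flat_map(|line| line.split(','))
                .map(str::trim)
                .filter(|field| !field.is_empty())
                .collect();
            if fields.first() != Some(&"I") {
                return None; // not an "I" entry
            }
            Some(IElement {
                id: u32::from_str_radix(fields.get(1)?, 16).ok()?,
                value: u32::from_str_radix(fields.get(2)?, 16).ok()?,
                name: fields.get(3)?.trim_matches('"').to_string(),
            })
        })
        .collect()
}
```

Note that this naive version would mis-split any string value that itself contains ',' or ';'.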

This doesn't seem to be the right way to go, since I start with bytes, allocate them as a string, and then allocate again to convert most of the values to numbers.

So my question is:

Can I split the Vec<u8> into multiple vectors separated by ';' (byte 59), split these further by '\n', and split those further by ','? I assume it would be more performant to convert the bytes directly to the correct data type. Or is my concern unfounded?
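To make the idea concrete, here is a minimal sketch of such a three-level split using the standard slice::split method. The function names split_entries and parse_hex are made up for this example. Every field is a sub-slice borrowed from the original buffer, so only the outer Vecs allocate, and std::str::from_utf8 validates a field without copying it. The split is naive: it does not know about quoted strings.

```rust
/// Splits raw bytes into entries (';'), lines ('\n') and fields (',').
/// Each field borrows from `data`; nothing is copied.
/// Caveat: separators inside quoted strings are NOT handled.
fn split_entries(data: &[u8]) -> Vec<Vec<Vec<&[u8]>>> {
    data.split(|&b| b == b';')
        .map(|entry| {
            entry
                .split(|&b| b == b'\n')
                .map(|line| line.split(|&b| b == b',').collect::<Vec<_>>())
                .collect::<Vec<_>>()
        })
        .collect()
}

/// Interprets one field as a hexadecimal number without allocating.
/// Returns None if the field is not valid UTF-8 or not a hex number.
fn parse_hex(field: &[u8]) -> Option<u32> {
    let text = std::str::from_utf8(field).ok()?;
    u32::from_str_radix(text.trim(), 16).ok()
}
```

Trailing separators produce empty slices (for example after the final ';'), which the caller would have to skip.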

@IvanC vec.split answers the title, but I don't think it's the solution. Seems to me like the solution is parsing, at least ad-hoc if not using a "proper" parser library. Plus you can't really split a Split so you'd have to invent a scheme to pair the split separator around in order to know your 3 levels of boundaries, which is basically ad-hoc parsing except bad. — Masklinn
Are you dead set on the config format you're using? You might want to look into some typed, easily serializable formats. — AdaShoelace
I'm set at the format, because I read a file from a 3rd party software — Michael Hugi

Acorn Acorn · Accepted Answer · 2020-11-21T19:01:04

Can I split the Vec<u8> into multiple vectors separated by ';' (byte 59), split these further by '\n' and split this further by ','?

Usually that is not going to work if the separator bytes can also appear in other places, for example embedded in the quoted strings.
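As an illustration of the problem, a field splitter has to track whether it is inside a string. The sketch below assumes that strings are delimited by double quotes and never contain an escaped quote; that is a guess about the format, not something the question confirms. The name split_fields is made up for this example.

```rust
/// Splits `line` on commas that are outside double-quoted strings.
/// Assumption: a '"' always opens or closes a string (no escape sequences).
fn split_fields(line: &[u8]) -> Vec<&[u8]> {
    let mut fields = Vec::new();
    let mut in_quotes = false;
    let mut start = 0;
    for (i, &b) in line.iter().enumerate() {
        match b {
            b'"' => in_quotes = !in_quotes,
            b',' if !in_quotes => {
                fields.push(&line[start..i]);
                start = i + 1;
            }
            _ => {}
        }
    }
    // Whatever follows the last separator is the final field.
    fields.push(&line[start..]);
    fields
}
```

On a line such as Y,02d1,"Name, with; comma",30f0 a plain split on ',' yields five pieces, while this version keeps the quoted string as one field.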

Then there is also the question of how the strings are encoded, whether there are escape sequences, etc.

I assume it would be more performant to apply the bytes directly to the correct data-type. Or is my concern wrong?

Reading the entire file into memory and then performing several copies from one Vec to another Vec and another and so on is going to be slower than a single pass with a state machine of some kind. Not to mention it will make working with files bigger than memory extremely slow or impossible.
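A single-pass state machine could be sketched as follows. This is only an outline under several assumptions: strings are double-quoted without escape sequences, line breaks outside strings carry no meaning, and the header lines have already been skipped. The names Entry and for_each_entry are invented for the example. Entries are handed to a callback one at a time, so the whole file never has to be in memory.

```rust
use std::io::{self, Read};

/// One entry: its fields as raw bytes (separators and line breaks removed).
type Entry = Vec<Vec<u8>>;

/// Reads `reader` byte by byte and calls `on_entry` for every ';'-terminated entry.
/// Pass a `BufReader` for files so that each byte is not a separate read call.
fn for_each_entry<R: Read>(reader: R, mut on_entry: impl FnMut(Entry)) -> io::Result<()> {
    let mut entry: Entry = Vec::new();
    let mut field: Vec<u8> = Vec::new();
    let mut in_quotes = false; // the only state besides the buffers

    for byte in reader.bytes() {
        let b = byte?;
        match b {
            b'"' => {
                in_quotes = !in_quotes;
                field.push(b);
            }
            // A comma outside a string ends the current field.
            b',' if !in_quotes => entry.push(std::mem::take(&mut field)),
            // A semicolon outside a string ends the current entry.
            b';' if !in_quotes => {
                if !field.is_empty() {
                    entry.push(std::mem::take(&mut field));
                }
                on_entry(std::mem::take(&mut entry));
            }
            // Line breaks outside strings are ignored.
            b'\n' | b'\r' if !in_quotes => {}
            _ => field.push(b),
        }
    }
    Ok(())
}
```

Since &[u8] implements Read, the same function can be tried on an in-memory buffer before pointing it at a file.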

I wouldn't worry about performance until you have a working algorithm, in particular if you have to work with an undocumented, non-trivial, third-party format and you are not experienced at reading binary formats.
