1
votes

I need rust code to read lines of a file, and break them into an array of slices. The working code is

use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let mut f = stdin.lock();
    let mut line : Vec<u8> = Vec::new();

    loop {
       line.clear();
       let sz = f.read_until(b'\n', &mut line).unwrap();
       if sz == 0 {break};
       let body : Vec<&[u8]> = line.split(|ch| *ch == b'\t').collect();
       DoStuff(body);
    }
}

However, that code is slower than I'd like. The code I want to write is

use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let mut f = stdin.lock();
    let mut line : Vec<u8> = Vec::new();
    let mut body: Vec<&[u8]> = Vec::new();

    loop {
       line.clear();
       let sz = f.read_until(b'\n', &mut line).unwrap();
       if sz == 0 {break};
       body.extend(&mut line.split(|ch| *ch == b'\t'));
       DoStuff(body);
       body.clear();
    }
}

but that runs afoul of the borrow checker.

In general, I'd like a class containing a Vec<u8> and an associated Vec<&[u8]>, which is the basis of a lot of C++ code I'm trying to replace.

Is there any way I can accomplish this?

I realize that I could replace the slices with pairs of integers, but that seems clumsy.

No, I can't just use the items from the iterator as they come through -- I need random access to the individual column values. In the simplified case where I do use the iterator directly, I get a 3X speedup, which is why I suspect a significant speedup by replacing collect with extend.

Other comments on this code is also welcome.

2
Can you include the signature for DoStuff? It looks like it currently takes ownership of a Vec<&[u8]>. And is that signature open for changes? Transferring ownership is antithetical to reusing allocations.kmdreko
A signature like &[&[u8]] wouldn't take ownership of the vec.kmdreko
I don't think that would solve the problem with the second example. If you extend body by borrowing into line then body's type is now Vec<&'line [u8]>, but line is mutated with line.clear(). How much slower than the cpp implementation are we talking @Andy? Do you have an upper bound for the number of splits of a line? Does Vec::with_capacity() help? After trying that, I might measure the performance of body.extend(line.split(|ch| *ch == b'\t').map(|v| v as *const [u8])); and see if it really makes much of a differenceflumpb
Sorry that i was sloppy with DoStuff(). It should have just been a comment that some use goes there. DoStuff() is not the issue; the problem is with extend.Andy Jewell
I don't understand how I would use with_capacity -- collect creates the Vec. The C++ is about 2.5X faster, although it's doing a little bit more stuff. When I replace my extend call with your extend(map) call, it tells me the trait std::iter::Extend<*const [u8]> is not implemented for `std::vec::Vec<&[u8]>Andy Jewell

2 Answers

0
votes

Just for sake of completeness, and since you are coming from C++, a more Rusty way of writing the code would be

use std::io::{self, BufRead};

fn do_stuff(body: &[&str]) {}

fn main() {
    for line in io::stdin().lock().lines() {
        let line = line.unwrap();
        let body = line.split('\t').collect::<Vec<_>>();
        do_stuff(&body);
    }
}

This uses .lines() from BufRead to get an iterator over \n-delimited lines from the input. It assumes that your input is actually valid UTF8, which in your code was not a requirement. If it is not UTF8, use .split(b'\n'), .split(b'\t') and &[&u8] instead.

Notice that this does allocate and subsequently free a new Vec via .collect() every time the loop executes. We are somewhat relying on the allocator's free-list to make this cheap. But it is correct in all cases.


The reason your second example does not compile (after fixing the DoStuff(&body) is this:

12 |        line.clear();
   |        ^^^^^^^^^^^^ mutable borrow occurs here
...
15 |        body.extend(&mut line.split(|ch| *ch == b'\t'));
   |        ----             ---- immutable borrow occurs here
   |        |
   |        immutable borrow later used here

The problem here is the loop: Line 12 line.clear() will execute after line 15 body.extend() from the second iteration onwards. But the compiler has figured out that body borrows from line (it contains references to the fields inside line). The call to line.clear() mutably borrows line - all of line - and as far as the compiler is concerned is free to do anything it wants with the data it holds. This is an error because line.clear() could possibly mutate data that body has borrowed immutably. The compiler does not reason about the fact that .clear() obviously does not mutate the borrowed data, quite the opposite in fact, but the compiler's reasoning stops at the function signature.

0
votes

I seems like the answer is

  1. No, it's not possible to reuse the vector of slices.
  2. The way to go is to make something like a slice, but with integer offsets rather than pointers. Code is attached, comments welcome.
  3. Performance is currently 15% better than the C++, but the C++ is part of a larger system, and is probably doing some additional stuff.
/// pointers into a vector, simulating a slice without the ownership issues
#[derive(Debug, Clone)]
pub struct FakeSlice {
    begin: u32,
    end: u32,
}

/// A line of a text file, broken into columns.
/// Access to the `lines` and `parts` is allowed, but should seldom be necessary
/// `line` does not include the trailing newline
/// An empty line contains one empty column
///```
/// use std::io::BufRead;
/// let mut data = b"one\ttwo\tthree\n";
/// let mut dp = &data[..];
/// let mut line = cdx::TextLine::new();
/// let eof = line.read(&mut dp).unwrap();
/// assert_eq!(eof, false);
/// assert_eq!(line.strlen(), 13);
/// line.split(b'\t');
/// assert_eq!(line.len(), 3);
/// assert_eq!(line.get(1), b"two");
///```
#[derive(Debug, Clone)]
pub struct TextLine {
    pub line: Vec<u8>,
    pub parts: Vec<FakeSlice>,
}

impl TextLine {
    /// make a new TextLine
    pub fn new() -> TextLine {
        TextLine {
            line: Vec::new(),
            parts: Vec::new(),
        }
    }
    fn clear(&mut self) {
        self.parts.clear();
        self.line.clear();
    }
    /// How many column in the line
    pub fn len(&self) -> usize {
        self.parts.len()
    }
    /// How many bytes in the line
    pub fn strlen(&self) -> usize {
        self.line.len()
    }
    /// should always be false, but required by clippy
    pub fn is_empty(&self) -> bool {
        self.parts.is_empty()
    }
    /// Get one column. Return an empty column if index is too big.
    pub fn get(&self, index: usize) -> &[u8] {
        if index >= self.parts.len() {
            &self.line[0..0]
        } else {
            &self.line[self.parts[index].begin as usize..self.parts[index].end as usize]
        }
    }
    /// Read a new line from a file, should generally be followed by `split`
    pub fn read<T: std::io::BufRead>(&mut self, f: &mut T) -> std::io::Result<bool> {
        self.clear();
        let sz = f.read_until(b'\n', &mut self.line)?;
        if sz == 0 {
            Ok(true)
        } else {
            if self.line.last() == Some(&b'\n') {
                self.line.pop();
            }
            Ok(false)
        }
    }
    /// split the line into columns
    /// hypothetically you could split on one delimiter, do some work, then split on a different delimiter.
    pub fn split(&mut self, delim: u8) {
        self.parts.clear();
        let mut begin: u32 = 0;
        let mut end: u32 = 0;
        #[allow(clippy::explicit_counter_loop)] // I need the counter to be u32
        for ch in self.line.iter() {
            if *ch == delim {
                self.parts.push(FakeSlice { begin, end });
                begin = end + 1;
            }
            end += 1;
        }
        self.parts.push(FakeSlice { begin, end });
    }
}