7
votes

I try to parse TPCH files with Boost Spirit QI. My implementation inspired by the employee example of Spirit QI ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp ). The data is in csv format and the tokens are delimited with a '|' character.

It works but it is very slow (20 sec. for 1 GB).

Here is my qi grammer for the lineitem file:

struct lineitem {
    int l_orderkey;
    int l_partkey;
    int l_suppkey;
    int l_linenumber;
    std::string l_quantity;
    std::string l_extendedprice;
    std::string l_discount;
    std::string l_tax;
    std::string l_returnflag;
    std::string l_linestatus;
    std::string l_shipdate;
    std::string l_commitdate;
    std::string l_recepitdate;
    std::string l_shipinstruct;
    std::string l_shipmode;
    std::string l_comment;
};

BOOST_FUSION_ADAPT_STRUCT( lineitem,
    (int, l_orderkey)
    (int, l_partkey)
    (int, l_suppkey)
    (int, l_linenumber)
    (std::string, l_quantity)
    (std::string, l_extendedprice)
    (std::string, l_discount)
    (std::string, l_tax)
    (std::string, l_returnflag)
    (std::string, l_linestatus)
    (std::string, l_shipdate)
    (std::string, l_commitdate)
    (std::string, l_recepitdate)
    (std::string, l_shipinstruct)
    (std::string, l_shipmode)
    (std::string, l_comment)) 

vector<lineitem>* lineitems=new vector<lineitem>();

phrase_parse(state->dataPointer,
    state->dataEndPointer,
    (*(int_ >> "|" >>
    int_ >> "|" >> 
    int_ >> "|" >>
    int_ >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' 
    ) ), space, *lineitems
);

The problem seems to be the character parsing. It is much slower than other conversions. Is there a better way to parse variable length tokens into strings?

3
I once experienced the same. Spirit qi seems not to be able to handle variable length strings efficiently. Anyone has a solution for that?muehlbau

3 Answers

5
votes

I found a solution to my problem. As described in this post Boost Spirit QI grammar slow for parsing delimited strings the performance bottleneck is the string handling of Spirit qi. All other data types seem to be quite fast.

I avoid this problem through doing the handling of the data on my own instead of using the Spirit qi handling.

My solution uses a helper class which offers functions for every field of the csv file. The functions store the values into a struct. Strings are stored in a char[]s. Hits the parser a newline character it calls a function which adds the struct to the result vector. The Boost parser calls this functions instead of storing the values into a vector on its own.

Here is my code for the region.tbl file of the TCPH Benchmark:

struct region{
    int r_regionkey;
    char r_name[25];
    char r_comment[152];
};

class regionStorage{
public:
regionStorage(vector<region>* regions) :regions(regions), pos(0) {}
void storer_regionkey(int const&i){
    currentregion.r_regionkey = i;
}

void storer_name(char const&i){
    currentregion.r_name[pos] = i;
    pos++;
}

void storer_comment(char const&i){
    currentregion.r_comment[pos] = i;
    pos++;
}

void resetPos() {
    pos = 0;
}

void endOfLine() {
    pos = 0;
    regions->push_back(currentregion);
}

private:
vector<region>* regions;
region currentregion;
int pos;
};


void parseRegion(){

    vector<region> regions;
    regionStorage regionstorageObject(&regions);
    phrase_parse(dataPointer, /*< start iterator >*/    
     state->dataEndPointer, /*< end iterator >*/
     (*(lexeme[
     +(int_[boost::bind(&regionStorage::storer_regionkey, &regionstorageObject, _1)] - '|') >> '|' >>
     +(char_[boost::bind(&regionStorage::storer_name, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::resetPos, &regionstorageObject)] >>
     +(char_[boost::bind(&regionStorage::storer_comment, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::endOfLine, &regionstorageObject)]
    ])), space);

   cout << regions.size() << endl;
}

It is not a pretty solution but it works and it is much faster. ( 2.2 secs for 1 GB TCPH data, multithreaded)

4
votes

The problem mainly comes from appending individual char elements to std::string container. According to your grammar, for each std::string attribute the allocation starts when a char is met and stops when you find a | separator. So, at first there are sizeof(char)+1 reserved bytes (null-terminated "\0"). The compiler will have to run the allocator of std::string depending on the allocators doubling algorithm! That means the memory has to be re-allocated very frequently for small strings. This means your string is copied to a memory allocation double it's size and the previous allocation is freed, at intervals of 1,2,4,6,12,24... characters. No wonder it was slow, this causes huge problems with the frequent malloc calls; more heap fragmentation, a bigger linked list of free memory blocks, variable (small) sizes of those memory blocks which at it's turn causes issues with longer scanning of memory for the application's allocations during it's entire lifetime. tldr; the data becomes fragmented and widely dispersed in the memory.

Proof? The following code is called by the char_parser each time a valid character is met in your Iterator. From Boost 1.54

/boost/spirit/home/qi/char/char_parser.hpp

if (first != last && this->derived().test(*first, context))
{
    spirit::traits::assign_to(*first, attr_);
    ++first;
    return true;
}
return false;

/boost/spirit/home/qi/detail/assign_to.hpp

// T is not a container and not a string
template <typename T_>
static void call(T_ const& val, Attribute& attr, mpl::false_, mpl::false_)
{
    traits::push_back(attr, val);
}

/boost/spirit/home/support/container.hpp

template <typename Container, typename T, typename Enable/* = void*/>
struct push_back_container
{
    static bool call(Container& c, T const& val)
    {
        c.insert(c.end(), val);
        return true;
    }
};

The correction follow-up code you posted (changing your struct to char Name[Size]) is basically the same as adding a string Name.reserve(Size) statement directive. However, there's no directive for this at the moment.

The Solution:

/boost/spirit/home/support/container.hpp

template <typename Container, typename T, typename Enable/* = void*/>
struct push_back_container
{
    static bool call(Container& c, T const& val, size_t initial_size = 8)
    {
        if (c.capacity() < initial_size)
            c.reserve(initial_size);
        c.insert(c.end(), val);
        return true;
    }
};

/boost/spirit/home/qi/char/char_parser.hpp

if (first != last && this->derived().test(*first, context))
{
    spirit::traits::assign_to(*first, attr_);
    ++first;
    return true;
}
if (traits::is_container<Attribute>::value == true)
    attr_.shrink_to_fit();
return false;

I haven't tested it but I assume it can speed up char parsers over string attributes by over 10x like you saw. It would be a great optimization feature in a Boost Spirit update, including a reserve(initial_size)[ +( char_ - lit("|") ) ] directive that sets the initial buffer size.

0
votes

Are you using -O2 when compiling?

Boosts libraries have a lot of redundancy that are removed when using optimization flags.

Another possible solution is use Repetition Parser Directive: http://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/qi/reference/directive/repeat.html