2
votes

I have some text tokens separated by whitespace.

Something like 123 10.03.1 TEXT1 TEXT2 TEXT3 TEXT4 TEXT5 TEXT6 2015/10/10 2012.

I am able to capture everything except the "TEXT1 TEXT2 TEXT3 TEXT4 TEXT5 TEXT6" part. For that text, I want to repeat a capture group over input like this:

TEXT1   TEXT2  TEXT3    TEXT4 TEXT5 TEXT6

The repeated capture group I have in mind is something like:

(\s*\w)*

But I want to ignore the whitespace. Is there a way to just ignore the whitespace in the regex?

I will use boost::regex_search to get the capture groups. Is there a way to achieve that? I tried using "?:" on the capture group, but I am probably missing something.

2
Just use matching with \S+ to match a non-whitespace chunk of text. - Wiktor Stribiżew

2 Answers

5
votes

I strongly suspect you want something capable of parsing a grammar. Here's how to do it with Boost Spirit X3:

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <boost/tuple/tuple_io.hpp>
#include <boost/fusion/adapted/boost_tuple.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace std {
    // hack for debug output
    std::ostream& operator<<(std::ostream& os, std::vector<std::string> const& v) {
        for (auto i = 0ul; i<v.size(); ++i) {
            if (i) os << " ";
            os << v[i];
        }
        return os;
    }
}

namespace x3 = boost::spirit::x3;

int main() {
    std::string const input = "123 10.03.1    TEXT1   TEXT2   TEXT3      TEXT4  TEXT5 \t   \tTEXT6 2015/10/10 \t  2012";

    int num;
    std::string version;
    std::vector<std::string> texts;
    std::string date;
    int year;

    auto attr = boost::tie(num, version, texts, date, year);
    bool ok = false;

    {
        using namespace x3;

        auto date_    = raw [ repeat(4) [ digit ] >> '/' >> repeat(2) [ digit ] >> '/' >> repeat(2) [ digit ] ];
        auto version_ = lexeme [ +char_("0-9.") ];
        auto text_    = lexeme [ alpha >> *alnum ];

        ok = phrase_parse(input.begin(), input.end(),
                int_ >> version_ >> *text_ >> date_ >> int_ /* >> eoi */,
                x3::space,
                attr);
    }

    if (ok) {
        std::cout << "parsed: " << attr << "\n";
    } else {
        std::cout << "parse failed\n";
    }
}

Prints

parsed: (123 10.03.1 TEXT1 TEXT2 TEXT3 TEXT4 TEXT5 TEXT6 2015/10/10 2012)

Note how this does much more than just split up your input. It ignores whitespace where required, assigns the converted values to integers, puts the TEXTn elements in a vector, and so on.

You can also parse these directly from a stream if you need to (see boost::spirit::istream_iterator).

2
votes

It's unclear what you actually want. What do you want "captured"? How will you read the value, and what do you expect it to be?

As it is described now, you could just use .* to capture all that.

If you really want to just ignore the whitespace, do a regex replace that turns every \s+ run into a single space " ".

UPDATE Sample:

Live On Coliru

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main() {
    std::string input = "TEXT1    TEXT2    TEXT3  TEXT4  TEXT5    TEXT6";

    std::cout << input << "\n";
    input = boost::regex_replace(input, boost::regex("\\s+"), " ");

    std::cout << input << "\n";

}

If you want to parse the individual tokens, use Boost Tokenizer or boost::regex_iterator.

If you have a more complicated grammar, consider using Boost Spirit Qi.