4
votes

I have a text consisting of varying regex-matched delimiters, each followed by some text. In this example there are three delimiter patterns (PatternA, PatternB, PatternC), and the text looks like this:

|..StringMatchingA..|..Text1..|..StringMatchingB..|..Text2..|..StringMatchingA..|..Text3..|..StringMatchingC..|..Text4..|

I am looking for an efficient Java solution to extract this information as a list of triplets:

  • {PatternA, StringMatchingA, Text1}
  • {PatternB, StringMatchingB, Text2}
  • {PatternA, StringMatchingA, Text3}
  • {PatternC, StringMatchingC, Text4}

With this information I know, for each triplet, which pattern was matched as well as the string that matched it.

For the moment I have this approach, but I guess I could do something far more efficient with more advanced regex usage?

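   // Zero-width lookahead: split just before each delimiter, so the delimiter stays at the start of its token
   String pattern = "(?=PatternA|PatternB|PatternC)";
   String[] tokens = input.split(pattern);
   for (String token : tokens)
   {
      // if start of token matches PatternA ...
      // else if start of token matches PatternB ...
      // etc...
   }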

Remarks:

  • Patterns are mutually exclusive.
  • The string always starts with one of the patterns.
1
If raw efficiency is your primary concern, you might get better performance from a custom parser (reading one character at a time until it hits a delimiter, then returning a token). Otherwise the only thing I can suggest is to use a private static final Pattern if you call split(pattern) frequently, because String.split(String) creates a new Pattern object every time it is called, which is costly in a loop. – Bobulous
If you don't know the order of appearance of each token in the string, then putting all of them in an alternation is the usual solution: ((PatternA)|(PatternB)|(PatternC)). However, it's not clear whether the patterns are mutually exclusive, or whether there exists a string that two of them can match. It's also not clear whether you want the "bump-along" to happen when none of the patterns match at a certain position. – nhahtdh
I have just edited the post: patterns are mutually exclusive, and we can assume that the string starts with one of the given patterns. – David

1 Answer

0
votes

You can use a loop, and inside the loop body "eat" whatever you find at the beginning of the text. That way each iteration is simple to parse, and the code stays maintainable and extensible.

The simple rule is: eat what you found and process it.

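Something like this:

String chunk;
while (text.length() > 0) {

    // eat() is assumed to return the prefix of text matched by the pattern, or "" if none
    chunk = eat(text, pattern1);
    if (chunk.length() > 0) {
       text = text.substring(chunk.length()); // consume what was matched
       ...
       continue;
    }
    chunk = eat(text, pattern2);
    if (chunk.length() > 0) {
       text = text.substring(chunk.length()); // consume what was matched
       ...
       continue;
    }
    break; // nothing matched: stop rather than loop forever
 }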

For performance reasons, compile the regex patterns before entering the loop.
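As a rough illustration of compiling once and then scanning, here is a minimal sketch that combines the alternation suggested in the comments with named groups. It assumes the patterns are mutually exclusive and never match an empty string. The class name DelimiterScanner, the group names A, B and C, and the three sample sub-patterns are hypothetical stand-ins for the real PatternA, PatternB and PatternC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterScanner {

    // Compiled once. The three sub-patterns are placeholders (assumptions),
    // standing in for PatternA, PatternB and PatternC from the question.
    private static final Pattern DELIMITERS = Pattern.compile(
            "(?<A>#[0-9]+)|(?<B>@[a-z]+)|(?<C>![A-Z]+)");

    private static final String[] GROUP_NAMES = {"A", "B", "C"};

    /** Returns triplets of {patternName, matchedDelimiter, followingText}. */
    static List<String[]> scan(String input) {
        List<String[]> triplets = new ArrayList<>();
        Matcher m = DELIMITERS.matcher(input);
        String name = null;       // name of the previously found delimiter's pattern
        String delimiter = null;  // text of the previously found delimiter
        int textStart = 0;        // index where the text after that delimiter begins
        while (m.find()) {
            if (name != null) {
                // The text of the previous triplet ends where this delimiter starts.
                triplets.add(new String[] {
                        name, delimiter, input.substring(textStart, m.start())});
            }
            name = matchedGroupName(m);
            delimiter = m.group();
            textStart = m.end();
        }
        if (name != null) {
            // The last delimiter owns everything up to the end of the input.
            triplets.add(new String[] {name, delimiter, input.substring(textStart)});
        }
        return triplets;
    }

    // Finds which named group took part in the current match.
    private static String matchedGroupName(Matcher m) {
        for (String g : GROUP_NAMES) {
            if (m.group(g) != null) {
                return g;
            }
        }
        throw new IllegalStateException("no named group matched");
    }

    public static void main(String[] args) {
        for (String[] t : scan("#12 first @abc second #7 third !XYZ fourth")) {
            System.out.println(String.join(" | ", t));
        }
    }
}
```

The input is traversed once with a single Matcher, so there is no per-token re-matching as in the split-then-test approach. Any text that appears before the first delimiter is ignored, which is harmless under the assumption that the string starts with a delimiter.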

(Also consider using a parser generator such as ANTLR.)