1
votes

I would like to split a String such as "word1 AND word2 OR (word3 AND (word4 OR word5)) AND word6" on "AND", but only where "AND" occurs outside parentheses, to get: "word1" "word2 OR (word3 AND (word4 OR word5))" "word6"

Note that a block of parentheses can contain many other nested blocks of parentheses.

I've done some research and found a regex that does the opposite of what I want: (?:[^AND(]|\([^)]*\))+ This regex selects everything but "AND" outside of parentheses. I also tried lookahead and lookbehind, but without success.
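As a side note on that pattern: [^AND(] is a character class, so it excludes the single characters A, N, D and ( rather than the word AND, and \([^)]*\) stops at the first closing parenthesis. A minimal sketch of this, where the class name CharClassDemo and the input "DATA AND NODE" are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharClassDemo {

    // The pattern quoted in the question, with Java string escaping.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("(?:[^AND(]|\\([^)]*\\))+");

    // Collects every match of the pattern in the given input.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUESTION_PATTERN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical input whose words contain the letters A, N and D.
        for (String match : matches("DATA AND NODE")) {
            System.out.println("\"" + match + "\"");
        }
    }
}
```

On this input the matches are the fragments "T", " ", " ", "O" and "E": every A, N or D ends a match, whether or not it belongs to the word AND.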

Is there a way to do what I'm asking with a regex?

Thanks

2
If you want to do this recursively, I do not think it is possible, because you would have to find the matching closing parenthesis for each expression, and regular expressions cannot express that. - Gábor Bakos
Should the result of splitting word1 AND ((word2 AND word3) AND word4) AND word5 be word1 ((word2 AND word3) AND word4) word5, or do you also want to split the middle part into ((word2 AND word3) and word4)? I am asking because you accepted an answer which also splits the middle part. - Pshemo
I'd like to have the first option: "word1" "((word2 AND word3) AND word4)" "word5" - beetix
Regular expressions are not a universal parsing tool. They parse only regular grammars (plus some extensions). Matching parentheses is context-free, if I remember correctly. - Siyuan Ren
Does it have to be a regex? Regular string operations can do it. - user1803551

2 Answers

0
votes

Consider creating your own parser for this task (it is not that complicated).

  1. Iterate over the string's characters to find the ranges inside which you must not split on AND. Create a variable that tracks the nesting level. Increase it when you find ( and decrease it when you find ).
    • if you find ( and the level changed from 0 to 1, it is the start of a range,
    • if you find ) and the level changed from 1 to 0, it is the end of a range.
  2. Find the positions of AND in your string (indexOf(str, fromIndex) can be helpful here) and check whether each one is outside the ranges you shouldn't split on.
  3. When you have all the positions you should split on, create a substring from start to position and update the next start to position + "AND".length(). Then take the substring for the next part.

After step 3 you should have all the parts you are interested in.


Below is an example of a parser class which seems to do what you want. But before you use it, try to create your own implementation.

import java.util.ArrayList;
import java.util.List;

class Parser {

    // Closed interval [start, end] covering one top-level parenthesized block.
    private static class Range {
        private int start, end;

        public Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean isInside(int i) {
            return start <= i && i <= end;
        }

        public int getStart() {
            return start;
        }

        @Override
        public String toString() {
            return "Range [start=" + start + ", end=" + end + "]";
        }
    }

    private List<Range> ranges = new ArrayList<Range>();

    // True when index i does not fall inside any parenthesized range.
    private boolean checkIfOutsideRanges(int i) {
        if (ranges.size() == 0)
            return true;
        if (ranges.get(0).getStart() > i)
            return true;
        for (Range r : ranges) {
            if (r.isInside(i))
                return false;
        }
        return true;
    }

    // Records the start and end index of every top-level parenthesized block.
    private List<Range> setUpRanges(String data) {
        ranges.clear(); // reset so the same Parser can be reused for another string
        int level = 0;
        int startOfRange = 0;
        int i = 0;
        for (char ch : data.toCharArray()) {
            if (ch == '(') {
                level++;
                if (level == 1)
                    startOfRange = i;
            }
            if (ch == ')') {
                level--;
                if (level == 0)
                    ranges.add(new Range(startOfRange, i));
            }
            i++;
        }
        return ranges;
    }

    public List<String> parse(String data) {
        String toFind = "AND";
        ranges = setUpRanges(data);

        // find indexes of "AND" we should split on
        List<Integer> toSplit = new ArrayList<Integer>();
        int i = -1;
        do {
            i = data.indexOf(toFind, i + 1);
            if (i != -1 && checkIfOutsideRanges(i))
                toSplit.add(i);
        } while (i != -1);

        // split on correct AND indexes
        List<String> results = new ArrayList<String>();
        int start = 0;
        for (Integer index : toSplit) {
            results.add(data.substring(start, index));
            start = index + toFind.length();
        }
        if (start < data.length())
            results.add(data.substring(start));
        return results;
    }
}

Usage example

String data = "word1 AND ((word2 AND word3) AND word4) AND word5";
Parser p = new Parser();
for (String s : p.parse(data))
    System.out.println(s);
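
If you prefer a single pass, the same nesting-level idea can be sketched more compactly. This is only a rough variant, not a replacement for the class above: the names TopLevelSplitter, split and isWholeWordAt are made up for illustration, and it assumes the parentheses are balanced. Unlike the class above, it trims each part and only treats AND as a separator when it stands as a whole word, so a word such as BRAND is left alone.

```java
import java.util.ArrayList;
import java.util.List;

final class TopLevelSplitter {

    // Splits input at every whole-word keyword found at nesting depth 0.
    // Assumes balanced parentheses; each returned part is trimmed.
    static List<String> split(String input, String keyword) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // current parenthesis nesting level
        int start = 0; // start index of the part being collected
        int i = 0;
        while (i < input.length()) {
            char ch = input.charAt(i);
            if (ch == '(') {
                depth++;
            } else if (ch == ')') {
                depth--;
            } else if (depth == 0 && isWholeWordAt(input, keyword, i)) {
                parts.add(input.substring(start, i).trim());
                i += keyword.length();
                start = i;
                continue;
            }
            i++;
        }
        parts.add(input.substring(start).trim());
        return parts;
    }

    // True when word occurs at index 'at' and is not glued to a letter or digit.
    private static boolean isWholeWordAt(String s, String word, int at) {
        if (!s.startsWith(word, at)) {
            return false;
        }
        int end = at + word.length();
        boolean cleanBefore = at == 0 || !Character.isLetterOrDigit(s.charAt(at - 1));
        boolean cleanAfter = end == s.length() || !Character.isLetterOrDigit(s.charAt(end));
        return cleanBefore && cleanAfter;
    }

    public static void main(String[] args) {
        String data = "word1 AND word2 OR (word3 AND (word4 OR word5)) AND word6";
        for (String part : split(data, "AND")) {
            System.out.println(part);
        }
    }
}
```

For the string from the question this gives the three parts word1, word2 OR (word3 AND (word4 OR word5)) and word6.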
0
votes

With the Pattern.compile method you can pass Pattern.DOTALL as a flag. A code sample is given below.

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        String s = "word1 AND word2 OR (word3 AND (word4 OR word5)) AND word6";

        String regEx = "(?:[^AND(]|\\([^)]*\\))+";
        Pattern pattern = Pattern.compile(regEx, Pattern.DOTALL);
        Matcher matcher = pattern.matcher(s);

        while (matcher.find()) {
            System.out.println("Found the text \"" + matcher.group() + "\" starting at " + matcher.start() + " index and ending at index " + matcher.end());
        }
    }
}

Please try this.