3
votes

I have the following test string

This is my "te

st" case
with lines for "tes"t"ing" with regex
But as he said "It could be an arbitrary number of words"

I want to match everything between two " characters, as long as the quotes are bound to words. I have the following regexp:

\"([^\"]*)\"

which matches "test" quite well, even when it is split across lines. Is there a way to also match tes"t"ing as one whole word, rather than split into two matches? Adding the word boundaries \b, as in \b\"([^\"]*)\"\b, doesn't work very well because it matches neither the very first " nor the group just mentioned.

I need it for Java regexp.

UPDATE As a result I need to have

This is my \q{te

st} case
with lines for \q{tes"t"ing} with regex
But as he said \q{It could be an arbitrary number of words}
3
Given the example in your question, what would be the desired output? – dpr
Sorry for the delay. I've updated it now. – LeO
"Bound to words" is unclear, as you evidently wish to match "te st", but that is not a word (it is not composed entirely of word characters). What about "te x st"? Do you wish to match b in a"b"? – Cary Swoveland
te x ting should be matched, while a"b test should be ignored, i.e. a"b "Test" should match "Test" only. Matching empty quotes would be great as well. Honestly, I didn't test all the cases, but I definitely need it for Java. – LeO

3 Answers

2
votes

You may use this regex, which uses a lookbehind and a lookahead to ensure that the characters immediately before the opening quote and immediately after the closing quote are not non-whitespace characters:

(?<!\S)".*?"(?!\S)

RegEx Demo

Adding a helpful comment from the OP, whose actual problem was slightly broader than what the question mentioned. This is what worked:

str = str.replaceAll("(?s)(?<!\\S)\"(.*?)\"(?!\\S)", "\\\\q{$1}"); 
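
To see the lookaround approach on the question's test string, here is a minimal, self-contained sketch. The class name `QuoteLookaroundDemo` and the helper `convert` are illustrative names, not part of the original answer; the regex and replacement are the ones shown above.

```java
public class QuoteLookaroundDemo {

    // Wraps every "..." span in \q{...} when the opening quote is not preceded
    // by a non-whitespace char and the closing quote is not followed by one.
    // (?s) lets . match line breaks so a span may cross lines.
    static String convert(String input) {
        return input.replaceAll("(?s)(?<!\\S)\"(.*?)\"(?!\\S)", "\\\\q{$1}");
    }

    public static void main(String[] args) {
        String s = "This is my \"te\n\nst\" case\n"
                 + "with lines for \"tes\"t\"ing\" with regex\n"
                 + "But as he said \"It could be an arbitrary number of words\"";
        System.out.println(convert(s));
    }
}
```

Run on the question's input, this should produce the output listed under UPDATE in the question. Quotes glued to a preceding word character, as in a"b, are left alone because the lookbehind fails there.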
2
votes

You may use

.replaceAll("\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}")

Or, if the matches may span multiple lines, add the (?s) modifier:

.replaceAll("(?s)\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}")

See the regex demo.

Details

  • \B"\b - a " that is either at the start of the string or preceded with a non-word char, and that is followed with a word char
  • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
  • \b"\B - a " that is either at the end of the string or followed with a non-word char, and that is preceded with a word char.

The replacement is a literal backslash (written "\\\\" in Java source: the regex replacement pattern needs two backslashes to insert one real backslash, since a backslash is a special char there, and each of those must be doubled again inside a Java string literal), followed by q{, the Group 1 value ($1) and a }.
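
As an alternative to counting backslashes by hand, the literal part of the replacement can be escaped with `Matcher.quoteReplacement`. The sketch below is only an illustration of that idea; the class name `ReplacementEscapeDemo` and the helper `wrap` are made-up names, and the regex is the single-line variant from above.

```java
import java.util.regex.Matcher;

public class ReplacementEscapeDemo {

    static String wrap(String input) {
        // quoteReplacement escapes backslashes and dollar signs, so the
        // literal text \q{ can be written with a single Java-level escape.
        String literalPrefix = Matcher.quoteReplacement("\\q{");
        // $1 is appended unescaped so it still refers to Group 1.
        return input.replaceAll("\\B\"\\b(.*?)\\b\"\\B", literalPrefix + "$1}");
    }

    public static void main(String[] args) {
        System.out.println(wrap("say \"hi\" twice"));
    }
}
```

The result is the same as with the hand-written "\\\\q{$1}" replacement; the helper just makes the intent of the escaping explicit.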

See the Java demo:

String s = "This is my \"te\n\nst\" case\nwith lines for \"tes\"t\"ing\" with regex\nBut as he said \"It could be an arbitrary number of words\"";
System.out.println(s.replaceAll("\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}"));

Output (this demo omits (?s), so the first quoted span, which crosses line breaks, is left unchanged):

This is my "te

st" case
with lines for \q{tes"t"ing} with regex
But as he said \q{It could be an arbitrary number of words}

NOTE:

If you also need to match two consecutive double quotes that are neither preceded nor followed by word characters, you can modify the above regular expression in the following way:

 .replaceAll("(?s)\\B(\"\\b(.*?)\\b\"|\"\")\\B", "\\\\q{$2}")

See the regex demo.

Details

  • (?s) - an embedded flag option (equal to Pattern.DOTALL) that makes . match line break chars, too
  • \B - a non-word boundary, here, it means that immediately to the left, there must be a non-word char or start of string (because after \B, there is a non-word char, ")
  • ( - start of the first capturing group:
    • "\b(.*?)\b" - " followed with a word char, then Group 2 capturing any zero or more chars, as few as possible, and then a " that is preceded with a word char (that is why this pattern can't match "", since after the first and before the second, there must be a letter, digit or _)
    • | - or
    • "" - a "" substring
  • ) - end of the first capturing group
  • \B - a non-word boundary, here, it means that immediately to the right, there must be a non-word char or end of string (because before \B, there is a non-word char, ").
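
The variant with the `""` alternative can be tried out with a small sketch like the one below. The class name `EmptyQuoteDemo` and the helper `convert` are illustrative only. It relies on Java appending nothing for `$2` when Group 2 did not take part in the match, which is what happens when the `""` branch matches.

```java
public class EmptyQuoteDemo {

    // Group 1 is the whole quoted span; Group 2 is the text between the quotes.
    // When the "" alternative matches, Group 2 does not participate, so $2
    // contributes nothing and the result is \q{}.
    static String convert(String input) {
        return input.replaceAll("(?s)\\B(\"\\b(.*?)\\b\"|\"\")\\B", "\\\\q{$2}");
    }

    public static void main(String[] args) {
        System.out.println(convert("say \"\" and \"word\""));
    }
}
```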
2
votes

You could use the regular expression

(?<=\")(?:[a-z]+\"[a-z]+\"[a-z]+|[a-z][^"]+)(?=\")

with the case-insensitive flag i (or prefix the pattern with (?i)).

Demo

As the link shows, this regex matches the following three substrings of the text given in the question:

te                                                                    st
tes"t"ing
It could be an arbitrary number of words

The regex engine performs the following operations:

(?<=\")    # match a double-quote in a positive lookbehind
(?:        # begin a non-capture group
  [a-z]+\" # match 1+ letters, then a double-quote
  [a-z]+\" # match 1+ letters, then a double-quote
  [a-z]+   # match 1+ letters
  |        # or
  [a-z]    # match 1 letter
  [^"]+    # match 1+ characters other than a double-quote
)          # end non-capture group
(?=\")     # match a double-quote in a positive lookahead
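
Since the question asks for Java, here is a hedged sketch of how this pattern might be used with `Pattern` and `Matcher` to collect the matches. The class name `QuotedSpanFinder` and the method `find` are hypothetical names; `Pattern.CASE_INSENSITIVE` stands in for the i flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedSpanFinder {

    // Same pattern as above; the quotes themselves stay outside the match
    // because they are only checked by the lookbehind and lookahead.
    private static final Pattern QUOTED = Pattern.compile(
            "(?<=\")(?:[a-z]+\"[a-z]+\"[a-z]+|[a-z][^\"]+)(?=\")",
            Pattern.CASE_INSENSITIVE);

    // Returns every substring matched by the pattern, in order of appearance.
    static List<String> find(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "This is my \"te\n\nst\" case\n"
                 + "with lines for \"tes\"t\"ing\" with regex\n"
                 + "But as he said \"It could be an arbitrary number of words\"";
        for (String match : find(s)) {
            System.out.println("[" + match + "]");
        }
    }
}
```

Note that the first alternative only covers the exact shape letters"letters"letters, so inputs with a different number of inner quotes would need a more general pattern.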