Java: RegExp for matching words between a quote

3

votes

I have the following test string

This is my "te

st" case
with lines for "tes"t"ing" with regex
But as he said "It could be an arbitrary number of words"

I want to match everything between double quotes (") as long as the quotes are bound to words. I have the following regexp:

\"([^\"]*)\"

which matches the word "test" quite well, even when it is split apart. Is there a way to match tes"t"ing as a whole word as well (and not split it into two matches)? Adding word boundaries, as in \b\"([^\"]*)\"\b, doesn't work because it matches neither the very first " nor the group just mentioned.
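
To make the splitting concrete, here is a minimal sketch that runs the pattern above against the test string and collects Group 1 of every match. The class name NaiveQuoteDemo and the helper findQuoted are made up for illustration and are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveQuoteDemo {
    // The pattern from the question: a quote, any run of non-quote chars, a quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Hypothetical helper: returns the text captured between each pair of quotes.
    static List<String> findQuoted(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "This is my \"te\n\nst\" case\n"
                 + "with lines for \"tes\"t\"ing\" with regex\n"
                 + "But as he said \"It could be an arbitrary number of words\"";
        // tes"t"ing is reported as two separate matches, tes and ing,
        // because [^"]* stops at the first inner quote.
        for (String match : findQuoted(s)) {
            System.out.println("[" + match + "]");
        }
    }
}
```

Because [^"] also matches line breaks, the first quoted segment is found even though it spans several lines; only the quotes nested inside a word cause trouble.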

I need it for Java regexp.

UPDATE: As a result, I need to get

This is my \q{te

st} case
with lines for \q{tes"t"ing} with regex
But as he said \q{It could be an arbitrary number of words}

java regex

Given the example in your question. What would be the desired output? – dpr

Sorry for the delay. I've updated now. – LeO

“bound to words” is unclear as you evidently wish to match ”te st” but that is not a word (is not comprised entirely of word characters). What about ”te x st”? Do you wish to match b in a”b”? – Cary Swoveland

te x ting should be matched while a"b test should be ignored, ie a"b "Test" should match "Test" only. matching empty would be great as well. Honestly I didn't test all the cases. But I need to have it for Java definitely – LeO

2

votes

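You may use this regex, which uses a lookbehind and a lookahead to ensure that the characters immediately before the opening quote and immediately after the closing quote are not non-whitespace characters (that is, each is either whitespace or the start/end of the string):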

(?<!\S)".*?"(?!\S)

RegEx Demo

Adding a helpful comment from the OP, which solved the actual problem (it was a bit broader than what the question described):

str = str.replaceAll("(?s)(?<!\\S)\"(.*?)\"(?!\\S)", "\\\\q{$1}");
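
As a self-contained sketch of the call above, applied to the test string from the question (the class name WhitespaceQuoteDemo and the method convert are assumptions added for illustration):

```java
class WhitespaceQuoteDemo {
    // (?s) lets . cross line breaks; the lookarounds require whitespace
    // (or the start/end of the string) just outside each quote.
    private static final String REGEX = "(?s)(?<!\\S)\"(.*?)\"(?!\\S)";

    // Hypothetical helper wrapping the replaceAll call from the answer.
    static String convert(String input) {
        // "\\\\q{$1}" inserts a literal backslash, then q{, Group 1, and }.
        return input.replaceAll(REGEX, "\\\\q{$1}");
    }

    public static void main(String[] args) {
        String s = "This is my \"te\n\nst\" case\n"
                 + "with lines for \"tes\"t\"ing\" with regex\n"
                 + "But as he said \"It could be an arbitrary number of words\"";
        System.out.println(convert(s));
    }
}
```

Quotes that touch a non-whitespace character on the outside, such as the one in a"b, are skipped, so a"b "Test" becomes a"b \q{Test}.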

2

votes

You may use

.replaceAll("\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}")

Or, if the matches may span across multiple lines, add (?s) modifier:

.replaceAll("(?s)\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}")

See the regex demo.

Details

\B"\b - a " that is either at the start of the string or preceded with a non-word char, and that is followed with a word char
(.*?) - Group 1: any zero or more chars other than line break chars (or any chars at all when the (?s) modifier is used), as few as possible
\b"\B - a " that is either at the end of the string or followed with a non-word char, and that is preceded with a word char.

The replacement is a backslash ("\\\\", note the double literal backslash is necessary in the regex replacement part to insert a real, literal backslash since a backslash is a special char in the replacement pattern), q{, the Group1 value ($1) and a }.

See the Java demo:

String s = "This is my \"te\n\nst\" case\nwith lines for \"tes\"t\"ing\" with regex\nBut as he said \"It could be an arbitrary number of words\"";
System.out.println(s.replaceAll("\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}"));

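Output (the first quoted segment spans line breaks, so it is left unchanged here because this demo does not use the (?s) modifier):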

This is my "te

st" case
with lines for \q{tes"t"ing} with regex
But as he said \q{It could be an arbitrary number of words}
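
For comparison, here is a minimal sketch of the same demo with the (?s) modifier added, so that the quoted segment spanning several lines is replaced as well. The class name BoundaryQuoteDemo and the method convert are illustrative assumptions.

```java
class BoundaryQuoteDemo {
    // Same pattern as in the demo, with (?s) so that . also matches line breaks.
    private static final String REGEX = "(?s)\\B\"\\b(.*?)\\b\"\\B";

    // Hypothetical helper wrapping the replaceAll call.
    static String convert(String input) {
        return input.replaceAll(REGEX, "\\\\q{$1}");
    }

    public static void main(String[] args) {
        String s = "This is my \"te\n\nst\" case\n"
                 + "with lines for \"tes\"t\"ing\" with regex\n"
                 + "But as he said \"It could be an arbitrary number of words\"";
        // All three quoted segments are wrapped in \q{...} here,
        // including the one that contains line breaks.
        System.out.println(convert(s));
    }
}
```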

NOTE:

If you also need to match two consecutive double quotes that are neither preceded nor followed by word characters, you can modify the above regular expression as follows:

 .replaceAll("(?s)\\B(\"\\b(.*?)\\b\"|\"\")\\B", "\\\\q{$2}")

See the regex demo.

Details

(?s) - an embedded flag option (equal to Pattern.DOTALL) that makes . match line break chars, too
\B - a non-word boundary, here, it means that immediately to the left, there must be a non-word char or start of string (because after \B, there is a non-word char, ")
( - start of the first capturing group:
- "\b(.*?)\b" - " followed with a word char, then Group 2 capturing any zero or more chars, as few as possible, and then a " that is preceded with a word char (that is why this pattern can't match "", since after the first and before the second, there must be a letter, digit or _)
- | - or
- "" - a "" substring
) - end of the first capturing group
\B - a non-word boundary, here, it means that immediately to the right, there must be a non-word char or end of string (because before \B, there is a non-word char, ").
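
A small sketch of how this variant behaves on an empty pair of quotes. The sample input, the class name EmptyQuoteDemo, and the method convert are assumptions made for illustration. When the "" alternative matches, Group 2 does not participate in the match, and Java substitutes an empty string for $2.

```java
class EmptyQuoteDemo {
    // Group 1 is either "word...word" (with Group 2 holding the content) or a bare "".
    private static final String REGEX = "(?s)\\B(\"\\b(.*?)\\b\"|\"\")\\B";

    // Hypothetical helper wrapping the replaceAll call from the note.
    static String convert(String input) {
        // $2 is empty when the "" alternative matched.
        return input.replaceAll(REGEX, "\\\\q{$2}");
    }

    public static void main(String[] args) {
        // Illustrative input, not taken from the question.
        System.out.println(convert("He said \"\" and \"word\""));
    }
}
```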

2

votes

You could use the regular expression

(?<=\")(?:[a-z]+\"[a-z]+\"[a-z]+|[a-z][^"]+)(?=\")

with the case-insensitive flag i (or prefix the pattern with (?i)).

Demo

As seen at the link this regex matches the following three substrings of the text given in the question:

te

st
tes"t"ing
It could be an arbitrary number of words

The regex engine performs the following operations:

(?<=\")    # match a double-quote in a positive lookbehind
(?:        # begin a non-capture group
  [a-z]+\" # match 1+ letters, then a double-quote
  [a-z]+\" # match 1+ letters, then a double-quote
  [a-z]+   # match 1+ letters
  |        # or
  [a-z]    # match 1 letter
  [^"]+    # match 1+ characters other than a double-quote
)          # end non-capture group
(?=\")     # match a double-quote in a positive lookahead
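
Since the thread asks for Java, here is a minimal sketch of using this pattern with java.util.regex to list the matches. The class name LookaroundMatchDemo and the helper findMatches are made up for illustration; the flag Pattern.CASE_INSENSITIVE plays the role of the i flag mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundMatchDemo {
    // The quotes sit in lookarounds, so they are not part of the match itself.
    private static final Pattern BETWEEN_QUOTES = Pattern.compile(
            "(?<=\")(?:[a-z]+\"[a-z]+\"[a-z]+|[a-z][^\"]+)(?=\")",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every substring the pattern matches.
    static List<String> findMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = BETWEEN_QUOTES.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "This is my \"te\n\nst\" case\n"
                 + "with lines for \"tes\"t\"ing\" with regex\n"
                 + "But as he said \"It could be an arbitrary number of words\"";
        for (String match : findMatches(s)) {
            System.out.println("[" + match + "]");
        }
    }
}
```

Note that this only finds the matches; producing the \q{...} output from the question would still need a replacement step that also consumes the surrounding quotes.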
