0
votes

I would like to capture all occurrences within a string that match a specific regular expression. I'm using DataWeave 2.0 (which means Mule Runtime 4.3 and, in my case, Anypoint Studio 7.5).

I've tried to use scan() and match() from the DataWeave core library, but I can't quite get the result I want.

Here are some of the things I've tried:

%dw 2.0
output application/json

// sample input with hashtag keywords
var microList = 'Someone is giving away millions. See @realmcsrooge at #downtownmalls now!
#shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls'
---
{
    withscan: microList scan /(#[^\s]*).*/,
    sanitized: microList replace /\n/ 
        with ' ',
    sani_match: microList replace /\n/ 
        with ' ' match /.*(#[^\s]*).*/, // gives full string and last match
    sani_scan: microList replace /\n/ 
        with ' ' scan /.*(#[^\s]*).*/   // gives array of arrays, string and last match
}

Here are the respective results:

{
  "withscan": [
    [
      "#downtownmalls now!",
      "#downtownmalls"
    ],
    [
      "#shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls",
      "#shoplocal"
    ]
  ],
  "sanitized": "Someone is giving away millions. See @realmcsrooge at #downtownmalls now! #shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls",
  "sani_match": [
    "Someone is giving away millions. See @realmcsrooge at #downtownmalls now! #shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls",
    "#downtowndancehalls"
  ],
  "sani_scan": [
    [
      "Someone is giving away millions. See @realmcsrooge at #downtownmalls now! #shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls",
      "#downtowndancehalls"
    ]
  ]
}

In the first example, it appears that the parser is doing line processing. So there is one element in the result array for each line. An element consists of the full matched portion and the tagged portion using the first occurrence of the pattern.

After stripping newlines, the third example (sani_match) gave me an array with the fully matched portion and the tagged portion, this time the last occurrence of the pattern on the line.

The final pattern (sani_scan) gives similar results; the only difference is that the result is embedded as an element in an array of arrays.

What I want is simply an array with all occurrences of a specified pattern.


2 Answers

3
votes

If you want to capture all occurrences within a string that match a specific regular expression, I found that the magic words are "Overlapping Matches".

If what you really want is to get the hashtags from the string, just use Valdi_Bo's solution.

To enable the single-line (DOTALL) flag in a Java regular expression, add (?s) at the beginning of the pattern. With this flag, the dot also matches newline characters.

script:

%dw 2.0
output application/json

var str = 'Someone is giving away millions. See @realmcsrooge at #downtownmalls now!
#shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls'
---
{
    // (?s) is the single-line modifier: the dot also matches newlines
    // (?=(X)). enables overlapping matches: the lookahead captures X without consuming it
    matchUntilEnd: str scan(/(?s)(?=(#([^\s]*).*))./) map $[1],
    justTags: str scan(/(?s)#([^\s]*)/) map $[1],
    Valdi_BoSolutionWithGroups: str scan(/#([\S]+)/) map $[1]
}

output:

{
  "matchUntilEnd": [
    "#downtownmalls now!\n#shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls",
    "#shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls",
    "#giveaway @barry sent you. #downtowndancehalls",
    "#downtowndancehalls"
  ],
  "justTags": [
    "downtownmalls",
    "shoplocal",
    "giveaway",
    "downtowndancehalls"
  ],
  "Valdi_BoSolutionWithGroups": [
    "downtownmalls",
    "shoplocal",
    "giveaway",
    "downtowndancehalls"
  ]
}
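
For readers who want to see the two modifiers in isolation, here is a minimal Java sketch. It assumes DataWeave patterns behave like java.util.regex patterns, which the (?s) remark above suggests but which I have not checked against the DataWeave documentation. The class and method names (OverlapDemo, scanGroup) and the short sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {
    // Hypothetical helper: collects the given capture group from every match.
    static List<String> scanGroup(String regex, String input, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "see #a now\n#b and #c";

        // The lookahead captures the tail starting at '#' without consuming it;
        // the final dot consumes only one character, so later tags are still found.
        // Without (?s) the dot stops at the newline, so each tail ends with its line.
        System.out.println(scanGroup("(?=(#.*)).", s, 1));
        // -> [#a now, #b and #c, #c]

        // With (?s) the dot also matches '\n', so every tail runs to the end of the input.
        List<String> tails = scanGroup("(?s)(?=(#.*)).", s, 1);
        System.out.println(tails.size());
        // -> 3
    }
}
```

The overlap comes from the lookahead being zero-width: each captured tail may contain text that a later match captures again.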

1
votes

If you want to match all "words" (actually non-blank chars) starting with # use a pattern like:

#[\S]+

i.e.:

  • # - represents itself,
  • [\S]+ - a non-empty sequence of non-white chars.

I think you can do the job without capturing groups.

Another hint is to be very cautious when using .* in patterns, as it is likely to match either too little or too much.

In your first example (withscan), the trailing .* in the pattern "consumes" the rest of the current line (up to but excluding the newline, since a dot does not match a newline by default). So if the rest of the line contains another "#..." fragment, your capturing group has no chance to match it.
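
The effect of the trailing .* can be reproduced with a small Java sketch. I am assuming here that DataWeave uses Java-style regex semantics; the class name GreedyTailDemo, the helper groups, and the sample text are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyTailDemo {
    // Hypothetical helper: returns the requested group of every match.
    static List<String> groups(String regex, String input, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "at #x now and #y\n#z end";

        // The trailing .* swallows " now and #y", so #y is never captured.
        System.out.println(groups("(#[^\\s]*).*", text, 1));
        // -> [#x, #z]

        // Without the trailing .*, the search resumes right after each tag.
        System.out.println(groups("#[^\\s]*", text, 0));
        // -> [#x, #y, #z]
    }
}
```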

To capture all occurrences of a #... string, you generally need to pass the global option to the regex processor, but DataWeave may apply this option by default (I don't know this language).
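
As a point of comparison, Java has no global flag; the equivalent is calling find() repeatedly on one Matcher. The sketch below applies that to the question's sample text, assuming DataWeave's scan behaves like such a loop (the output in the other answer suggests it does). HashtagScan and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagScan {
    // Hypothetical helper: returns the full text of every match, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each find() call continues after the previous match ("global" matching).
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String microList = "Someone is giving away millions. See @realmcsrooge at #downtownmalls now!\n"
                + "#shoplocal and tell them #giveaway @barry sent you. #downtowndancehalls";

        // No capturing group is needed: the whole match is the tag.
        System.out.println(findAll("#\\S+", microList));
        // -> [#downtownmalls, #shoplocal, #giveaway, #downtowndancehalls]
    }
}
```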

Also take a look at a working example at https://regex101.com/r/NPiMok/1 (a convenient regex testing site).