1
votes

I have a very loose regex to match any kind of URL inside a string: [a-z]+[:.].*?(?=\s|$) The only problem is that this regex also matches the domain part of an email address, whereas I want to exclude email addresses from the match entirely.

To be precise, I want the following matches (matched string in bold):

test example.com test

test [email protected]

Every solution I have tried just excludes the emailstring part and still matches myemail.com.

Here's a more complete test case https://regex101.com/r/NsxzCM/3/
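
The problem described above can be reproduced with a short script. This is only a sketch: the sample string and the address user@myemail.com are made up for illustration and are not taken from the linked test case.

```javascript
// The loose pattern from the question, with the global flag so that
// String.prototype.match returns every match.
const looseUrl = /[a-z]+[:.].*?(?=\s|$)/g;

// Hypothetical sample text: one bare domain and one email address.
const sample = "test example.com test user@myemail.com";

// "user" is followed by "@", not by ":" or ".", so the engine skips it,
// but the pattern then matches again starting at "myemail".
const matches = sample.match(looseUrl);
console.log(matches);
```

The second match is the unwanted one: the domain of the email address is picked up as if it were a standalone URL.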

3
Is it really worth trying to construct a monstrous, error-prone regex that finds all URLs but excludes emails? Wouldn't it be much easier to first find all URL-like strings, and then check in a second step that they are not email addresses? - Andrey Tyukin
It's a good point, but I need to parse a text and replace each URL with markup on the fly. I'm not sure how to do this in multiple steps. I could split the whole text by spaces, replace, and then rejoin, but keeping track of which parts were emails and which weren't (so that I only transform the URLs) would get messy. - Bolza
you can put an except @match before your match - Leonardo Scotti

3 Answers

5
votes

Here is a two-step proposal that uses regex replace with lambdas. The first regex finds everything that looks like an ordinary URL or an email, and the second regex then filters out the strings that look like email addresses:

const input =
  "test\n" +
  "example.com\n" +
  "www.example.com\n" +
  "test sub.example.com test\n" +
  "http://example.com\n" +
  "test http://www.example.com test\n" +
  "http://sub.example.com\n" +
  "https://example.com\n" +
  "https://www.example.com\n" +
  "https://sub.example.com\n" +
  "\n" +
  "test [email protected] <- i don't want to match this\n" +
  "[email protected]    <- i don't want to match this\n" +
  "\n" +
  "git://github.com/user/project-name.git\n" +
  "irc://irc.undernet.org:6667/mIRC jhasbdjkbasd\n";

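// Matches anything that looks like a URL or an email address.
const includeRegex = /(?:[\w/:@-]+\.[\w/:@.-]*)+(?=\s|$)/g;
// A candidate that contains "@" is treated as an email address.
const excludeRegex = /.*@.*/;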

const result = input.replace(includeRegex, function(s) {
  if (excludeRegex.test(s)) {
    return s; // leave as-is
  } else {
    return "(that's a non-email url: " + s +")";
  }
});

console.log(result);
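
Since the comments on the question ask for replacing each URL with markup on the fly, the same callback can return the markup directly. The following is a minimal sketch, assuming a hypothetical helper named linkify and a plain <a> tag as the target markup (neither appears in the question). The matched text is used as the href unchanged and is not HTML-escaped here.

```javascript
// Same two patterns as in the answer above.
const includeRegex = /(?:[\w/:@-]+\.[\w/:@.-]*)+(?=\s|$)/g;
const excludeRegex = /.*@.*/; // anything containing "@" counts as an email

// Hypothetical helper: wraps every non-email match in an anchor tag
// and returns email-like matches untouched.
function linkify(text) {
  return text.replace(includeRegex, (match) =>
    excludeRegex.test(match) ? match : `<a href="${match}">${match}</a>`
  );
}

// The sample address below is made up for illustration.
console.log(linkify("test example.com test"));
console.log(linkify("mail user@example.org now"));
```

Because the decision is made inside the replace callback, there is no need to split the text and rejoin it afterwards.
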
0
votes
(?:^|[^@\.\w-])([-\w:.]{1,256}\.[\w()]{1,6}\b)

This helps, but I don't know why it matches an extra \ as well.
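
One likely reason for the extra character: the leading group consumes the character that precedes the URL, so the full match is one character longer than the URL itself. Reading capture group 1 instead of the whole match avoids this. The sketch below assumes the pattern is meant to start with the non-capturing group (?:^|...); the helper name extractUrls and the sample strings are made up for illustration.

```javascript
// The leading group matches either the start of the string or one character
// that is not "@", ".", a word character, or "-". That character becomes
// part of the full match (m[0]); the URL itself is in capture group 1.
const urlWithPrefix = /(?:^|[^@.\w-])([-\w:.]{1,256}\.[\w()]{1,6}\b)/g;

// Hypothetical helper: collects only capture group 1 from every match.
function extractUrls(text) {
  return Array.from(text.matchAll(urlWithPrefix), (m) => m[1]);
}

console.log(extractUrls("test example.com test"));
console.log(extractUrls("mail user@example.org now"));
```

When replacing rather than extracting, the consumed prefix character has to be written back by the replacement, otherwise it is lost from the output.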

0
votes

I think you need something like this:

const URL_INCLUDE_REGEX = /[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/ig;
const URL_EXCLUDE_REGEX = /.*@.*/;

The second one is for excluding emails. So the final code will be:

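// Returns true if the text contains at least one URL that is not an email.
function containsNonEmailUrl(text) {
  let result = false;

  // replace() is used here only to visit every match of the include regex.
  text.replace(URL_INCLUDE_REGEX, (matchedText) => {
    if (!URL_EXCLUDE_REGEX.test(matchedText)) {
      result = true;
    }
    return matchedText; // keep the text unchanged
  });
  return result;
}

const text = "My website is example.com";
// const text = "My email is [email protected]"; <- this will not be matched, as it is an email, not a URL

console.log(containsNonEmailUrl(text));

The function returns true if the text contains a URL that is not an email address, and false otherwise.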