151
votes

I'm looking for a neat RegEx solution to replace

  • All non Alpha-Numeric Characters
  • All NewLines
  • All multiple instances of white space

With a single space


For those playing at home (the following does work)

text.replace(/[^a-z0-9]/gmi, " ").replace(/\s+/g, " ");

My thinking is that RegEx is probably powerful enough to achieve this in one statement. The components I think I'd need are

  • [^a-z0-9] - to match non-alphanumeric characters
  • \s+ - to match any run of whitespace
  • \r?\n|\r - to match all newlines
  • /gmi - global, multi-line, case-insensitive

However, I can't seem to write the regex in the right way (the following doesn't work)

text.replace(/[^a-z0-9]|\s+|\r?\n|\r/gmi, " ");


Input

234&^%,Me,2 2013 1080p x264 5 1 BluRay
S01(*&asd 05
S1E5
1x05
1x5


Desired Output

234 Me 2 2013 1080p x264 5 1 BluRay S01 asd 05 S1E5 1x05 1x5
8
How exactly does your attempt not work? What goes wrong? - Pointy

8 Answers

269
votes

Be aware, that \W leaves the underscore. A short equivalent for [^a-zA-Z0-9] would be [\W_]

text.replace(/[\W_]+/g," ");

\W is the negation of shorthand \w for [A-Za-z0-9_] word characters (including the underscore)

Example at regex101.com
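
A quick check of the difference between \W and [\W_], as a minimal sketch. The sample string is made up for illustration and is not from the question:

```javascript
// Hypothetical sample containing an underscore, punctuation and a newline.
const sample = "foo_bar, baz!\nqux";

// \W does not match "_", so the underscore survives.
const withW = sample.replace(/\W+/g, " ");

// [\W_] matches the underscore as well.
const withWUnderscore = sample.replace(/[\W_]+/g, " ");

console.log(withW);           // "foo_bar baz qux"
console.log(withWUnderscore); // "foo bar baz qux"
```

Whether the underscore should count as a separator depends on the input, so pick the variant that fits your data.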

145
votes

Jonny 5 beat me to it. I was going to suggest using the \W+ without the \s as in text.replace(/\W+/g, " "). This covers white space as well.

15
votes

Since the [^a-z0-9] character class matches everything that is not alphanumeric, it matches whitespace characters too!

 text.replace(/[^a-z0-9]+/gi, " ");
8
votes

Well I think you just need to add a quantifier to each pattern. Also the carriage-return thing is a little funny:

text.replace(/[^a-z0-9]+|\s+/gmi, " ");

Edit: \s matches \r and \n too, so the separate newline alternatives are not needed.
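
To see why the quantifier matters, here is a small sketch comparing the question's single-statement attempt with a quantified pattern. The input string is a shortened, hypothetical variant of the question's sample:

```javascript
// Hypothetical input: a run of three non-alphanumeric characters plus a newline.
const input = "S01(*&asd\n05";

// Without quantifiers each character is replaced on its own,
// so a run of N characters becomes N spaces.
const perChar = input.replace(/[^a-z0-9]|\s+|\r?\n|\r/gmi, " ");

// With a quantifier the whole run collapses into a single space.
const perRun = input.replace(/[^a-z0-9]+/gi, " ");

console.log(JSON.stringify(perChar)); // "S01   asd 05"
console.log(JSON.stringify(perRun));  // "S01 asd 05"
```

The first alternative [^a-z0-9] already matches whitespace and newlines, so the later alternatives never get a chance to match a longer run.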

4
votes

I saw a different post whose pattern also kept letters with diacritical marks, which is great

s.replace(/[^a-zA-Z0-9À-ž\s]/g, "")
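
One caveat, shown as a rough sketch: the range À-ž (U+00C0 to U+017E) also spans the symbols × and ÷. The second pattern below is an alternative that is not from this answer; it uses Unicode property escapes and assumes an engine that supports them with the u flag. The sample string is made up.

```javascript
// Hypothetical sample with accented letters and a multiplication sign.
const s = "Crème brûlée × 2, Žižek!";

// The range À-ž keeps accented letters, but also keeps × (U+00D7).
const byRange = s.replace(/[^a-zA-Z0-9À-ž\s]/g, "");

// Alternative: \p{L} matches any letter, \p{N} any number (requires the u flag).
const byProperty = s.replace(/[^\p{L}\p{N}\s]/gu, "");

console.log(byRange);    // "Crème brûlée × 2 Žižek"
console.log(byProperty); // "Crème brûlée  2 Žižek"
```

Note that both patterns delete the unwanted characters rather than replacing them with a space, so neighbouring whitespace is left as it was.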

4
votes

This is an old post of mine, and the accepted answers are good for the most part. However, I decided to benchmark each solution, plus one other obvious pattern, just for fun. I wondered whether the regex patterns would perform differently in different browsers and with different sized strings.

So basically I ran jsPerf in

  • Testing in Chrome 65.0.3325 / Windows 10 0.0.0
  • Testing in Edge 16.16299.0 / Windows 10 0.0.0

The regex patterns I tested were

  • /[\W_]+/g
  • /[^a-z0-9]+/gi
  • /[^a-zA-Z0-9]+/g

I ran them against strings of random characters of the following lengths

  • length 5000
  • length 1000
  • length 200

Example JavaScript I used: var newstr = str.replace(/[\W_]+/g," ");

Each run consisted of 50 or more samples of each regex, and I ran them 5 times in each browser.
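
For anyone who wants to repeat something similar locally, here is a crude timing sketch. It is not how jsPerf measures; the helper names randomString and opsPerSec and the 200 ms window are my own assumptions, and the numbers it prints will be noisier than a proper benchmark.

```javascript
// Build a random string of printable ASCII characters (codes 32 to 126).
function randomString(length) {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += String.fromCharCode(32 + Math.floor(Math.random() * 95));
  }
  return out;
}

// Count how many replace calls finish within durationMs, scaled to ops/sec.
function opsPerSec(pattern, str, durationMs = 200) {
  const start = Date.now();
  let ops = 0;
  while (Date.now() - start < durationMs) {
    str.replace(pattern, " ");
    ops++;
  }
  const elapsedMs = Date.now() - start;
  return (ops * 1000) / elapsedMs;
}

const patterns = [/[\W_]+/g, /[^a-z0-9]+/gi, /[^a-zA-Z0-9]+/g];
const str = randomString(1000);
for (const pattern of patterns) {
  console.log(String(pattern), Math.round(opsPerSec(pattern, str)));
}
```

On ASCII input all three patterns produce the same output, so only the speed differs.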

Let's race our horses!

Results

                                Chrome                  Edge
Chars   Pattern                 Ops/Sec     Deviation   Ops/Sec     Deviation
------------------------------------------------------------------------
5,000   /[\W_]+/g                19,977.80  1.09         10,820.40  1.32
5,000   /[^a-z0-9]+/gi           19,901.60  1.49         10,902.00  1.20
5,000   /[^a-zA-Z0-9]+/g         19,559.40  1.96         10,916.80  1.13
------------------------------------------------------------------------
1,000   /[\W_]+/g                96,239.00  1.65         52,358.80  1.41
1,000   /[^a-z0-9]+/gi           97,584.40  1.18         52,105.00  1.60
1,000   /[^a-zA-Z0-9]+/g         96,965.80  1.10         51,864.60  1.76
------------------------------------------------------------------------
  200   /[\W_]+/g               480,318.60  1.70        261,030.40  1.80
  200   /[^a-z0-9]+/gi          476,177.80  2.01        261,751.60  1.96
  200   /[^a-zA-Z0-9]+/g        486,423.00  0.80        258,774.20  2.15

Truth be told, once the deviation is taken into consideration the patterns were nearly indistinguishable within each browser. I think that if I ran this even more times the results would become a little clearer, but not by much.

Theoretical scaling to 1 character (Scaled = Ops/Sec multiplied by the string length)

                            Chrome                        Edge
Chars   Pattern             Ops/Sec     Scaled            Ops/Sec   Scaled
------------------------------------------------------------------------
5,000   /[\W_]+/g            19,977.80  99,889,000       10,820.40  54,102,000
5,000   /[^a-z0-9]+/gi       19,901.60  99,508,000       10,902.00  54,510,000
5,000   /[^a-zA-Z0-9]+/g     19,559.40  97,797,000       10,916.80  54,584,000
------------------------------------------------------------------------

1,000   /[\W_]+/g            96,239.00  96,239,000       52,358.80  52,358,800
1,000   /[^a-z0-9]+/gi       97,584.40  97,584,400       52,105.00  52,105,000
1,000   /[^a-zA-Z0-9]+/g     96,965.80  96,965,800       51,864.60  51,864,600
------------------------------------------------------------------------

  200   /[\W_]+/g           480,318.60  96,063,720      261,030.40  52,206,080
  200   /[^a-z0-9]+/gi      476,177.80  95,235,560      261,751.60  52,350,320
  200   /[^a-zA-Z0-9]+/g    486,423.00  97,284,600      258,774.20  51,754,840
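
The Scaled column is just Ops/Sec multiplied by the string length, so it reads as characters processed per second. A small sketch of that arithmetic, using the Chrome figures for /[\W_]+/g from the table (the function name scaled is mine):

```javascript
// Scaled = Ops/Sec * string length, rounded to absorb floating-point noise.
function scaled(opsPerSecond, chars) {
  return Math.round(opsPerSecond * chars);
}

console.log(scaled(19977.80, 5000)); // 99889000
console.log(scaled(96239.00, 1000)); // 96239000
console.log(scaled(480318.60, 200)); // 96063720
```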

I wouldn't read too much into these results, as the differences between the patterns are not really significant. All we can really tell is that Edge is slower :o . Additionally, that I was super bored.

Anyway, you can run the benchmark for yourself.

Jsperf Benchmark here

2
votes

To replace dashes as well, do the following (note that \W already matches the dash, so the explicit - is redundant but harmless):

text.replace(/[\W_-]/g,' ');
1
votes

For anyone still struggling (like me...) after the more expert replies above, this C# version works in Visual Studio 2019:

outputString = Regex.Replace(inputString, @"\W", "_");

Remember to add

using System.Text.RegularExpressions;