How to remove all non-alpha numeric characters from a string in MySQL?

65

votes

I'm working on a routine that compares strings, but for better efficiency I need to remove all characters that are not letters or numbers.

I'm using multiple REPLACE functions now, but maybe there is a faster and nicer solution ?

mysqlregexstringalphanumeric

i bet that the regular comparison is more efficient than first string manipulation then comparison. – Randy

94

votes

None of these answers worked for me. I had to create my own function called alphanum which stripped the chars for me:

DROP FUNCTION IF EXISTS alphanum; 
DELIMITER | 
CREATE FUNCTION alphanum( str CHAR(255) ) RETURNS CHAR(255) DETERMINISTIC
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret CHAR(255) DEFAULT ''; 
  DECLARE c CHAR(1);
  IF str IS NOT NULL THEN 
    SET len = CHAR_LENGTH( str ); 
    REPEAT 
      BEGIN 
        SET c = MID( str, i, 1 ); 
        IF c REGEXP '[[:alnum:]]' THEN 
          SET ret=CONCAT(ret,c); 
        END IF; 
        SET i = i + 1; 
      END; 
    UNTIL i > len END REPEAT; 
  ELSE
    SET ret='';
  END IF;
  RETURN ret; 
END | 
DELIMITER ;

Now I can do:

select 'This works finally!', alphanum('This works finally!');

and I get:

+---------------------+---------------------------------+
| This works finally! | alphanum('This works finally!') |
+---------------------+---------------------------------+
| This works finally! | Thisworksfinally                |
+---------------------+---------------------------------+
1 row in set (0.00 sec)

Hurray!

22

votes

From a performance point of view, (and on the assumption that you read more than you write)

I think the best way would be to pre calculate and store a stripped version of the column, This way you do the transform less.

You can then put an index on the new column and get the database to do the work for you.

14

votes

SELECT teststring REGEXP '[[:alnum:]]+';

SELECT * FROM testtable WHERE test REGEXP '[[:alnum:]]+';

See: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
Scroll down to the section that says: [:character_class:]

If you want to manipulate strings the fastest way will be to use a str_udf, see:
https://github.com/hholzgra/mysql-udf-regexp

12

votes

Since MySQL 8.0 you can use regular expression to remove non alphanumeric characters from a string. There is method REGEXP_REPLACE

Here is the code to remove non-alphanumeric characters:

UPDATE {table} SET {column} = REGEXP_REPLACE({column}, '[^0-9a-zA-Z ]', '')

9

votes

Straight and battletested solution for latin and cyrillic characters:

DELIMITER //

CREATE FUNCTION `remove_non_numeric_and_letters`(input TEXT)
  RETURNS TEXT
  BEGIN
    DECLARE output TEXT DEFAULT '';
    DECLARE iterator INT DEFAULT 1;
    WHILE iterator < (LENGTH(input) + 1) DO
      IF SUBSTRING(input, iterator, 1) IN
         ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю', 'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я')
      THEN
        SET output = CONCAT(output, SUBSTRING(input, iterator, 1));
      END IF;
      SET iterator = iterator + 1;
    END WHILE;
    RETURN output;
  END //

DELIMITER ;

Usage:

-- outputs "hello12356"
SELECT remove_non_numeric_and_letters('hello - 12356-привет ""]')

6

votes

The fastest way I was able to find (and using ) is with convert().

from Doc. CONVERT() with USING is used to convert data between different character sets.

Example:

convert(string USING ascii)

In your case the right character set will be self defined

NOTE from Doc. The USING form of CONVERT() is available as of 4.1.0.

6

votes

Based on the answer by Ryan Shillington, modified to work with strings longer than 255 characters and preserving spaces from the original string.

FYI there is lower(str) in the end.

I used this to compare strings:

DROP FUNCTION IF EXISTS spacealphanum;
DELIMITER $$
CREATE FUNCTION `spacealphanum`( str TEXT ) RETURNS TEXT CHARSET utf8
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret TEXT DEFAULT ''; 
  DECLARE c CHAR(1); 
  SET len = CHAR_LENGTH( str ); 
  REPEAT 
    BEGIN 
      SET c = MID( str, i, 1 ); 
      IF c REGEXP '[[:alnum:]]' THEN 
        SET ret=CONCAT(ret,c); 
      ELSEIF  c = ' ' THEN
          SET ret=CONCAT(ret," ");
      END IF; 
      SET i = i + 1; 
    END; 
  UNTIL i > len END REPEAT; 
  SET ret = lower(ret);
  RETURN ret; 
  END $$
  DELIMITER ;

4

votes

Be careful, characters like ’ or » are considered as alpha by MySQL. It better to use something like :

IF c BETWEEN 'a' AND 'z' OR c BETWEEN 'A' AND 'Z' OR c BETWEEN '0' AND '9' OR c = '-' THEN

4

votes

I have written this UDF. However, it only trims special characters at the beginning of the string. It also converts the string to lower case. You can update this function if desired.

DELIMITER //

DROP FUNCTION IF EXISTS DELETE_DOUBLE_SPACES//

CREATE FUNCTION DELETE_DOUBLE_SPACES ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
    DECLARE result VARCHAR(250);
    SET result = REPLACE( title, '  ', ' ' );
    WHILE (result <> title) DO 
        SET title = result;
        SET result = REPLACE( title, '  ', ' ' );
    END WHILE;
    RETURN result;
END//

DROP FUNCTION IF EXISTS LFILTER//

CREATE FUNCTION LFILTER ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
    WHILE (1=1) DO
        IF(  ASCII(title) BETWEEN ASCII('a') AND ASCII('z')
            OR ASCII(title) BETWEEN ASCII('A') AND ASCII('Z')
            OR ASCII(title) BETWEEN ASCII('0') AND ASCII('9')
        ) THEN
            SET title = LOWER( title );
            SET title = REPLACE(
                REPLACE(
                    REPLACE(
                        title,
                        CHAR(10), ' '
                    ),
                    CHAR(13), ' '
                ) ,
                CHAR(9), ' '
            );
            SET title = DELETE_DOUBLE_SPACES( title );
            RETURN title;
        ELSE
            SET title = SUBSTRING( title, 2 );          
        END IF;
    END WHILE;
END//
DELIMITER ;

SELECT LFILTER(' !@#$%^&*()_+1a    b');

Also, you could use regular expressions but this requires installing a MySql extension.

3

votes

This can be done with a regular expression replacer function I posted in another answer and have blogged about here. It may not be the most efficient solution possible and might look overkill for the job in hand - but like a Swiss army knife, it may come in useful for other reasons.

It can be seen in action removing all non-alphanumeric characters in this Rextester online demo.

SQL (excluding the function code for brevity):

SELECT txt,
       reg_replace(txt,
                   '[^a-zA-Z0-9]+',
                   '',
                   TRUE,
                   0,
                   0
                   ) AS `reg_replaced`
FROM test;

1

votes

I had a similar problem with trying to match last names in our database that were slightly different. For example, sometimes people entered the same person's name as "McDonald" and also as "Mc Donald", or "St John" and "St. John".

Instead of trying to convert the Mysql data, I solved the problem by creating a function (in PHP) that would take a string and create an alpha-only regular expression:

function alpha_only_regex($str) {
    $alpha_only = str_split(preg_replace('/[^A-Z]/i', '', $str));
    return '^[^a-zA-Z]*'.implode('[^a-zA-Z]*', $alpha_only).'[^a-zA-Z]*$';
}

Now I can search the database with a query like this:

$lastname_regex = alpha_only_regex($lastname);
$query = "SELECT * FROM my_table WHERE lastname REGEXP '$lastname_regex';

1

votes

So far, the only alternative approach less complicated than the other answers here is to determine the full set of special characters of the column, i.e. all the special characters that are in use in that column at the moment, and then do a sequential replace of all those characters, e.g.

update pages set slug = lower(replace(replace(replace(replace(name, ' ', ''), '-', ''), '.', ''), '&', '')); # replacing just space, -, ., & only

.

This is only advisable on a known set of data, otherwise it's trivial for some special characters to slip past with a blacklist approach instead of a whitelist approach.

Obviously, the simplest way is to pre-validate the data outside of sql due to the lack of robust built-in whitelisting (e.g. via a regex replace).

0

votes

I needed to get only alphabetic characters of a string in a procedure, and did:

SET @source = "whatever you want";
SET @target = '';
SET @i = 1;
SET @len = LENGTH(@source);
WHILE @i <= @len DO
    SET @char = SUBSTRING(@source, @i, 1);
    IF ((ORD(@char) >= 65 && ORD(@char) <= 90) || (ORD(@char) >= 97 && ORD(@char) <= 122)) THEN
        SET @target = CONCAT(@target, @char);
    END IF;
    SET @i = @i + 1;
END WHILE;

0

votes

Needed to replace non-alphanumeric characters rather than remove non-alphanumeric characters so I have created this based on Ryan Shillington's alphanum. Works for strings up to 255 characters in length

DROP FUNCTION IF EXISTS alphanumreplace; 
DELIMITER | 
CREATE FUNCTION alphanumreplace( str CHAR(255), d CHAR(32) ) RETURNS CHAR(255) 
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret CHAR(32) DEFAULT ''; 
  DECLARE c CHAR(1); 
  SET len = CHAR_LENGTH( str ); 
  REPEAT 
    BEGIN 
      SET c = MID( str, i, 1 ); 
      IF c REGEXP '[[:alnum:]]' THEN SET ret=CONCAT(ret,c); 
      ELSE SET ret=CONCAT(ret,d);
      END IF; 
      SET i = i + 1; 
    END; 
  UNTIL i > len END REPEAT; 
  RETURN ret; 
END | 
DELIMITER ;

Example:

select 'hello world!',alphanum('hello world!'),alphanumreplace('hello world!','-');
+--------------+--------------------------+-------------------------------------+
| hello world! | alphanum('hello world!') | alphanumreplace('hello world!','-') |
+--------------+--------------------------+-------------------------------------+
| hello world! | helloworld               | hello-world-                        |
+--------------+--------------------------+-------------------------------------+

You'll need to add the alphanum function seperately if you want that, I just have it here for the example.

-1

votes

I tried a few solutions but at the end used replace. My data set is part numbers and I fairly know what to expect. But just for sanity, I used PHP to build the long query:

$dirty = array(' ', '-', '.', ',', ':', '?', '/', '!', '&', '@');
$query = 'part_no';
foreach ($dirty as $dirt) {
    $query = "replace($query,'$dirt','')";
}
echo $query;

This outputs something I used to get a headache from:

replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(part_no,' ',''),'-',''),'.',''),',',''),':',''),'?',''),'/',''),'!',''),'&',''),'@','')

-1

votes

if you are using php then....

try{
$con = new PDO ("mysql:host=localhost;dbname=dbasename","root","");
}
catch(PDOException $e){
echo "error".$e-getMessage();   
}

$select = $con->prepare("SELECT * FROM table");
$select->setFetchMode(PDO::FETCH_ASSOC);
$select->execute();

while($data=$select->fetch()){ 

$id = $data['id'];
$column = $data['column'];
$column = preg_replace("/[^a-zA-Z0-9]+/", " ", $column); //remove all special characters

$update = $con->prepare("UPDATE table SET column=:column WHERE id='$id'");
$update->bindParam(':column', $column );
$update->execute();

// echo $column."<br>";
}

-1

votes

the alphanum function (self answered) have a bug, but I don't know why. For text "cas synt ls 75W140 1L" return "cassyntls75W1401", "L" from the end is missing some how.

Now I use

delimiter //
DROP FUNCTION IF EXISTS alphanum //
CREATE FUNCTION alphanum(prm_strInput varchar(255))
RETURNS VARCHAR(255)
DETERMINISTIC
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE v_char VARCHAR(1);
  DECLARE v_parseStr VARCHAR(255) DEFAULT ' ';
WHILE (i <= LENGTH(prm_strInput) )  DO
  SET v_char = SUBSTR(prm_strInput,i,1);
  IF v_char REGEXP  '^[A-Za-z0-9]+$' THEN 
        SET v_parseStr = CONCAT(v_parseStr,v_char);  
  END IF;
  SET i = i + 1;
END WHILE;
RETURN trim(v_parseStr);
END
//

(found on google)

-2

votes

Probably a silly suggestion compared to others:

if(!preg_match("/^[a-zA-Z0-9]$/",$string)){
    $sortedString=preg_replace("/^[a-zA-Z0-9]+$/","",$string);
}

How to remove all non-alpha numeric characters from a string in MySQL?

18 Answers