I'm working on a routine that compares strings, but for better efficiency I need to remove all characters that are not letters or numbers.
I'm using multiple REPLACE
functions now, but maybe there is a faster and nicer solution ?
I'm working on a routine that compares strings, but for better efficiency I need to remove all characters that are not letters or numbers.
I'm using multiple REPLACE
functions now, but maybe there is a faster and nicer solution ?
None of these answers worked for me. I had to create my own function called alphanum which stripped the chars for me:
DROP FUNCTION IF EXISTS alphanum;
DELIMITER |
CREATE FUNCTION alphanum( str CHAR(255) ) RETURNS CHAR(255) DETERMINISTIC
BEGIN
DECLARE i, len SMALLINT DEFAULT 1;
DECLARE ret CHAR(255) DEFAULT '';
DECLARE c CHAR(1);
IF str IS NOT NULL THEN
SET len = CHAR_LENGTH( str );
REPEAT
BEGIN
SET c = MID( str, i, 1 );
IF c REGEXP '[[:alnum:]]' THEN
SET ret=CONCAT(ret,c);
END IF;
SET i = i + 1;
END;
UNTIL i > len END REPEAT;
ELSE
SET ret='';
END IF;
RETURN ret;
END |
DELIMITER ;
Now I can do:
select 'This works finally!', alphanum('This works finally!');
and I get:
+---------------------+---------------------------------+
| This works finally! | alphanum('This works finally!') |
+---------------------+---------------------------------+
| This works finally! | Thisworksfinally |
+---------------------+---------------------------------+
1 row in set (0.00 sec)
Hurray!
From a performance point of view, (and on the assumption that you read more than you write)
I think the best way would be to pre calculate and store a stripped version of the column, This way you do the transform less.
You can then put an index on the new column and get the database to do the work for you.
SELECT teststring REGEXP '[[:alnum:]]+';
SELECT * FROM testtable WHERE test REGEXP '[[:alnum:]]+';
See: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
Scroll down to the section that says: [:character_class:]
If you want to manipulate strings the fastest way will be to use a str_udf, see:
https://github.com/hholzgra/mysql-udf-regexp
Since MySQL 8.0 you can use regular expression to remove non alphanumeric characters from a string. There is method REGEXP_REPLACE
Here is the code to remove non-alphanumeric characters:
UPDATE {table} SET {column} = REGEXP_REPLACE({column}, '[^0-9a-zA-Z ]', '')
Straight and battletested solution for latin and cyrillic characters:
DELIMITER //
CREATE FUNCTION `remove_non_numeric_and_letters`(input TEXT)
RETURNS TEXT
BEGIN
DECLARE output TEXT DEFAULT '';
DECLARE iterator INT DEFAULT 1;
WHILE iterator < (LENGTH(input) + 1) DO
IF SUBSTRING(input, iterator, 1) IN
('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю', 'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я')
THEN
SET output = CONCAT(output, SUBSTRING(input, iterator, 1));
END IF;
SET iterator = iterator + 1;
END WHILE;
RETURN output;
END //
DELIMITER ;
Usage:
-- outputs "hello12356"
SELECT remove_non_numeric_and_letters('hello - 12356-привет ""]')
The fastest way I was able to find (and using ) is with convert().
from Doc. CONVERT() with USING is used to convert data between different character sets.
Example:
convert(string USING ascii)
In your case the right character set will be self defined
NOTE from Doc. The USING form of CONVERT()
is available as of 4.1.0.
Based on the answer by Ryan Shillington, modified to work with strings longer than 255 characters and preserving spaces from the original string.
FYI there is lower(str)
in the end.
I used this to compare strings:
DROP FUNCTION IF EXISTS spacealphanum;
DELIMITER $$
CREATE FUNCTION `spacealphanum`( str TEXT ) RETURNS TEXT CHARSET utf8
BEGIN
DECLARE i, len SMALLINT DEFAULT 1;
DECLARE ret TEXT DEFAULT '';
DECLARE c CHAR(1);
SET len = CHAR_LENGTH( str );
REPEAT
BEGIN
SET c = MID( str, i, 1 );
IF c REGEXP '[[:alnum:]]' THEN
SET ret=CONCAT(ret,c);
ELSEIF c = ' ' THEN
SET ret=CONCAT(ret," ");
END IF;
SET i = i + 1;
END;
UNTIL i > len END REPEAT;
SET ret = lower(ret);
RETURN ret;
END $$
DELIMITER ;
I have written this UDF. However, it only trims special characters at the beginning of the string. It also converts the string to lower case. You can update this function if desired.
DELIMITER //
DROP FUNCTION IF EXISTS DELETE_DOUBLE_SPACES//
CREATE FUNCTION DELETE_DOUBLE_SPACES ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
DECLARE result VARCHAR(250);
SET result = REPLACE( title, ' ', ' ' );
WHILE (result <> title) DO
SET title = result;
SET result = REPLACE( title, ' ', ' ' );
END WHILE;
RETURN result;
END//
DROP FUNCTION IF EXISTS LFILTER//
CREATE FUNCTION LFILTER ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
WHILE (1=1) DO
IF( ASCII(title) BETWEEN ASCII('a') AND ASCII('z')
OR ASCII(title) BETWEEN ASCII('A') AND ASCII('Z')
OR ASCII(title) BETWEEN ASCII('0') AND ASCII('9')
) THEN
SET title = LOWER( title );
SET title = REPLACE(
REPLACE(
REPLACE(
title,
CHAR(10), ' '
),
CHAR(13), ' '
) ,
CHAR(9), ' '
);
SET title = DELETE_DOUBLE_SPACES( title );
RETURN title;
ELSE
SET title = SUBSTRING( title, 2 );
END IF;
END WHILE;
END//
DELIMITER ;
SELECT LFILTER(' !@#$%^&*()_+1a b');
Also, you could use regular expressions but this requires installing a MySql extension.
This can be done with a regular expression replacer function I posted in another answer and have blogged about here. It may not be the most efficient solution possible and might look overkill for the job in hand - but like a Swiss army knife, it may come in useful for other reasons.
It can be seen in action removing all non-alphanumeric characters in this Rextester online demo.
SQL (excluding the function code for brevity):
SELECT txt,
reg_replace(txt,
'[^a-zA-Z0-9]+',
'',
TRUE,
0,
0
) AS `reg_replaced`
FROM test;
I had a similar problem with trying to match last names in our database that were slightly different. For example, sometimes people entered the same person's name as "McDonald" and also as "Mc Donald", or "St John" and "St. John".
Instead of trying to convert the Mysql data, I solved the problem by creating a function (in PHP) that would take a string and create an alpha-only regular expression:
function alpha_only_regex($str) {
$alpha_only = str_split(preg_replace('/[^A-Z]/i', '', $str));
return '^[^a-zA-Z]*'.implode('[^a-zA-Z]*', $alpha_only).'[^a-zA-Z]*$';
}
Now I can search the database with a query like this:
$lastname_regex = alpha_only_regex($lastname);
$query = "SELECT * FROM my_table WHERE lastname REGEXP '$lastname_regex';
So far, the only alternative approach less complicated than the other answers here is to determine the full set of special characters of the column, i.e. all the special characters that are in use in that column at the moment, and then do a sequential replace of all those characters, e.g.
update pages set slug = lower(replace(replace(replace(replace(name, ' ', ''), '-', ''), '.', ''), '&', '')); # replacing just space, -, ., & only
.
This is only advisable on a known set of data, otherwise it's trivial for some special characters to slip past with a blacklist approach instead of a whitelist approach.
Obviously, the simplest way is to pre-validate the data outside of sql due to the lack of robust built-in whitelisting (e.g. via a regex replace).
I needed to get only alphabetic characters of a string in a procedure, and did:
SET @source = "whatever you want";
SET @target = '';
SET @i = 1;
SET @len = LENGTH(@source);
WHILE @i <= @len DO
SET @char = SUBSTRING(@source, @i, 1);
IF ((ORD(@char) >= 65 && ORD(@char) <= 90) || (ORD(@char) >= 97 && ORD(@char) <= 122)) THEN
SET @target = CONCAT(@target, @char);
END IF;
SET @i = @i + 1;
END WHILE;
Needed to replace non-alphanumeric characters rather than remove non-alphanumeric characters so I have created this based on Ryan Shillington's alphanum. Works for strings up to 255 characters in length
DROP FUNCTION IF EXISTS alphanumreplace;
DELIMITER |
CREATE FUNCTION alphanumreplace( str CHAR(255), d CHAR(32) ) RETURNS CHAR(255)
BEGIN
DECLARE i, len SMALLINT DEFAULT 1;
DECLARE ret CHAR(32) DEFAULT '';
DECLARE c CHAR(1);
SET len = CHAR_LENGTH( str );
REPEAT
BEGIN
SET c = MID( str, i, 1 );
IF c REGEXP '[[:alnum:]]' THEN SET ret=CONCAT(ret,c);
ELSE SET ret=CONCAT(ret,d);
END IF;
SET i = i + 1;
END;
UNTIL i > len END REPEAT;
RETURN ret;
END |
DELIMITER ;
Example:
select 'hello world!',alphanum('hello world!'),alphanumreplace('hello world!','-');
+--------------+--------------------------+-------------------------------------+
| hello world! | alphanum('hello world!') | alphanumreplace('hello world!','-') |
+--------------+--------------------------+-------------------------------------+
| hello world! | helloworld | hello-world- |
+--------------+--------------------------+-------------------------------------+
You'll need to add the alphanum function seperately if you want that, I just have it here for the example.
I tried a few solutions but at the end used replace
. My data set is part numbers and I fairly know what to expect. But just for sanity, I used PHP to build the long query:
$dirty = array(' ', '-', '.', ',', ':', '?', '/', '!', '&', '@');
$query = 'part_no';
foreach ($dirty as $dirt) {
$query = "replace($query,'$dirt','')";
}
echo $query;
This outputs something I used to get a headache from:
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(part_no,' ',''),'-',''),'.',''),',',''),':',''),'?',''),'/',''),'!',''),'&',''),'@','')
if you are using php then....
try{
$con = new PDO ("mysql:host=localhost;dbname=dbasename","root","");
}
catch(PDOException $e){
echo "error".$e-getMessage();
}
$select = $con->prepare("SELECT * FROM table");
$select->setFetchMode(PDO::FETCH_ASSOC);
$select->execute();
while($data=$select->fetch()){
$id = $data['id'];
$column = $data['column'];
$column = preg_replace("/[^a-zA-Z0-9]+/", " ", $column); //remove all special characters
$update = $con->prepare("UPDATE table SET column=:column WHERE id='$id'");
$update->bindParam(':column', $column );
$update->execute();
// echo $column."<br>";
}
the alphanum function (self answered) have a bug, but I don't know why. For text "cas synt ls 75W140 1L" return "cassyntls75W1401", "L" from the end is missing some how.
Now I use
delimiter //
DROP FUNCTION IF EXISTS alphanum //
CREATE FUNCTION alphanum(prm_strInput varchar(255))
RETURNS VARCHAR(255)
DETERMINISTIC
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE v_char VARCHAR(1);
DECLARE v_parseStr VARCHAR(255) DEFAULT ' ';
WHILE (i <= LENGTH(prm_strInput) ) DO
SET v_char = SUBSTR(prm_strInput,i,1);
IF v_char REGEXP '^[A-Za-z0-9]+$' THEN
SET v_parseStr = CONCAT(v_parseStr,v_char);
END IF;
SET i = i + 1;
END WHILE;
RETURN trim(v_parseStr);
END
//
(found on google)