34
votes

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit. When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters. What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.

I would like to do :

SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')

but Oracle does not implement the [:ascii:] character class.

Is there a simple way doing what I want to do?

18
You'd probably want to replace ç by c etcetera. Throwing the entire character away is worse than discarding the diacritical mark.MSalters
We 1st need to find out what the characters are before deciding what to do with them.Paul Gilfedder

18 Answers

29
votes

I think this will do the trick:

SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
24
votes

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...

UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')

...where field and table are your field and table names respectively.

10
votes

I wouldn't recommend it for production code, but it makes sense and seems to work:

SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
7
votes

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

5
votes

The select may look like the following sample:

select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)  
order by nvalue;
3
votes

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.

Create a PLSQL function to receive your input string and return a varchar2.

In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.

That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.

3
votes

The following also works:

select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
          CONVERT (
             '3735844533120%$03  ',
             'US7ASCII',
             'WE8ISO8859P1'),
          '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
  FROM DUAL);
2
votes

I had a similar issue and blogged about it here. I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:

select dump(a,1016), a, b
from
 (select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
         COLUMN b
  from TABLE)
where a is not null
order by a;

I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.

2
votes

I found the answer here:

http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html

CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/

Then run this to update your data

update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
2
votes

Try the following:

-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')

-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
1
votes

You can try something like following to search for the column containing non-ascii character :

select * from your_table where your_col <> asciistr(your_col);
1
votes

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.

I tried this from an excellent comment

'[^ -~|[:space:]]'

but got this ORA-12728: invalid range in regular expression .

but it lead me to my solution:

select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from  
    (select '-   <- strangest thing here, and I want to keep line break after
-' deta from dual ) t

displays (in my TOAD tool) as

in my toad tool

  • replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)
0
votes

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.

Here is the simple test in Oracle 11.2.03

select s
     , regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
     , dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
  from (select 127+level l,chr(127+level) c from dual connect by level < 129))

And "rep 127-255" is

Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

i.e for some reason this version of Oracle does not replace char(226) and above. Using '['||chr(127)||'-'||chr(225)||']' gives the desired result. If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).

0
votes

Please note that whenever you use

regexp_like(column, '[A-Z]')

Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.

Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.

Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like

  • å => a
  • ä => a
  • ö => o

of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.

0
votes

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.

REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))

I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

0
votes

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:

SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;

This will replace anything outside that printable range as a question mark.

This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

-2
votes

Do this, it will work.

trim(replace(ntwk_slctor_key_txt, chr(0), ''))
-3
votes

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:

SELECT est.clients_ref
  ,TRANSLATE (
              est.clients_ref
             ,   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
              || REPLACE (
                          TRANSLATE (
                                     est.clients_ref
                                    ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                                    ,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
                                    )
                         ,'~'
                         )
             ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
             )
      clean_ref

FROM edms_staging_table est