0
votes

I'm trying to clean up this string and strip out the non-ASCII characters:

import re

text = '<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>\n\n]]></phone_number>'
clean = re.sub(r'[^\x00-\x7f]', '', text)

This does not seem to do the job properly. Does anyone have a proper solution? I have also uploaded a file in case Stack Overflow has reformatted the non-ASCII characters.

2
What is the expected output? - mad_
Something like this: text = '<contact_number><![CDATA[07744454]]></contact_number>' - Jonas Amara
All the characters in your example are ASCII characters. - Gsk
You don't have any non-ASCII characters in your text. You just have ordinary characters and numbers. Also, your expected output contains contact_number where it should be phone_number, but I assume that is a typo. - mad_

2 Answers

1
votes

This is not a very generic solution, but it might work for you:

''.join([i for i in text.split() if '<0x' not in i])  # '<phone_number><![CDATA[0145236243]]></phone_number>'

Using regex:

re.sub(r'(<0x\w*>)|\s', '', text)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
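If the `<0x0C>` markers are really how an editor displays raw control bytes rather than literal text in the string, the patterns above will not match them. The pattern in the question, `[^\x00-\x7f]`, will not either, because control characters are still ASCII. Here is a minimal sketch that covers both cases, assuming that reading of the input. `strip_control_markup` is a hypothetical helper name, and the `raw` sample is a guess at what the uploaded file might contain.

```python
import re

def strip_control_markup(text):
    """Remove literal <0xNN> markers, ASCII control characters, and spaces."""
    # Literal markers such as "<0x0C>" or "<0x4>" (one or two hex digits).
    text = re.sub(r'<0x[0-9A-Fa-f]{1,2}>', '', text)
    # Raw control characters (0x00-0x1F, 0x7F) and the space character;
    # this range also covers tabs and newlines.
    return re.sub(r'[\x00-\x20\x7f]', '', text)

# Case 1: the markers are literal text, as the answers above assume.
literal = '<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>\n\n]]></phone_number>'
# Case 2 (assumption): the markers stand for real control bytes.
raw = '<phone_number><![CDATA[0145236243 \x0c\x05\x04\n\n]]></phone_number>'

print(strip_control_markup(literal))
print(strip_control_markup(raw))
```

Both calls should print the same cleaned string. Like the regex above, this removes every space, so it only suits values such as phone numbers where inner whitespace is not meaningful.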
0
votes

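This question also has a similar solution for all non-ASCII characters: "Regular expression that finds and replaces non-ascii characters with Python".

You can also try using str.encode() and str.decode() for this purpose: encode to ASCII with an error handler such as errors='ignore' or errors='replace', then decode back to a string. That way the non-ASCII characters are dropped or replaced.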
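A minimal sketch of the encode/decode approach, for text that really does contain non-ASCII characters. The sample string and the `to_ascii` name are made up for illustration. Note that this would leave the string from the question unchanged, since all of its characters are already ASCII.

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors='ignore' silently skips characters outside the ASCII range;
    # errors='replace' would substitute '?' for each of them instead.
    return text.encode('ascii', errors='ignore').decode('ascii')

sample = 'caf\u00e9 0145236243 \u2013 ok'  # hypothetical input with two non-ASCII characters
print(to_ascii(sample))  # caf 0145236243  ok
```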