Strip out invalid and non-ASCII characters from my string in Python

Question

I am trying to format this string and strip out the non-ASCII characters:

import re

text = '''<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>

]>'''
clean = re.sub('[^\x00-\x7f]', "", text)

This does not seem to do the job properly. Does someone have a proper solution? I have also uploaded a file in case Stack Overflow has reformatted the non-ASCII characters.

something like this text = '<contact_number><![CDATA[07744454]]></contact_number>' — Jonas Amara
Possible duplicate of How can I remove non-ASCII characters but leave periods and spaces using Python? — mad_
You don't have non-ASCII characters in your text. You just have characters and numbers. Also, your expected output contains contact_number and should be phone_number, but I assume that is a typo — mad_
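The point made in the last comment can be checked directly. The snippet below is a minimal sketch, assuming the `<0x0C>`-style markers are literal text in the string (as they are displayed in the question) rather than real control bytes. The `sample` string is a hypothetical reconstruction, not the asker's actual data.

```python
import re

# Hypothetical sample: markers such as <0x0C> are assumed to be literal text,
# as displayed in the question, not actual control bytes.
sample = '<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>\n\n]]></phone_number>'

# Every character is already in the ASCII range...
is_all_ascii = all(ord(ch) < 128 for ch in sample)

# ...so the pattern from the question has nothing to remove.
unchanged = re.sub(r'[^\x00-\x7f]', '', sample) == sample

print(is_all_ascii, unchanged)
```

Under that assumption, the substitution is a no-op, which would explain why it "does not seem to do the job".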

mad_ mad_ · Accepted Answer · 2018-09-27T14:47:06

This is not a very generic approach, but the solution below might work for you:

''.join(i for i in text.split() if '<0x' not in i)  # '<phone_number><![CDATA[0145236243]]></phone_number>'

Using regex

re.sub(r'(<0x\w*>)|\s', "", text)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
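For reference, here is a self-contained sketch of the regex approach. The input string is a hypothetical reconstruction based on the expected output above, and the helper names `strip_hex_markers` and `strip_non_printable` are made up for illustration. The second helper covers a different assumption: that the data contains real control characters rather than literal `<0x..>` text.

```python
import re

# Hypothetical sample reconstructed from the expected output in the answer.
sample = '<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>\n\n]]></phone_number>'


def strip_hex_markers(text):
    """Remove literal '<0x..>' markers and all whitespace (hypothetical helper)."""
    return re.sub(r'<0x\w*>|\s', '', text)


def strip_non_printable(text):
    """Keep only printable ASCII, 0x20-0x7E (hypothetical helper).

    Only useful if the string holds real control bytes, not literal markers.
    """
    return re.sub(r'[^\x20-\x7e]', '', text)


cleaned = strip_hex_markers(sample)
print(cleaned)

# Assumed variant of the input where the markers are real control characters.
raw = '<phone_number><![CDATA[0145236243 \x0c\x05\x04\n\n]]></phone_number>'
cleaned_raw = strip_non_printable(raw)
print(repr(cleaned_raw))
```

Note that the `\s` alternative removes every whitespace character, including any spaces that belong to the payload, so it is only safe when the value itself contains no spaces.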
