Removing invalid and non-ASCII characters from a string in Python: answers

【Question title】: Strip out invalid and non-ASCII characters from my string in Python
【Posted】: 2019-03-03 10:54:27
【Question description】:

I am trying to format this string and strip out the non-ASCII characters:

import re

text = '''<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>

]>'''
clean = re.sub('[^\x00-\x7f]', "", text)

This does not seem to work properly. Does anyone have a proper solution? I have also uploaded a file in case Stack Overflow reformatted the non-ASCII characters.

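【Question comments】:

What is the expected output?
Something like this: text = ''
Possible duplicate of "How can I remove non-ASCII characters but leave periods and spaces using Python?"
All the characters in your example are ASCII characters.
There are no non-ASCII characters in your text; you only have letters and digits. Your expected output also contains contact_number where it should be phone_number, but I assume that is a typo.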

Tags: python regex ascii

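【Solution 1】:

This link has a similar solution covering all non-UTF-8 characters: "Regular expression that finds and replaces non-ascii characters with Python".

You can try using str.encode() and str.decode() to achieve this.

Once the unwanted characters have been identified, you can replace them.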
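A minimal sketch of the encode/decode approach, assuming the goal is simply to drop every character outside the ASCII range. The helper name `strip_non_ascii` and the sample string are hypothetical and do not come from the question.

```python
def strip_non_ascii(s: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters that ASCII cannot represent
    return s.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical input containing two non-ASCII characters (é and a check mark)
sample = "caf\u00e9 0145236243 \u2713"
print(strip_non_ascii(sample))
```

Note that this leaves ASCII control characters such as form feed (`\x0c`) in place, because they are valid ASCII. Likewise, the literal text `<0x0C>` in the question consists only of ASCII characters, so this approach does not remove it.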

【Comments】:

【Solution 2】:

This is not a very general solution, but the following may work for you:

''.join(i for i in text.split() if '<0x' not in i)  # '<phone_number><![CDATA[0145236243]]></phone_number>'

Using a regular expression:

re.sub(r'(<0x\w*>)|\s', "", text)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
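The regex variant can be tried end to end as sketched below. Two assumptions apply: the sample string is reconstructed from the expected output shown in this answer (the string in the question appears truncated), and the names `MARKER_OR_SPACE` and `remove_hex_markers` are hypothetical.

```python
import re

# Matches a literal marker such as <0x0C> or <0x4>, or any single whitespace character
MARKER_OR_SPACE = re.compile(r"<0x\w*>|\s")


def remove_hex_markers(s: str) -> str:
    """Remove literal <0x..> markers and all whitespace from s."""
    return MARKER_OR_SPACE.sub("", s)


# Assumed sample input, reconstructed for illustration only
text = "<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4>\n\n]]></phone_number>"
print(remove_hex_markers(text))
```

Because `\s` matches every whitespace character, this also removes any spaces inside legitimate values. If that matters, a narrower pattern that targets only the markers and the whitespace around them would be safer.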

【Comments】: