Removing various symbols from a text: answers

【Question title】: Removing various symbols from a text
【Posted】: 2022-01-04 00:41:41
【Question】:

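I am trying to clean up some texts that differ a lot from one another. I would like to remove headlines, quotation marks, abbreviations, special symbols, and dots that do not actually end a sentence.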

Example input:

This is a headline

And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and 
Sometimes there are special symbols. ✓

Example output:

And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.

What I did:

with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()
    data = data.replace("'", '')
    data = data.replace("e.g.", 'for example')
    # and so on
with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)

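My questions (point 2 is the most important one):

1. I would like to have this input in a single string, but it obviously breaks because of the quotation marks. Is there a way to do this other than working with a file as I did? In practice I copy-paste a text and want the application to clean it.
2. The code seems very inefficient, because I just manually write down the things I remember to delete or clean, and I do not know all the abbreviations. How can I clean it in one go, so to speak?
3. Is there a way to eliminate the headline and the enumeration, as well as the dot that appears in that German date? My code does not do that.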

Edit: I just remembered something like text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regular expressions are very inefficient for large texts, aren't they?

【Comments】:

You can use triple quotes to put the string into an ordinary variable.
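
A minimal sketch of this suggestion: a triple-quoted string literal can hold pasted text that contains both kinds of quotation marks as well as line breaks, so no file round-trip is needed. The variable names raw_text and cleaned are illustrative, and the sample is shortened from the question.

```python
# Triple quotes let the literal contain ' and " as well as newlines.
# raw_text is a hypothetical name; the sample is shortened from the question.
raw_text = """This is a headline

And inside the text there are 'abbreviations', e.g. "bzw." in German.
Sometimes there are special symbols. ✓"""

# The same replacements as in the question, applied directly to the string.
cleaned = raw_text.replace("'", "").replace('"', "")
cleaned = cleaned.replace("e.g.", "for example")
print(cleaned)
```

If the pasted text may contain backslashes, a raw triple-quoted literal (r"""...""") avoids accidental escape sequences.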

Tags: python string text replace nlp

【Solution 1】:

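To easily remove all non-standard symbols, you can use str.isalnum(), which returns True only when every character in the string is alphanumeric, or str.isascii(), which returns True only when every character is ASCII. str.isprintable() also seems viable. The full list can be found in the Python documentation on string methods. Using these functions, you can iterate over the string and filter each character. So something like this: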

# filter() is lazy, so join the kept characters back into a string
filteredData = "".join(filter(str.isidentifier, data))

You can also combine them by writing a function that applies several of these string checks, like this:

def FilterKey(char: str) -> bool:
    return char.isidentifier() and char.isalpha()

It can be used in a filter like this:

filteredData = "".join(filter(FilterKey, data))

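If the function returns True for a character, that character is included in the output; if it returns False, the character is excluded.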
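
To make the filtering behaviour concrete, here is a small sketch that uses str.isalnum on a made-up sample string. The names sample and kept are illustrative. Because filter() yields characters one at a time, the result is joined back into a string. A plain isalnum filter also drops spaces and punctuation, which may be more aggressive than intended.

```python
sample = "Sometimes there are special symbols. ✓ • 2. Dezember"

# str.isalnum is called once per character; only characters for which it
# returns True are kept. Spaces, dots and symbols all return False.
kept = "".join(filter(str.isalnum, sample))
print(kept)

# Non-ASCII letters such as German umlauts still count as alphanumeric,
# whereas str.isascii rejects them.
print("ä".isalnum(), "ä".isascii())
```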

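You can also extend this by adding your own checks on the character to the function's return expression. Then, to remove larger chunks of the string, you can use the usual str.replace(old, new) function.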
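
One way this extension might look, as a sketch under assumptions: the predicate below also keeps whitespace and a small set of sentence punctuation, and a dictionary lists abbreviations to expand with str.replace before filtering. The names ABBREVIATIONS, ALLOWED_PUNCTUATION, keep_char and clean, as well as the contents of the abbreviation table, are hypothetical and would need to be adapted to the actual texts.

```python
# Hypothetical replacement table; extend it with the abbreviations you expect.
ABBREVIATIONS = {
    "e.g.": "for example",
    "bzw.": "beziehungsweise",
}

# Assumed set of punctuation worth keeping so sentences stay readable.
ALLOWED_PUNCTUATION = set(".,;:!?")


def keep_char(char: str) -> bool:
    """Return True for letters, digits, whitespace and allowed punctuation."""
    return char.isalnum() or char.isspace() or char in ALLOWED_PUNCTUATION


def clean(text: str) -> str:
    # Replace larger chunks first, while their dots are still present.
    for old, new in ABBREVIATIONS.items():
        text = text.replace(old, new)
    # Then drop every character the predicate rejects.
    return "".join(filter(keep_char, text))


print(clean("There are 'abbreviations', e.g. \"bzw.\" in German. ✓"))
```

This sketch only covers single characters and known abbreviations. It does not try to remove headlines or enumerations, or the dot in German dates.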
