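【Posted】: 2022-01-04 00:41:41
【Question】:
I am trying to clean up some texts that differ greatly from one another. I want to remove headlines, quotation marks, abbreviations, special symbols, and dots that do not actually end a sentence.
Example input: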
This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓
Example output:
And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.
What I did:
with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()
    data = data.replace("'", '')
    data = data.replace("e.g.", 'for example')
    # and so on
with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)
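My questions (although point 2 is the most important):
1. I would like to have just one string holding this input, but it obviously breaks because of the quotation marks. Is there a way to do this other than working with a file as I did? In practice, I am copy-pasting text and want the application to clean it.
2. The code seems very inefficient, because I am manually writing out the things I remember to remove or clean, and I do not know all the abbreviations. How can I clean everything in one go, so to speak?
3. Is there a way to eliminate the headline and the enumerations, as well as the dot (.) that appears in that German date? My code does not do that.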
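For points 2 and 3, one possible direction is a small line-based filter. The sketch below is only an illustration under several assumptions that are not part of the question: a line without sentence-ending punctuation is treated as a headline, a line starting with a bullet character is treated as an enumeration item, and the abbreviation table (`ABBREVIATIONS`) and the function name `clean_text` are hypothetical and deliberately incomplete.

```python
import re

# Hypothetical, incomplete abbreviation table; extend it for real data.
ABBREVIATIONS = {
    "e.g.": "for example",
    "bzw.": "beziehungsweise",
}

# A line that starts with a bullet character followed by whitespace.
BULLET_RE = re.compile(r"^[•◦*\-]\s+")
# The dot after a day number in a German date such as "2. Dezember 2017".
DATE_DOT_RE = re.compile(r"\b(\d{1,2})\.(?=\s+[A-ZÄÖÜ][a-zäöüß]+\s+\d{4})")
# Anything that is not a word character, whitespace, or common punctuation.
SYMBOL_RE = re.compile(r"[^\w\s.,;:!?()\-]")


def clean_text(text):
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Drop empty lines and enumeration items.
        if not line or BULLET_RE.match(line):
            continue
        # Expand the known abbreviations.
        for short, full in ABBREVIATIONS.items():
            line = line.replace(short, full)
        # Remove the ordinal dot in dates, then quotes and special symbols.
        line = DATE_DOT_RE.sub(r"\1", line)
        line = SYMBOL_RE.sub("", line).strip()
        # Assumption: a line without final punctuation is a headline.
        if line.endswith((".", "!", "?")):
            kept.append(line)
    return " ".join(kept)


SAMPLE = """This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓"""

print(clean_text(SAMPLE))
```

The headline and bullet rules are heuristics and will misfire on texts that do not follow them, for example a headline that ends with a question mark. An abbreviation list cannot be complete either, so unknown abbreviations pass through unchanged.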
Edit: I just remembered something like text = re.sub(r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regular expressions are very inefficient for large texts, aren't they?
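Whether a regular expression is too slow depends on the pattern and on the data, so measuring is probably more reliable than guessing. The sketch below compiles a pattern once and times it on a generated text with `timeit`. The pattern is a simplified stand-in, not the exact expression from the edit above, and `strip_noise` and the generated test text are made up for illustration.

```python
import re
import timeit

# Simplified stand-in pattern (an assumption): URLs, or any character
# that is not an ASCII letter, digit, space, or tab.
PATTERN = re.compile(r"\w+://\S+|[^0-9A-Za-z \t]")


def strip_noise(text):
    # Remove every match of the precompiled pattern.
    return PATTERN.sub("", text)


# Artificial test text; real data may behave differently.
big_text = "Visit https://example.com now! ✓ " * 20_000
seconds = timeit.timeit(lambda: strip_noise(big_text), number=3)
print(f"{len(big_text):,} characters, 3 passes: {seconds:.3f} s")
```

Running this against a representative sample of the real texts should show whether the regex step is actually a bottleneck.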
【Comments】:
- You can put the string into an ordinary variable by using triple quotes.
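A minimal sketch of that suggestion, using part of the example input from the question (the variable names `raw_text` and `cleaned` are made up here): a triple-quoted string may span several lines and contain both `'` and `"` without escaping.

```python
# Pasted text with both kinds of quotation marks, held in one string.
raw_text = """This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German."""

# Remove both kinds of quotation marks, as in the file-based attempt.
cleaned = raw_text.replace("'", "").replace('"', "")
print(cleaned)
```

This only works as long as the pasted text does not itself contain three double quotes in a row and does not end with a double quote directly before the closing delimiter. For text pasted at run time, reading it from a file or from standard input avoids the quoting problem altogether.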
Tags: python string text replace nlp