Replace punctuation with space in text: answers

【Question title】: replace punctuation with space in text
【Posted】: 2021-10-05 23:14:02
【Question description】:

I have text like this: Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office. I removed the punctuation from this text with the following code.

import string
string.punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
# store the punctuation-free text
df_Train['BULLET_POINTS']= df_Train['BULLET_POINTS'].apply(lambda x:remove_punctuation(x))
df_Train.head()

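In the code above, df_Train is a pandas DataFrame whose 'BULLET_POINTS' column contains text of the kind shown above. The result I get is Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office. Note how the words Eksioglu and Handsome are joined together, because there is no space after the comma. I need a way to fix this.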

【Question comments】:

Don't delete the punctuation; replace it with a space instead.
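One way to follow that suggestion without a regex is sketched below, using `str.translate` to map every character in `string.punctuation` to a space. The helper name `punctuation_to_space` and the shortened sample string are hypothetical, and `string.punctuation` only covers ASCII punctuation, so this is an illustration rather than a complete solution.

```python
import string

# Translation table: each ASCII punctuation character maps to a single space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_space(text):
    """Replace ASCII punctuation with spaces, then collapse whitespace runs."""
    spaced = text.translate(_PUNCT_TO_SPACE)
    # split() with no arguments splits on any whitespace run and drops empty
    # pieces, so re-joining with " " also removes leading/trailing spaces.
    return " ".join(spaced.split())

sample = "Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office."
print(punctuation_to_space(sample))
```

With the DataFrame from the question, the helper could presumably be applied as `df_Train['BULLET_POINTS'].apply(punctuation_to_space)`.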

Tags: python regex pandas dataframe

【Solution 1】:

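In this case it makes sense to replace each run of special characters with a single space and then strip the result (whitespace already present in the text is left as is):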

df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()

Or, if you have runs of punctuation plus whitespace that should collapse into a single space:

df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()

Output:

>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0    Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet  Ideal for home or office
Name: BULLET_POINTS, dtype: object

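The (?:[^\w\s]|_)+ regex matches one or more consecutive characters that are either underscores or neither word nor whitespace characters (i.e. a run of non-alphanumeric, non-whitespace characters), and each such run is replaced with a single space.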

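The [\W_]+ pattern is similar but also matches whitespace, so a run such as a comma followed by a space is replaced with one space.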

The .str.strip() part is needed because the replacement may leave leading/trailing spaces.
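To make the difference between the two patterns concrete, here is a minimal sketch using the standard `re` module, which uses the same pattern syntax as pandas `.str.replace(..., regex=True)`. The sample string and variable names are made up for illustration.

```python
import re

sample = "Dogan Eksioglu,Handsome hard_magnet, Ideal for office."

# Pattern 1: only runs of punctuation/underscores are replaced; existing
# whitespace is untouched, so ", " ends up as two spaces.
keep_ws = re.sub(r'(?:[^\w\s]|_)+', ' ', sample).strip()

# Pattern 2: runs of punctuation, underscores AND whitespace become one space.
collapse_ws = re.sub(r'[\W_]+', ' ', sample).strip()

# Without strip(), the final "." leaves a trailing space behind.
unstripped = re.sub(r'[\W_]+', ' ', sample)

print(repr(keep_ws))
print(repr(collapse_ws))
print(repr(unstripped))
```

The first result still contains a double space after "magnet", matching the output shown above, while the second has single spaces throughout.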

【Discussion】: