Stripping punctuation from text in Python [duplicate]: answers

【Question title】: Stripping punctuation from text in Python [duplicate]
【Posted】: 2017-03-02 03:49:48
【Question description】:

I am trying to get the tokens (words) from a text file and strip all punctuation from them. This is what I have tried:

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I get the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

【Question comments】:

Tags: python string mapreduce nlp special-characters

【Solution 1】:

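There are a couple of problems in your script. After the split, sent is a list of tokens rather than a string, so re.sub cannot be applied to it directly. You are also trying to remove the special characters only after everything has been split.

A better approach is to read the input string, remove the special characters, and then tokenize it.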

import re

# open the input text file and read its contents
with open('hw.txt') as f:
    text = f.read()
print(text)

# remove the special characters from the text
no_specials_string = re.sub('[!#?,.:";]', '', text)
print(no_specials_string)

# split the text and store the words in a list
words = no_specials_string.split()
print(words)

Alternatively, if you want to split into tokens first and then remove the special characters, you can do this:

import re

# open the input text file and read its contents
with open('hw.txt') as f:
    text = f.read()
print(text)

# split the text and store the words in a list
words = text.split()
print(words)

# remove the special characters from each word in words
new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
print(new_words)
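
As a minimal sketch, assuming an in-memory sample string instead of hw.txt, the two orderings can be compared side by side. The sample text and the names PUNCT_PATTERN, strip_then_split, and split_then_strip are illustrative and do not come from the answer.

```python
import re

# Same character class as in the answer above; compiled once for reuse.
PUNCT_PATTERN = re.compile(r'[!#?,.:";]')

def strip_then_split(text):
    """Remove the listed punctuation first, then split on whitespace."""
    return PUNCT_PATTERN.sub('', text).split()

def split_then_strip(text):
    """Split on whitespace first, then clean each token individually."""
    return [PUNCT_PATTERN.sub('', token) for token in text.split()]

sample = 'Hello, world! Is this "tokenized"?'  # hypothetical input
print(strip_then_split(sample))
print(split_then_strip(sample))
```

For this sample both functions return the same tokens. They can differ when a token consists only of punctuation: splitting first leaves an empty string in the list, while stripping first drops the token entirely.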

【Comments】:

【Solution 2】:

re.sub operates on a string, not on a list!

print(re.sub(pattern, '', sent))

should be

print([re.sub(pattern, '', s) for s in sent])

Hope this helps!
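
The difference can be reproduced with a short, hedged sketch. The token list below is a hypothetical stand-in for the list built in the question, and the exact wording of the error message depends on the Python version.

```python
import re

pattern = r'[!#?,.:";]'        # character class from the question
sent = ['hello,', 'world!']    # hypothetical token list

try:
    re.sub(pattern, '', sent)  # passing a list raises TypeError
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', error_message)

# Applying re.sub to each string in the list works.
cleaned = [re.sub(pattern, '', s) for s in sent]
print(cleaned)
```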

【Comments】:

【Solution 3】:

Use the remove_puncts() function below.

import string
translator = str.maketrans('', '', string.punctuation)
def remove_puncts(input_string):
    return input_string.translate(translator)

Example usage

input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`~!@#$%svbrxs"""
remove_puncts(input_string)
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
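
To connect this back to the question, here is a hedged sketch that applies the same kind of translation table to a line of text and returns clean tokens. The helper name clean_tokens and the sample line are assumptions made for illustration. Note that string.punctuation covers ASCII punctuation only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
translator = str.maketrans('', '', string.punctuation)

def clean_tokens(line):
    """Lowercase a line, split on whitespace, strip punctuation from each
    token, and drop tokens that end up empty."""
    tokens = (word.translate(translator) for word in line.lower().split())
    return [token for token in tokens if token]

print(clean_tokens('Hello, World! -- it works.'))
```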

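Edit

Speed comparison

In this benchmark the translator approach is faster than regex substitution. Note that the two functions do not remove the same set of characters: the regex covers only a handful of punctuation marks, while the translator removes everything in string.punctuation.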

import re, string
from time import time

pattern = '[!#?,.:";]'
def regex_sub(input_string):
    return re.sub(pattern, '', input_string)

translator = str.maketrans('', '', string.punctuation)
def string_translator(input_string):
    return input_string.translate(translator)

input_string = """cwsx#?;.frvcdr"""
string_translator(input_string)
regex_sub(input_string)

passes = 1000000
t1 = time()
for i in range(passes):
    a = string_translator(input_string)

t2 = time()
for i in range(passes):
    a = regex_sub(input_string)

t3 = time()

string_translator_time = t2 - t1
regex_sub_time = t3 - t2

print(string_translator_time) # 1.341651439666748
print(regex_sub_time) # 3.44773268699646
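
As an alternative sketch, the standard timeit module can run the same comparison. This version assumes a regex built from string.punctuation so that both functions remove the same characters, which differs from the pattern used above. The number of passes is arbitrary, and no timings are shown because they vary by machine and Python version.

```python
import re
import string
import timeit

# Regex that removes the same characters as the translation table,
# so both functions do equivalent work (an assumption of this sketch).
punct_regex = re.compile('[' + re.escape(string.punctuation) + ']')
translator = str.maketrans('', '', string.punctuation)

def regex_sub(input_string):
    return punct_regex.sub('', input_string)

def string_translator(input_string):
    return input_string.translate(translator)

input_string = 'cwsx#?;.frvcdr'

# Both approaches should agree before their speed is compared.
assert regex_sub(input_string) == string_translator(input_string)

passes = 100000
translate_time = timeit.timeit(lambda: string_translator(input_string), number=passes)
regex_time = timeit.timeit(lambda: regex_sub(input_string), number=passes)
print('translate:', translate_time)
print('regex:    ', regex_time)
```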

【Comments】:

【Solution 4】:

Nothing is being read into your list:

In [14]: with open('data', 'r') as f:
    ...:     l=f.readlines()[254:]
    ...:     

In [15]: l
Out[15]: []

Assuming you want a list of words, try this:

import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]

sent = [w for line in lines[:254] for w in re.split(r'\s+', line)]

find = '[!#?,.:";]'
replace = ''

words = [re.sub(find, replace, word) for word in sent]

The re.sub call has been corrected, as pointed out by @Keerthana Prabhakaran.
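
If the goal is still to skip the first 254 lines as in the question, the following sketch shows one way to do that without loading the skipped lines into a list. The helper name words_after and the temporary demo file are assumptions for illustration. When the file has fewer lines than the offset, the result is simply an empty list.

```python
import os
import re
import tempfile
from itertools import islice

def words_after(path, skip, pattern=r'[!#?,.:";]'):
    """Return lowercased words from `path`, skipping the first `skip` lines
    and removing the characters matched by `pattern` from each word."""
    with open(path) as f:
        # islice skips lines lazily; a short file simply yields no lines.
        tokens = [word for line in islice(f, skip, None)
                  for word in line.lower().split()]
    return [re.sub(pattern, '', word) for word in tokens]

# Demonstration with a temporary three-line file (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Header line.\nHello, world!\nBye; now.\n')

result = words_after(tmp.name, 1)
empty = words_after(tmp.name, 254)  # the file has fewer than 254 lines
print(result)
print(empty)
os.remove(tmp.name)
```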

【Comments】:

This still leaves the error in place!
The error is return _compile(pattern, flags).sub(repl, string, count), because sent is a list here!!