Python从unicode字符串中删除标点符号，撇号除外答案

【问题标题】：Python removing punctuation from unicode string except apostrophePython从unicode字符串中删除标点符号，撇号除外
【发布时间】：2015-07-07 22:41:12
【问题描述】：

我找到了这个的几个主题，我找到了这个解决方案：

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

这应该删除除'之外的所有标点符号，问题是它还会从句子中删除所有其他内容。

例子：

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'

当然我想要的是保持句子没有标点符号，并且“沃霍尔的”保持原样

期望的输出：

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

编辑：我也尝试过使用

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P')) 
sentence = sentence.translate(tbl)

但这会删除所有标点符号

【问题讨论】：

here 它说除了 ' 之外的所有标点符号
糟糕，你是对的；不是精通新的regex 模块结构。

标签： python regex unicode punctuation

【解决方案1】：

指定您不想删除的所有元素，即\w、\d、\s 等。这就是^ 运算符在方括号中的含义。（匹配除此之外的任何内容）

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>

【讨论】：

这适用于撇号，如何添加更多例外？像 - 或类似的东西？
只需将\- 添加到ur"..