【Posted】: 2015-07-07 22:41:12
【Question】:
I found several threads about this, and one of them gave this solution:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
This is supposed to remove all punctuation except ', but the problem is that it also removes everything else from the sentence.
Example:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
What I actually want is the sentence with its punctuation removed, while "warhol's" stays as it is.
Desired output:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
Edit: I also tried using
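import sys
import unicodedata
# Map every code point in a Unicode punctuation category (P*) to None
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)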
but this removes all punctuation, the apostrophe included.
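For reference, the translate-table attempt can be narrowed so that it skips the characters to be kept. The following is only a minimal sketch, written for Python 3 (so `range`/`chr` instead of `xrange`/`unichr`); the names `KEEP` and `strip_punctuation` are hypothetical and not part of the original post. It uses only the standard library, so it does not depend on `\P{P}`, which the comments below suggest belongs to the separate `regex` module rather than `re`.

```python
import sys
import unicodedata

# Hypothetical set of characters to preserve even though Unicode
# classifies them as punctuation (apostrophe is Po, hyphen-minus is Pd).
KEEP = {"'", "-"}

# Map every punctuation code point (category P*) that is not in KEEP
# to None, which tells str.translate to delete that character.
tbl = {
    i: None
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P") and chr(i) not in KEEP
}


def strip_punctuation(sentence):
    """Remove punctuation from sentence, leaving apostrophes and hyphens."""
    return sentence.translate(tbl)


print(strip_punctuation("warhol's art used many types of media, including hand drawing."))
print(strip_punctuation("austro-hungarian empire"))
```

Building the table walks the whole Unicode range once, so it is worth creating it a single time and reusing it for every sentence.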
【Discussion】:
-
here it says: all punctuation except '
-
Oops, you're right; I'm not well versed in the constructs of the new regex module.
Tags: python regex unicode punctuation