【Question title】: Python removing punctuation from unicode string except apostrophe
【Posted】: 2015-07-07 22:41:12
【Question】:

I found several threads on this topic, and they led me to this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

This should remove all punctuation except ', but the problem is that it also removes everything else from the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'

What I want, of course, is the sentence with its punctuation removed and "warhol's" left as it is.

Desired output:

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

Edit: I also tried using

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P')) 
sentence = sentence.translate(tbl)

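but this removes all punctuation, including the apostrophe.

【Comments】:

  • here it says all punctuation except '
  • Oops, you're right; I'm not well versed in the constructs of the newer regex module.

Tags: python regex unicode punctuation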
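A minimal sketch of how the translation table above could be adapted so that selected characters survive. It is written for Python 3, which is an assumption: the code in the question is Python 2, where `unichr` and `xrange` take the place of `chr` and `range`. The names `KEEP`, `TABLE` and `strip_punctuation` are hypothetical and chosen only for illustration.

```python
import sys
import unicodedata

# Hypothetical set of punctuation characters that should survive.
KEEP = {"'", "-"}

# Map every code point in a Unicode punctuation category (P*) to None,
# skipping the characters listed in KEEP.
TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}

def strip_punctuation(text):
    """Delete punctuation from text, except the characters in KEEP."""
    return text.translate(TABLE)

print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire!"))
```

Building the table scans the whole Unicode range once, so it is best created a single time and reused across calls.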


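【Solution 1】:

Specify all the elements you do not want removed, i.e. \w, \d, \s, and so on. That is what the ^ operator means inside square brackets: match anything except the listed characters.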

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>> 
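For readers on Python 3, the same idea might look like the sketch below. This is an adaptation under stated assumptions, not the answerer's code: the `ur""` prefix and the `print` statement do not exist in Python 3, `\d` is left out because `\w` already covers digits, and the name `NOT_KEPT` is hypothetical.

```python
import re

# Negated character class: one or more characters that are NOT
# a word character (\w), whitespace (\s), or an apostrophe.
NOT_KEPT = re.compile(r"[^\w\s']+")

sentence = ("warhol's art used many types of media, including hand drawing, "
            "painting, printmaking, photography, silk screening, sculpture, "
            "film, and music.")

cleaned = NOT_KEPT.sub("", sentence)
print(cleaned)
```

Two caveats follow from how the class is defined. `\w` also matches the underscore, so underscores are kept, and anything outside the class is removed, which includes symbols such as `+` or `$` as well as punctuation. In Python 3, `\w` is Unicode-aware for `str` patterns by default; in Python 2 the `re.UNICODE` flag would be needed for non-ASCII letters to count as word characters.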

【Comments】:

  • This works for the apostrophe; how do I add more exceptions, such as - or something similar?
  • Just add \- to the ur".. pattern.
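The suggestion in the last comment can be sketched as follows, again in Python 3 syntax as an assumption. The pattern name `KEEP_APOSTROPHE_AND_HYPHEN` is hypothetical. Here the hyphen is placed last in the class, where it is read literally; escaping it as `\-` anywhere in the class should behave the same way.

```python
import re

# Keep word characters, whitespace, apostrophes and hyphens; remove the rest.
# A hyphen at the very end of a character class is a literal hyphen,
# not a range operator.
KEEP_APOSTROPHE_AND_HYPHEN = re.compile(r"[^\w\s'-]+")

print(KEEP_APOSTROPHE_AND_HYPHEN.sub("", "austro-hungarian empire."))
print(KEEP_APOSTROPHE_AND_HYPHEN.sub("", "warhol's art, film, and music."))
```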