【Posted】: 2016-07-21 01:42:25
【Question】:
I tried to implement a regular-expression tokenizer with nltk in Python, but this is the result I get:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the result I want is:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why? What is wrong?
【Discussion】:
-
Try from nltk.tokenize import RegexpTokenizer, then tokenizer = RegexpTokenizer(pattern), then tokenizer.tokenize(text).
-
On my laptop it returns ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']. Could it be a version issue? (3.0.4)
-
I tried with Python 3.5, but the result is still: [('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
-
Aha, you should make all the capturing groups non-capturing:
([A-Z]\.)+ -> (?:[A-Z]\.)+, \w+(-\w+)* -> \w+(?:-\w+)*, and \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?
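
The effect of that last suggestion can be sketched with the standard re module alone. This is a minimal illustration, not NLTK's own code: it assumes the affected NLTK versions pass the pattern to findall() more or less unchanged, which would explain the tuples of group contents. The pattern is the one from the question with each (...) rewritten as (?:...); the names tokens, with_group and without_group are only for illustration.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Same alternatives as in the question, but every group is non-capturing,
# so findall() returns the whole match rather than the group contents.
pattern = r'''(?x)              # set flag to allow verbose regexps
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*              # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                    # ellipsis
    | [][.,;"'?():-_`]          # these are separate tokens; includes ], [
'''

tokens = re.findall(pattern, text)
print(tokens)

# Smaller demonstration of the underlying behaviour: when the pattern has a
# capturing group, findall() reports that group instead of the full match.
with_group = re.findall(r'\w+(-\w+)*', 'poster-print')
without_group = re.findall(r'\w+(?:-\w+)*', 'poster-print')
print(with_group)     # only the last repetition of the group
print(without_group)  # the full token
```

If the assumption about findall() holds, passing this non-capturing pattern to nltk.regexp_tokenize(text, pattern) should yield the same token list, whichever NLTK version is installed.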
Tags: python regex pattern-matching nltk