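【Posted】:2018-02-27 06:03:30
【Question】:
I have an input text that I want to tokenize by separating punctuation from words, while taking abbreviations and apostrophes into account. I am using Python and the nltk library, but I think my regular expression is incorrect, because I still get the wrong output.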
# coding: utf-8
from nltk.tokenize import regexp_tokenize

# Adjacent string literals inside parentheses are joined into one string,
# so the long text can be wrapped without breaking the literal.
text = (
    "\"Predictions suggesting that large changes in weight will "
    "accumulate indefinitely in response to small sustained lifestyle "
    "modifications rely on the half-century-old 3,500 calorie rule, which "
    "equates a weight alteration of 2.2 lb to a 3,500 calories cumulative "
    "deficit or increment,\" write the study authors Dr. Jampolis, Dr. "
    "Chaudry, and Prof. Harlen, from N.P.C Clinic in OH. The 3,500- calorie "
    "rule \"predicts that a person who increases daily energy expenditure by "
    "100 calories by walking 1 mile per day\" will lose 50 pounds over five "
    "years, the authors say. But the true weight loss is only about 10 "
    "pounds if calorie intake doesn't increase, \"because changes in mass "
    "... alter the energy requirements of the body’s make-up.\" \"This is a "
    "myth, strictly speaking, but the smaller amount of weight loss achieved "
    "with small changes is clinically significant and should not be "
    "discounted,\" says Dr. Melina Jampolis, CNN diet and fitness expert."
)

# The pattern in question: a run of non-digit word characters,
# or else any run of non-whitespace characters.
print(regexp_tokenize(text, pattern=r'(?:(?!\d)\w)+|\S+'))
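To make the wrong output concrete, here is a minimal sketch that applies the same pattern to a short sample. The sample sentence is made up for illustration, and the sketch uses `re.findall` from the standard library instead of `regexp_tokenize`, on the assumption that for a pattern without capturing groups both return the same list of matches.

```python
import re

# The pattern from the question, written as a raw string.
pattern = r'(?:(?!\d)\w)+|\S+'

# Hypothetical short sample containing a title, a contraction, and a number.
sample = "Dr. Jampolis doesn't count 3,500 calories."

tokens = re.findall(pattern, sample)
print(tokens)
# ['Dr', '.', 'Jampolis', 'doesn', "'t", 'count', '3,500', 'calories', '.']
```

The first alternative stops at any character that is not a letter or underscore, so "Dr." and "doesn't" are each cut in two, while the number survives only because the `\S+` fallback picks it up whole.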
Thanks for your help.
【Comments】:
-
It is not clear to me what output you want.
-
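The desired output is the tokenized text, but without splitting words that contain punctuation such as an apostrophe (doesn't kept as one token) or abbreviations (N.P.C. kept as one token).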
-
So you basically just want to remove "/", "\", "" and the quotation marks?
-
Yes, "...". Sorry if this is trivial; I am trying to learn how to use the nltk library, and some small things confuse me.
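One possible way to get the behaviour described in the comments above is a pattern whose alternatives are ordered from most to least specific. This is only a sketch under stated assumptions: the name `TOKEN_RE`, the list of titles, the decision to keep hyphenated words together, and the sample sentence are all hypothetical, and it uses plain `re` so that it runs without nltk.

```python
import re

# Hypothetical pattern. Alternatives are tried left to right, so the more
# specific ones (abbreviations, titles, numbers) must come before plain words.
TOKEN_RE = re.compile(r"""
    (?:[A-Z]\.)+[A-Z]?           # dotted abbreviations such as N.P.C
  | (?:Dr|Prof|Mr|Mrs|Ms)\.      # a few titles; illustrative list, extend as needed
  | \d+(?:[.,]\d+)*              # numbers such as 3,500 or 2.2
  | \w+(?:['’\-]\w+)*            # words, keeping contractions and hyphenated words whole
  | \.\.\.                       # a three-dot ellipsis
  | [^\w\s]                      # any other single punctuation character
""", re.VERBOSE)

# Made-up sample that exercises each alternative.
sample = "Dr. Jampolis, from N.P.C Clinic in OH. It doesn't change ... 3,500 calories."

tokens = TOKEN_RE.findall(sample)
print(tokens)
# ['Dr.', 'Jampolis', ',', 'from', 'N.P.C', 'Clinic', 'in', 'OH', '.',
#  'It', "doesn't", 'change', '...', '3,500', 'calories', '.']
```

The abbreviation rule is deliberately simple and will also swallow the period after a single capital letter at the end of a sentence, so it would likely need tuning for real text. Quotation marks come out as separate one-character tokens and can be filtered afterwards if they are not wanted.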