【Title】: Tokenizing text by separating the punctuation from the words, but not abbreviations and apostrophes
【Posted】: 2018-02-27 06:03:30
【Question】:

I have input text that I want to tokenize by separating punctuation from words, while keeping abbreviations and apostrophized words intact. I am using Python with the nltk library, but I think my regular expression is wrong, because I still get incorrect output.

# coding: utf-8
from nltk.tokenize import regexp_tokenize

# The string literal must be joined across lines (implicit concatenation
# inside parentheses); as originally posted it was a SyntaxError.
text = ("\"Predictions suggesting that large changes in weight will "
        "accumulate indefinitely in response to small sustained lifestyle "
        "modifications rely on the half-century-old 3,500 calorie rule, which "
        "equates a weight alteration of 2.2 lb to a 3,500 calories cumulative "
        "deficit or increment,\" write the study authors Dr. Jampolis, Dr. "
        "Chaudry, and Prof. Harlen, from N.P.C Clinic in OH. The 3,500- calorie "
        "rule \"predicts that a person who increases daily energy expenditure by "
        "100 calories by walking 1 mile per day\" will lose 50 pounds over five "
        "years, the authors say. But the true weight loss is only about 10 "
        "pounds if calorie intake doesn't increase, \"because changes in mass "
        "... alter the energy requirements of the body’s make-up.\" \"This is a "
        "myth, strictly speaking, but the smaller amount of weight loss achieved "
        "with small changes is clinically significant and should not be "
        "discounted,\" says Dr. Melina Jampolis, CNN diet and fitness expert.")

print(regexp_tokenize(text, pattern=r'(?:(?!\d)\w)+|\S+'))
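For comparison, below is a minimal sketch using only the standard re module (so it runs without nltk; regexp_tokenize accepts the same kind of pattern). The pattern is an assumption about the desired behaviour: it keeps dotted acronyms such as N.P.C and apostrophized words such as doesn't as single tokens, treats a literal ... as one token, and splits all other punctuation off. Title abbreviations like "Dr." are not handled, because a trailing period is indistinguishable from a sentence-final period without more context.

```python
import re

# Hypothetical pattern -- alternatives are tried left to right:
#   (?:[A-Za-z]\.){2,}[A-Za-z]?  dotted acronyms: N.P.C, U.S.A.
#   \w+(?:'\w+)*                 words, keeping apostrophes: doesn't
#   \.{3}                        a literal ellipsis as one token
#   [^\w\s]                      any other punctuation char on its own
PATTERN = r"(?:[A-Za-z]\.){2,}[A-Za-z]?|\w+(?:'\w+)*|\.{3}|[^\w\s]"

def tokenize(text):
    """Split text into tokens, keeping acronyms and apostrophes intact."""
    return re.findall(PATTERN, text)

print(tokenize("The authors, from N.P.C Clinic, say it doesn't ... work."))
# → ['The', 'authors', ',', 'from', 'N.P.C', 'Clinic', ',', 'say',
#    'it', "doesn't", '...', 'work', '.']
```

The same PATTERN can be passed to regexp_tokenize(text, pattern=PATTERN) if you prefer to stay inside nltk.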

Thanks for your help.

【Discussion】:

  • It's not clear to me what your desired output is.
  • The desired output would be the tokenized text, but without splitting words that contain punctuation such as apostrophes ("doesn't" kept as one token) or abbreviations ("N.P.C" kept as one token).
  • So you basically just want to remove "/", "\", and quotation marks?
  • Yes, and "..." — sorry if this is trivial; I'm trying to learn how to use the nltk library and some small things confuse me.

Tags: python nltk tokenize


【Solution 1】:

This should do the trick. It makes sense here to simply use re.sub to replace any unwanted punctuation with the empty string ('').

import re

s = 'Insert your text here'

new = re.sub(r'(\"\\\")|(\\\")|[.]{3}|,', '', s)

print(new)

The tricky part of this regex is escaping all the backslashes. Breaking it down:

(\"\\\")

matches any "\"

(\\\")

matches any \"

[.]{3}

matches any ...

,

matches any ,

The pipe acts as an "or" operator. Hopefully this does everything you need.
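Note that once Python parses the original string literal, escape sequences like \" become plain " characters, so the backslash-matching groups above never fire against the parsed text. The sketch below applies the same re.sub idea directly to a parsed string (the sample sentence is made up for illustration):

```python
import re

# Illustrative sample; the quotes here are plain characters after parsing.
s = 'The rule "predicts ... a loss," the authors say.'

# Same idea as above, applied to the parsed string: drop double quotes,
# literal ellipses, and commas, then collapse any leftover double spaces.
cleaned = re.sub(r'\s+', ' ', re.sub(r'"|\.{3}|,', '', s))

print(cleaned)
# → The rule predicts a loss the authors say.
```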

【Discussion】:
