Want to remove numbers from pandas dataframe and implement CountVectorizer

【Question title】: Want to remove numbers from pandas dataframe and implement CountVectorizer
【Posted】: 2019-09-12 20:06:20
【Question description】:

I have data in the following form:

    author  text
0   garyvee     A lot of people misunderstand Gary’s message o...
1   jasonfried  "I can’t remember having a goal. An actual goa...
2   biz         "Tools that can create media that looks and so...

I tried the following to clean the text:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : str.lower(x))
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.findall('[\w]+',x)))

I get the output below, but it still contains numbers, which I don't want in my text analysis:

0    a lot of people misunderstand gary s message o...
1    i can t remember having a goal an actual goal ...
2    tools that can create media that looks and sou...
Name: text, dtype: object

But when I try to remove the numbers from the text strings:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))

I get this output:

0    a l o t o f p e o p l e m i s u n d e r s t a ...
1    i c a n t r e m e m b e r h a v i n g a g o a ...
2    t o o l s t h a t c a n c r e a t e m e d i a ...
Name: text, dtype: object

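How can I avoid this? And how do I implement CountVectorizer?

【Comments】:

Why are you using " ".join?
I removed it, but the text data still contains numbers, although now the words are no longer split into characters.
Is your regex correct? Check by hand whether it matches what you expect.
'000', '100', '12', '16', '1st', '20', '200', '20s', '2nd', '30s', '3rd', '50', '5000', '503c', '52', '57', 'a12zracs8z': how do I remove these words?
Oh, figured it out, np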

Tags: python pandas dataframe nlp

【Solution 1】:

I actually made a mistake at this step:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))

It should be:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : re.sub('^[0-9\.]*$','',x))
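
The corrected line stops the character splitting because `" ".join(...)` iterates over whatever it receives: given a string rather than a list, it puts a space between every character. Note, though, that the pattern `^[0-9\.]*$` is anchored at both ends, so it only matches when the entire string consists of digits and dots; digits inside a longer sentence are left alone. Below is a minimal sketch of one way to drop such tokens, assuming the goal is to remove every word that contains a digit (like '100', '1st' or 'a12zracs8z' from the comments). The helper name `strip_digit_tokens` and the sample sentence are made up for illustration and are not from the original post.

```python
import re

# Hypothetical helper (not from the original post): lowercase the text,
# split it into word tokens, and drop every token that contains a digit.
def strip_digit_tokens(text):
    tokens = re.findall(r"\w+", text.lower())
    kept = [tok for tok in tokens if not re.search(r"\d", tok)]
    return " ".join(kept)  # join receives a list here, so words stay whole

# Made-up sample sentence for illustration.
sample = "In 2019 the 1st 100 users got a12zracs8z for free"

# Passing a str to join puts a space between every character.
spaced = " ".join("goal")

# The anchored pattern only matches a string made up entirely of digits
# and dots, so a normal sentence comes back unchanged.
unchanged = re.sub(r"^[0-9\.]*$", "", sample)

cleaned = strip_digit_tokens(sample)
print(spaced)
print(unchanged == sample)
print(cleaned)
```

With the dataframe from the question, the same helper could be applied as `text_data["text"] = text_data["text"].apply(strip_digit_tokens)`.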

【Comments】:
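
On the second half of the question (how to implement CountVectorizer), which the answer above does not cover: here is a rough sketch, assuming scikit-learn is installed. The three sample documents are made up and only loosely follow the truncated rows in the question. The custom `token_pattern` is one possible choice rather than the only one; it keeps purely alphabetic tokens of two or more letters, so tokens containing digits never enter the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents, only loosely based on the question's rows.
docs = [
    "a lot of people misunderstand the message from 2019",
    "i can t remember having a goal an actual goal",
    "tools that can create media in the 20s",
]

# [^\W\d_] matches a letter (a word character that is neither a digit
# nor an underscore). {2,} mirrors the default behaviour of ignoring
# single-character tokens. Tokens such as '2019' or '20s' do not match.
vectorizer = CountVectorizer(
    lowercase=True,
    token_pattern=r"(?u)\b[^\W\d_]{2,}\b",
)

# Sparse document-term matrix: one row per document, one column per word.
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index in the matrix.
vocab = sorted(vectorizer.vocabulary_)
goal_count = counts[1, vectorizer.vocabulary_["goal"]]

print(vocab)
print(counts.shape)
print(goal_count)
```

With the dataframe from the question, the text column could be passed in directly, e.g. `vectorizer.fit_transform(text_data["text"])`, which makes a separate digit-removal pass optional.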