【Posted】: 2019-09-12 20:06:20
【Question】:
I have data in the following form:
author text
0 garyvee A lot of people misunderstand Gary’s message o...
1 jasonfried "I can’t remember having a goal. An actual goa...
2 biz "Tools that can create media that looks and so...
I tried the following to clean the text:
text_data.loc[:,"text"] = text_data.text.apply(lambda x : str.lower(x))
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.findall('[\w]+',x)))
This produces output, but it still contains numbers, which I don't want in my text analysis:
0 a lot of people misunderstand gary s message o...
1 i can t remember having a goal an actual goal ...
2 tools that can create media that looks and sou...
Name: text, dtype: object
But when I try to remove the numbers from the text strings:
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))
I get this output:
0 a l o t o f p e o p l e m i s u n d e r s t a ...
1 i c a n t r e m e m b e r h a v i n g a g o a ...
2 t o o l s t h a t c a n c r e a t e m e d i a ...
Name: text, dtype: object
How can I avoid this? And how do I then apply CountVectorizer?
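A note on the likely cause, offered as a sketch rather than a definitive answer: `re.sub` returns a string, and `" ".join(...)` over a string iterates character by character, which would explain the spaced-out letters. In addition, `'^[0-9\.]*$'` is anchored at both ends, so it only matches when the entire string consists of digits and dots; on a full sentence it replaces nothing. One way around both problems is to tokenize first and filter per token. The helper name `clean_text` and the sample sentence below are hypothetical.

```python
import re


def clean_text(text):
    """Lowercase the text, keep word tokens, and drop purely numeric tokens."""
    # Tokenize first, so that join() later receives a list of tokens
    # instead of a string (joining a string iterates over its characters).
    tokens = re.findall(r"\w+", text.lower())
    # Apply the digit check to each token, not to the whole sentence.
    kept = [tok for tok in tokens if not re.fullmatch(r"\d+", tok)]
    return " ".join(kept)


# Made-up sample sentence, for illustration only.
print(clean_text("I can’t remember having a goal in 2019."))
```

With a DataFrame like the one in the question, the same function could be passed to `text_data.text.apply(clean_text)`. Note that mixed tokens such as `1st` are kept by this version, because only tokens made up entirely of digits are dropped.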
【Comments】:
-
Why are you using " ".join?
-
I removed it, but the text data still contains numbers, and now all the words are separate.
-
Is your regex correct? Check it by hand to make sure it matches what you expect.
-
'000', '100', '12', '16', '1st', '20', '200', '20s', '2nd', '30s', '3rd', '50', '5000', '503c', '52', '57', 'a12zracs8z': how do I remove tokens like these?
-
Oh, figured it out, np.
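The thread does not record what the asker's fix was. As one possible approach (an assumption, not the asker's confirmed solution), tokens such as `1st`, `5000`, or `a12zracs8z` can be excluded at vectorization time by giving `CountVectorizer` a `token_pattern` that only accepts alphabetic tokens. The two documents below are made up for illustration, and the choice of "letters only, at least two characters" is likewise an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents, for illustration only.
docs = [
    "a lot of people misunderstand the 1st message from 2019",
    "tools that can create media cost 5000 or a12zracs8z",
]

# Assumption: only purely alphabetic tokens of two or more letters are wanted.
# The word boundaries (\b) make any token containing a digit fail as a whole,
# so "1st" is not reduced to "st".
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b")
counts = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.shape)
```

Because the filtering happens inside the vectorizer, the `text` column would not need a separate digit-removal pass. The pattern above also drops non-ASCII letters, so it would need adjusting for text in other scripts.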
Tags: python pandas dataframe nlp