【Posted】: 2019-12-21 22:33:15
【Question】:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
headers = ['label', 'sms_message']
df = pd.read_csv('spam.csv', names=headers)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df.head(7))
print(df.shape)
count_vector = CountVectorizer()
#count_vector.fit(df)
y = count_vector.fit_transform(df)
count_vector.get_feature_names()
doc_array = y.toarray()
print(doc_array)
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names())
frequency_matrix
Sample data and output:
   label  sms_message
0      0  Go until jurong point, crazy.. Available only ...
1      0  Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
(5573, 2)
[[1 0]
 [0 1]]
   label  sms_message
0      1            0
1      0            1
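My question:
My csv file is basically many rows of SMS messages.
I don't understand why the output only covers the column labels instead of the full SMS text of each row.
Thanks for your help.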
【Discussion】:
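A likely cause, offered as a sketch rather than a confirmed diagnosis: iterating over a pandas DataFrame yields its column names, so passing the whole `df` to `fit_transform` gives `CountVectorizer` just two "documents", the strings 'label' and 'sms_message'. That would match the 2x2 matrix shown in the output. Passing the text column instead should give one row per message. The small in-memory frame below is an illustrative stand-in for spam.csv, not the real data, and the `hasattr` check is there because `get_feature_names()` was replaced by `get_feature_names_out()` in newer scikit-learn releases.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Tiny in-memory stand-in for spam.csv (illustrative rows, not the real data).
df = pd.DataFrame({
    "label": [0, 0, 1],
    "sms_message": [
        "Ok lar... Joking wif u oni...",
        "U dun say so early hor...",
        "Free entry in 2 a wkly comp to win",
    ],
})

# Iterating over a DataFrame yields its column names, not its rows,
# so a vectorizer given `df` sees only 'label' and 'sms_message'.
columns_seen = list(df)

# Passing the text column gives one document per SMS message.
count_vector = CountVectorizer()
doc_term = count_vector.fit_transform(df["sms_message"])

# Newer scikit-learn uses get_feature_names_out(); older versions use get_feature_names().
if hasattr(count_vector, "get_feature_names_out"):
    feature_names = list(count_vector.get_feature_names_out())
else:
    feature_names = list(count_vector.get_feature_names())

frequency_matrix = pd.DataFrame(doc_term.toarray(), columns=feature_names)
print(columns_seen)
print(frequency_matrix.shape)
```

In the original script the equivalent change would be `count_vector.fit_transform(df['sms_message'])`, assuming the column holds plain strings with no missing values.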
Tags: python scikit-learn