【Question title】: Implementing Bag of Words in scikit-learn
【Posted】: 2019-12-21 22:33:15
【Question】:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
headers = ['label', 'sms_message']
df = pd.read_csv ('spam.csv', names = headers)
df ['label'] = df['label'].map({'ham': 0, 'spam': 1})
print (df.head(7))
print (df.shape)
count_vector = CountVectorizer()
#count_vector.fit(df)
y = count_vector.fit_transform(df)
count_vector.get_feature_names()
doc_array = y.toarray()
print (doc_array)
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names())
frequency_matrix

Sample data and output:

   label                                        sms_message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...

(5573, 2)
[[1 0]
 [0 1]]

label   sms_message
0   1   0
1   0   1

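My question:

My CSV file is essentially many rows of SMS messages.

I don't understand why the output contains only the column labels rather than the text of each SMS message.

Thanks for your help.

【Question comments】:

    Tags: python scikit-learn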


    【Solution 1】:

    Pass only the sms_message column to the count vectorizer, as shown below.

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    docs = ['Tea is an aromatic beverage..',
            'After water, it is the most widely consumed drink in the world',
            'There are many different types of tea.',
            'Tea has a stimulating effect in humans.',
            'Tea originated in Southwest China during the Shang dynasty'] 
    
    df = pd.DataFrame({'sms_message': docs, 'label': np.random.choice([0, 1], size=5)})
    
    cv = CountVectorizer()
    counts = cv.fit_transform(df['sms_message'])
    
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    df_counts = pd.DataFrame(counts.A, columns=cv.get_feature_names_out())
    df_counts['label'] = df['label']
    

    Output:

    df_counts
    
    Out[26]: 
       after  an  are  aromatic  beverage  ...  types  water  widely  world  label
    0      0   1    0         1         1  ...      0      0       0      0      1
    1      1   0    0         0         0  ...      0      1       1      1      0
    2      0   0    1         0         0  ...      1      0       0      0      1
    3      0   0    0         0         0  ...      0      0       0      0      1
    4      0   0    0         0         0  ...      0      0       0      0      0
    
    [5 rows x 32 columns]
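
A minimal sketch of why the question's code produced only a 2 x 2 matrix. Iterating over a pandas DataFrame yields its column labels, so when the whole frame is passed to `CountVectorizer`, the vectorizer sees two "documents": the strings `label` and `sms_message`. The tiny frame and the names `cv_wrong`, `cv_right`, and `docs_seen` are made up for illustration and are not part of the original post.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row stand-in for spam.csv.
df = pd.DataFrame({
    "label": [0, 1],
    "sms_message": ["Ok lar joking wif u oni", "Free entry in a wkly comp"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
docs_seen = list(df)  # ['label', 'sms_message']

# Passing the whole frame: the two column names become the documents.
cv_wrong = CountVectorizer()
wrong = cv_wrong.fit_transform(df)
wrong_vocab = sorted(cv_wrong.vocabulary_)  # ['label', 'sms_message']

# Passing the text column: each message is one document.
cv_right = CountVectorizer()
right = cv_right.fit_transform(df["sms_message"])
right_vocab = sorted(cv_right.vocabulary_)

print(wrong.toarray())  # the same identity-like matrix seen in the question
print(right.shape)      # one row per message
```

The vocabulary is read from `vocabulary_` here so the sketch does not depend on which feature-name method the installed scikit-learn version provides.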
    

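    【Comments】:

    • First of all, thank you. Second: what does "df_counts = pd.DataFrame(counts.A" do? I'm asking about the "counts.A" part: what does it mean? Thx!
    • counts.A, or the equivalent counts.toarray(), returns a dense matrix of the term counts. Some algorithms (such as neural networks) need dense arrays, while others can work with sparse ones. In my answer, df_counts is only there so that you can inspect the output.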
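
The sparse-versus-dense distinction from the comment above can be sketched as follows. `fit_transform` returns a SciPy sparse matrix that stores only the non-zero counts, and `toarray()` expands it into an ordinary NumPy array. The two example sentences are invented for illustration; `toarray()` is used rather than the `.A` shorthand because it is the spelled-out form and is likely the safer one to rely on across SciPy versions.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, chosen so the counts are easy to check by hand.
docs = ["tea is a drink", "tea tea tea"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)   # SciPy sparse matrix

is_sparse = sparse.issparse(counts)
dense = counts.toarray()          # plain NumPy ndarray with explicit zeros

# Column index assigned to the token "tea" in the learned vocabulary.
tea_col = cv.vocabulary_["tea"]

print(is_sparse, type(dense).__name__)
print(counts.nnz, dense.size)     # stored non-zeros vs. total cells
print(dense[:, tea_col])          # counts of "tea" per document
```

On a real SMS corpus with thousands of distinct tokens, most cells are zero, so the dense array can be far larger in memory than the sparse matrix; converting is mainly useful for inspection or for estimators that require dense input.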
    【Solution 2】:

    Following @KRKirov's answer and passing only the 'sms_message' column to the count vectorizer, I edited my code and got the correct output:

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd
    import numpy as np
    
    headers = ['label', 'sms_message']
    df = pd.read_csv('spam.csv', names=headers)
    df['label'] = df['label'].map({'ham': 0, 'spam': 1})
    # Lowercase and strip punctuation; regex=True is required on pandas >= 2.0
    df['sms_message'] = df['sms_message'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
    
    count_vector = CountVectorizer()
    y = count_vector.fit_transform(df['sms_message'])
    doc_array = y.toarray()
    
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())
    frequency_matrix
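
A small sketch of what the lowercasing and punctuation-stripping step in this answer does, and how it interacts with `CountVectorizer`. With default settings the vectorizer already lowercases text and ignores punctuation when tokenizing, so the manual cleaning mostly matters for words with internal punctuation such as "Don't". The sample messages and the helper `vocabulary` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages; the second contains an apostrophe inside a word.
messages = pd.Series(["Free entry!! Call NOW...", "Don't stop"])

# Same cleaning as in the answer: lowercase, then drop every character
# that is neither a word character nor whitespace.
cleaned = messages.str.lower().str.replace(r"[^\w\s]", "", regex=True)

def vocabulary(texts):
    """Return the sorted vocabulary a default CountVectorizer learns."""
    cv = CountVectorizer()
    cv.fit(texts)
    return sorted(cv.vocabulary_)

raw_vocab = vocabulary(messages)   # "Don't" is split; only "don" is kept
clean_vocab = vocabulary(cleaned)  # "dont" survives as a single token

print(cleaned.tolist())
print(raw_vocab)
print(clean_vocab)
```

Whether "don" or "dont" is the better feature depends on the task; the sketch only shows that the two pipelines can produce slightly different vocabularies.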
    

    【Comments】:
