【Question title】: Clustering text data with Python's Scikit-Learn lib and plotting
【Posted】: 2019-12-25 16:12:14
【Question】:

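I am new to clustering and I am learning text clustering. I found a way to build the clusters, and now I am trying to find a way to plot them. This is the error I get when I try to plot the clusters: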

ValueError: setting an array element with a sequence.

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing'
     'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
x = cv.fit_transform(x)    

my_list = []

for i in range(1,8):

    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)

plt.plot(range(1,8),my_list)
plt.show()


kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()

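What am I doing wrong? I would like to see which sentences are grouped in each cluster; is it even possible to plot them this way? And how can I test the significance of the clusters that were found?

【Comments】:

Tags: python scikit-learn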


【Answer 1】:

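Initially, your observations are sentences. After you apply CountVectorizer to them, your observations are 62-dimensional vectors. The ValueError is raised by pyplot (and it is not clear to me what you want to plot, since your vectors are so high-dimensional).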
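One way to get a picture out of such high-dimensional vectors is to project them to two dimensions purely for display. The sketch below is not part of the original answer: it assumes TruncatedSVD as the projection (any 2D reduction could be substituted), adds the comma that appears to be missing after 'Springbreak was amazing', sets n_init=10 explicitly, and writes to a hypothetical file name clusters.png. It also sidesteps a likely cause of the error: x is a SciPy sparse matrix, and slices of it are not plain arrays that scatter can draw, whereas the projected points are a dense NumPy array.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Sentences from the question (comma after 'Springbreak was amazing' added as an assumption).
sentences = [
    'this is very good show', 'i had a great time on my school trip',
    'such a boring movie', 'Springbreak was amazing',
    'i love this product', 'this is an amazing item',
    'this food is delicious', 'I had a great time last night',
    'this is my favourite restaurant', 'i love this food, its so good',
    'skiing is the best sport', 'what is this',
    'I love basketball, its very dynamic', 'its a shame that you missed the trip',
    'game last night was amazing', 'such a nice song',
    'this is the best movie ever', 'hawaii is the best place for trip',
    'how that happened', 'I cant believe that you did that',
    'Why are you doing that, I do not gete it', 'this is tasty',
]

cv = CountVectorizer(stop_words='english')
x = cv.fit_transform(sentences)  # sparse matrix, one row per sentence

# Cluster in the full word-count space, as in the question.
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(x)

# Project to 2D for display only (assumed technique, not from the answer).
svd = TruncatedSVD(n_components=2, random_state=0)
points_2d = svd.fit_transform(x)                      # dense array, shape (n_sentences, 2)
centers_2d = svd.transform(kmeans.cluster_centers_)   # centroids in the same 2D space

fig, ax = plt.subplots()
for k in range(kmeans.n_clusters):
    mask = labels == k  # boolean mask selecting the sentences of cluster k
    ax.scatter(points_2d[mask, 0], points_2d[mask, 1], s=15, label=f'Cluster_{k + 1}')
ax.scatter(centers_2d[:, 0], centers_2d[:, 1], s=100, c='black', label='Centroids')
ax.legend()
fig.savefig('clusters.png')  # hypothetical output file name
```

Keep in mind that two components capture only part of the variation in the word counts, so clusters that are separate in the full space may overlap in the picture.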

As far as I can tell, your model will be overly sensitive to pronouns ("this", "that", etc.). Many models remove these and other stop words.
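To see what stop-word removal does to the vocabulary, the following minimal sketch fits CountVectorizer twice on two sentences taken from the question, once with default settings and once with stop_words='english'. The variable names plain and filtered are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['this is an amazing item', 'I cant believe that you did that']

# Default settings: every token of two or more characters becomes a feature.
plain = CountVectorizer()
plain.fit(sentences)

# Built-in English stop-word list: common function words are dropped.
filtered = CountVectorizer(stop_words='english')
filtered.fit(sentences)

print('without stop-word removal:', sorted(plain.vocabulary_))
print('with stop_words="english":', sorted(filtered.vocabulary_))
```

Words such as "this" and "that" appear in the first vocabulary but not in the second, so they can no longer pull unrelated sentences into the same cluster.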

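【Comments】:

  • Thanks for the answer about stop words. I would like to know whether it is possible to draw a plot like this that shows the clusters of sentences/words on a chart.
  • Your vector y_kmeans holds the cluster number of each of your sentences. You can use it to see which sentences are grouped together in each cluster.
  • How do I look at that?
  • So if I add stop_words = 'english', will it automatically remove words that have no "value"/"meaning"? I want to plot the groups of sentences from my clusters.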
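The suggestion in the comments to use y_kmeans can be sketched as follows. This is only an illustration: it uses a subset of the question's sentences and three clusters to keep the output short, sets n_init=10 explicitly, and the clusters dictionary is a hypothetical helper, not something scikit-learn provides.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Subset of the sentences from the question, chosen for brevity.
sentences = [
    'this food is delicious', 'i love this food, its so good',
    'this is my favourite restaurant', 'skiing is the best sport',
    'I love basketball, its very dynamic', 'game last night was amazing',
    'this is the best movie ever', 'such a boring movie',
]

x = CountVectorizer(stop_words='english').fit_transform(sentences)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x)  # y_kmeans[i] is the cluster number of sentences[i]

# Group the original sentences by their cluster number.
clusters = defaultdict(list)
for sentence, label in zip(sentences, y_kmeans):
    clusters[int(label)].append(sentence)

for label in sorted(clusters):
    print(f'Cluster_{label + 1}:')
    for sentence in clusters[label]:
        print('   ', sentence)
```

Because fit_predict returns the labels in the same order as the input rows, pairing them with the original list by position is enough; no lookup through the vectorizer is needed.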