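【Posted】: 2023-03-12 07:01:01
【Question】:
I am trying to cluster text data that has already been cleaned, tokenized, and so on. How can I pass a similarity matrix to KMeans or another clustering model?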
from gensim import corpora
from gensim import models
from gensim.models import TfidfModel, Word2Vec, WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
from sklearn.cluster import KMeans
documents = list(data['clear_response'])
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in texts]
tfidf = TfidfModel(dictionary=dictionary)
# w2v_model is a Word2Vec model trained earlier (not shown here)
similarity_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf, nonzero_limit=100)
docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=30)
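# This call raises the TypeError shown below
model_sim = KMeans(n_clusters=10, init='k-means++').fit_predict(similarity_matrix)
# fit_predict returns the label array directly, so there is no .labels_ to read
clusters_sim = model_sim.tolist()
clusters_sim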
TypeError: float() argument must be a string or a number, not 'SparseTermSimilarityMatrix'
【Discussion】:
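The TypeError most likely comes from passing the `SparseTermSimilarityMatrix` object itself to KMeans. That object is a gensim wrapper around term-to-term similarities, not a numeric array with one row per document, so scikit-learn cannot convert it. What a clustering model needs is a documents-by-documents matrix. With gensim, this could probably be built from `docsim_index` (for example `np.array(docsim_index[bow_corpus])` with `num_best` left unset), but that is an assumption and is not tested here. The sketch below uses a small made-up similarity matrix in its place and shows two possible routes: `SpectralClustering` with `affinity="precomputed"`, which takes a similarity matrix directly, and KMeans with each row of similarities treated as a feature vector.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

# Toy stand-in for a documents-by-documents similarity matrix.
# The values are invented for illustration: documents 0-2 are similar
# to each other, and so are documents 3-5.
doc_sims = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.85, 0.0, 0.1, 0.0],
    [0.8, 0.85, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 1.0, 0.85],
    [0.1, 0.0, 0.1, 0.8, 0.85, 1.0],
])

# Route 1: spectral clustering accepts a precomputed, symmetric,
# non-negative similarity (affinity) matrix as input.
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(doc_sims)

# Route 2: KMeans expects feature vectors, so each document's row of
# similarities to all other documents is used as its feature vector.
kmeans_labels = KMeans(
    n_clusters=2, n_init=10, random_state=0
).fit_predict(doc_sims)

print(spectral_labels.tolist())
print(kmeans_labels.tolist())
```

Both routes should put documents 0-2 in one cluster and documents 3-5 in the other, although which cluster gets label 0 is arbitrary. For real data, `n_clusters=2` would be replaced with the desired number of clusters, such as the 10 used in the question.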
Tags: python cluster-analysis data-analysis