【Question title】: Can I use K-means to predict a possible disease from a dataset?
【Posted】: 2021-05-20 11:43:31
【Question】:

I wrote the code below to predict a possible disease using k-means on a dataset with 3 parameters. Is this correct? It does not give me the accurate results I want.

import pandas as pd #importing library for reading dataset
from sklearn.cluster import KMeans #using ML library in python for utilizing kmeans


##reading the dataset from csv file and storing in variable called data..
data = pd.read_csv(r"C:\Users\Hassan Tariq\Disease Prediction\DataSet.csv")

##selecting data cols from dataset.
X_Data = data.iloc[:,[1]] #first col as a part of first variable
Y_Data = data.iloc[:,[2,3]] ##second col as a part of second variable
##i have used two cols in second variable because we cannot train kmeans on three parameters.


#initializing the model with 3 initial clusters.
model1 = KMeans(n_clusters=3, random_state=3)

#training model on the selected data..
prediction = model1.fit_predict(X_Data,Y_Data)

#printing the clusters prediction from the model.
print("Clustered Dataset: \n",prediction)

#printing the centroids which shows the data behavior in each cluster
print("Centroids of the clusters formed: \n",model1.cluster_centers_)

centeroids_collection = model1.cluster_centers_

#specifying the diseases which can be possible.
disease1 = ['Muscle Twitching','Nausea']
disease2 = ['Eye Irritation', 'Lung Irritation']
disease3 = ['Eye Irritation','Diarrhea']

 #loop for iterating all the data in the dataset to predict the disease..

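【Comments】:

K-Means is an unsupervised learning algorithm, so there is no "y" here (fit_predict accepts it only for API consistency; it is ignored). IMHO, you should study K-Means more thoroughly first.
So should I use a supervised learning algorithm instead?
I mean, my dataset has 3 parameters: pH, turbidity, and TDS. What I want is this: if pH > 11, TDS is 200-300, and turbidity is 400-500, then some disease should be predicted. That is what I actually want to build.
Without more information about your requirements and the dataset involved, your question cannot be answered, whatever the approach.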
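
For what it's worth, the rule described in the comments (pH above 11, TDS between 200 and 300, turbidity between 400 and 500) is a fixed threshold check, so it may not need a learned model at all. A minimal sketch, assuming the range bounds are inclusive and using made-up sample values; the function name `matches_rule` and the sample dictionaries are hypothetical and not part of the original post:

```python
# Hypothetical threshold rule based on the comment: pH > 11,
# TDS in 200-300, turbidity in 400-500 (bounds assumed inclusive).
def matches_rule(ph, tds, turbidity):
    """Return True when a water sample falls inside the described ranges."""
    return ph > 11 and 200 <= tds <= 300 and 400 <= turbidity <= 500

# Made-up samples for illustration only; not taken from the asker's dataset.
samples = [
    {"ph": 11.5, "tds": 250, "turbidity": 450},
    {"ph": 7.0, "tds": 250, "turbidity": 450},
    {"ph": 12.0, "tds": 350, "turbidity": 420},
]

# One boolean per sample: does it satisfy every condition of the rule?
flags = [matches_rule(s["ph"], s["tds"], s["turbidity"]) for s in samples]
print(flags)
```

Which disease a flagged sample should map to is not specified in the question, so the sketch only reports whether the rule fires.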

Tags: python machine-learning scikit-learn k-means

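【Solution 1】:

Do not hard-code the number of clusters. First use the Elbow method to estimate it, then fit the model with that number. That should make your predictions more accurate. Sample code for getting the clusters follows: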

X_std = StandardScaler().fit_transform(data)

Run the local implementation of kmeans; here we test with 3 clusters.

km = Kmeans(n_clusters=3, max_iter=100, random_state = 42)
km.fit(X_std)
centroids = km.centroids

labels_ is equivalent to calling fit(x) followed by predict.

labels_ = km.predict(X_std)
labels_
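
The snippet above does not show the Elbow step itself. Here is a rough sketch of it using scikit-learn's `KMeans` (not the local `Kmeans` class used above) and its `inertia_` attribute. The data is synthetic and only stands in for the three columns (pH, turbidity, TDS); the cluster centres and noise levels are assumptions made for illustration. Note that all three features are passed to the model together as a single `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three columns (pH, turbidity, TDS); not real data.
rng = np.random.default_rng(0)
centers = np.array([[7.0, 100.0, 150.0],
                    [9.0, 300.0, 250.0],
                    [11.5, 450.0, 400.0]])
X = np.vstack([c + rng.normal(scale=[0.2, 10.0, 10.0], size=(50, 3))
               for c in centers])

# Put the features on a common scale so no single column dominates distances.
X_std = StandardScaler().fit_transform(X)

# Elbow method: fit k-means for several k and record the inertia
# (within-cluster sum of squared distances) for each.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 2))

# After picking k where the curve flattens (3 for this synthetic data),
# fit once more and read one cluster label per row.
final_model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = final_model.fit_predict(X_std)
```

The labels are cluster indices only. K-means does not know which disease, if any, a cluster corresponds to, so that mapping would still have to come from domain knowledge or labelled data.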

【Comments】: