【Question title】: Get values from k-means cluster after clustering
【Posted】: 2018-11-11 11:40:45
【Question】:

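I have a dataset on which I ran the K-means algorithm (scikit-learn), and I want to build a decision tree on each cluster. I can recover the feature values from each cluster, but not the "class" values (I am doing supervised learning: each element belongs to one of two classes, and I need the class value associated with each data point to build my trees).

For example, the unfiltered dataset: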

[val1 val2 class]
X_train=[val1 val2]
y_train=[class]

The clustering code looks like this:

X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
(X_train, X_test, y_train, y_test) = train_test_split(X, y,
        test_size=0.30)

kmeans = KMeans(n_clusters=3, n_init=5, max_iter=3000, random_state=1)
kmeans.fit(X_train, y_train)
y_pred = kmeans.predict(X_test)

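Here is my (incredibly clumsy!) code for extracting the values to build the trees. The problem is the Y values: they do not line up with the X values.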

cl={i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
for j in range(0,len(k_means_labels_unique)):
    Xc=None
    Y=None
    #for i in range(0,len(k_means_labels_unique)):
    indexes = cl.get(j,0)
    for i, row in X.iterrows():
        if i in indexes:
            if Xc is not None:
                Xc = np.vstack([Xc, [row['first occurrence of \'AB\''],row['similarity to \'AB\'']]])
            else:
                Xc = np.array([row['first occurrence of \'AB\''],row['similarity to \'AB\'']])
            if Y is not None:
                Y = np.vstack([Y, y[i]])
            else:
                Y = np.array(y[i])
    Xc = pd.DataFrame(data=Xc, index=range(0, len(X)),
                     columns=['first occurrence of \'AB\'',
        'similarity to \'AB\''])  # 1st row as the column names


    Y = pd.DataFrame(data=Y, index=range(0, len(Y)),columns=['Class'])


    print("\n\t-----Classifier ", j + 1,"----")

    (X_train, X_test, y_train, y_test) = train_test_split(X, Y,
        test_size=0.30)

    classifier = DecisionTreeClassifier(criterion='entropy',max_depth = 2)
    classifier = getResults(
        X_train,
        y_train,
        X_test,
        y_test,
        classifier,
        filename='classif'+str(3 + i),
        )

Any ideas (or a more efficient approach) for generating decision trees from the clustered data?

【Comments on the question】:

    Tags: python scikit-learn k-means decision-tree


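    【Solution 1】:

    I haven't read all of the code, but my guess is that passing an index vector to the train_test_split function will help you keep track of the samples.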

    X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
    y = clusterDF['Class']
    indices = clusterDF.index
    X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices)
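
To make the idea concrete, here is a minimal sketch of how the split indices could be combined with the cluster labels to get matching X and class values per cluster. It rests on assumptions that are not in the question: the data is a synthetic stand-in for clusterDF, the names clusters and trees are made up for illustration, and random_state is set on the split only for reproducibility. The key point is that kmeans.labels_ is ordered like the rows of X_train, so a boolean mask built from it can be applied to X_train and y_train together. The likely cause of the mismatch in the question is that those positions were compared against the index labels of the full X.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for clusterDF (hypothetical data, not the poster's).
rng = np.random.default_rng(0)
clusterDF = pd.DataFrame({
    "first occurrence of 'AB'": rng.random(200),
    "similarity to 'AB'": rng.random(200),
    "Class": rng.integers(0, 2, size=200),
})

X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains("'AB'")]]
y = clusterDF["Class"]
indices = clusterDF.index.to_numpy()

# Splitting the index vector alongside X and y keeps track of the samples.
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, indices, test_size=0.30, random_state=1
)

kmeans = KMeans(n_clusters=3, n_init=5, max_iter=3000, random_state=1)
kmeans.fit(X_train)  # KMeans ignores y, so only the features are passed

clusters = {}  # cluster id -> (features, class values)
trees = {}     # cluster id -> fitted decision tree
for j in range(kmeans.n_clusters):
    mask = kmeans.labels_ == j   # positional mask over the rows of X_train
    Xc = X_train[mask]           # features of the samples in cluster j
    yc = y_train[mask]           # class values of exactly the same rows
    # The original row labels of this cluster, recovered from the split.
    assert np.array_equal(Xc.index.to_numpy(), indices_train[mask])
    clusters[j] = (Xc, yc)
    trees[j] = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(Xc, yc)

# Route each test sample to the tree of its predicted cluster.
test_labels = kmeans.predict(X_test)
for j, tree in trees.items():
    m = test_labels == j
    if m.any():
        print("cluster", j, "accuracy:", tree.score(X_test[m], y_test[m]))
```

With pandas inputs, X_train.index already carries the original row labels, so the separate indices vector is mainly useful when X and y are plain NumPy arrays.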
    

    【Comments】:
