【Title】: python scikit-learn clustering with missing data
【Posted】: 2016-06-07 07:01:19
【Question】:

I want to cluster data with missing columns. Doing this by hand, I would compute the distance without the missing column whenever a column is missing.

With scikit-learn, missing data is not possible. There is also no way to specify a user-defined distance function.

Is there any chance to cluster with missing data?

Example data:

import numpy as np
from sklearn.datasets import make_swiss_roll

n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)

# knock out roughly 10% of the entries at random
rnd = np.random.rand(X.shape[0], X.shape[1])
X[rnd < 0.1] = np.nan
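
The limitation the question describes can be verified directly: scikit-learn estimators validate their input and raise a `ValueError` when it contains NaN. A minimal sketch (the cluster count of 6 is arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

n_samples = 1500
X, _ = make_swiss_roll(n_samples, noise=0.05)
rnd = np.random.rand(*X.shape)
X[rnd < 0.1] = np.nan

# scikit-learn validates its input and rejects NaNs outright
try:
    KMeans(n_clusters=6, n_init=10).fit(X)
    fit_succeeded = True
except ValueError as err:
    fit_succeeded = False
    print('KMeans refused the data:', err)
```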

【Discussion】:

  • I suppose you could handle the missing values by assigning them a specific value. Typically the median or the mean is taken as the replacement. It may seem odd, but it is actually quite standard. Would that be an acceptable solution?
  • I would like to avoid assigning e.g. a global mean, because that could corrupt the proper class assignment. In fact, I would like to use the clustering for imputation, i.e. assign the cluster mean to the missing values rather than the global mean.
  • How do you compute a distance to a missing value? A missing value could be anything, so the distance could be far off. You should impute the missing values, using the mean or the correlation with other variables.
  • Hmm... good question. I was thinking of computing a kind of normalized Gaussian distance, i.e. the sum of absolute component distances divided by the number of components. That could be done over all columns or only over the available ones. Is that a bad idea? I was thinking of e.g. a naive Bayes classifier, where I could also "skip" the missing columns.
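
The "skip the missing columns" idea from the last comment can be made concrete as a partial distance: compare only the components observed in both vectors, then rescale. This sketch is not from the post; the rescaling convention (inflating by the fraction of unusable components) is one common choice among several:

```python
import numpy as np

def partial_distance(a, b):
    """Euclidean-style distance over the components observed in both
    vectors, rescaled so vectors sharing few components are not
    artificially close. The scaling convention is a judgment call."""
    mask = np.isfinite(a) & np.isfinite(b)
    if not mask.any():
        return np.nan  # no shared information at all
    d2 = np.sum((a[mask] - b[mask]) ** 2)
    return np.sqrt(d2 * a.size / mask.sum())

x = np.array([1.0, np.nan, 3.0])
y = np.array([1.0, 5.0, np.nan])
print(partial_distance(x, y))  # → 0.0 (the only shared component agrees)
```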

Tags: python scikit-learn cluster-analysis missing-data


【Solution 1】:

I think you can use an iterative EM-type algorithm:

Initialize the missing values to their column means

Repeat until convergence:

  • Perform K-means clustering on the filled-in data

  • Set the missing values to the centroid coordinates of the clusters they were assigned to

Implementation

import numpy as np
from sklearn.cluster import KMeans

def kmeans_missing(X, n_clusters, max_iter=10):
    """Perform K-Means clustering on data with missing values.

    Args:
      X: An [n_samples, n_features] array of data to cluster.
      n_clusters: Number of clusters to form.
      max_iter: Maximum number of EM iterations to perform.

    Returns:
      labels: An [n_samples] vector of integer labels.
      centroids: An [n_clusters, n_features] array of cluster centroids.
      X_hat: Copy of X with the missing values filled in.
    """

    # Initialize missing values to their column means
    missing = ~np.isfinite(X)
    mu = np.nanmean(X, 0, keepdims=True)
    X_hat = np.where(missing, mu, X)

    for i in range(max_iter):
        if i > 0:
            # initialize KMeans with the previous set of centroids. this is much
            # faster and makes it easier to check convergence (since labels
            # won't be permuted on every iteration), but might be more prone to
            # getting stuck in local minima.
            cls = KMeans(n_clusters, init=prev_centroids, n_init=1)
        else:
            # do multiple random initializations
            cls = KMeans(n_clusters, n_init=10)

        # perform clustering on the filled-in data
        labels = cls.fit_predict(X_hat)
        centroids = cls.cluster_centers_

        # fill in the missing values based on their cluster centroids
        X_hat[missing] = centroids[labels][missing]

        # when the labels have stopped changing then we have converged
        if i > 0 and np.all(labels == prev_labels):
            break

        prev_labels = labels
        prev_centroids = cls.cluster_centers_

    return labels, centroids, X_hat

Example with fake data

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=3, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples=n_samples, n_features=n_features,
                                centers=n_clusters, random_state=gen)
    # with missing values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm


X, true_labels, Xm = make_fake_data(fraction_missing=0.3, n_clusters=5, seed=0)
labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)

# plot the inferred points, color-coded according to the true cluster labels
fig, ax = plt.subplots(1, 2, subplot_kw={'projection':'3d', 'aspect':'equal'})
ax[0].scatter3D(X[:, 0], X[:, 1], X[:, 2], c=true_labels, cmap='gist_rainbow')
ax[1].scatter3D(X_hat[:, 0], X_hat[:, 1], X_hat[:, 2], c=true_labels,
                cmap='gist_rainbow')
ax[0].set_title('Original data')
ax[1].set_title('Imputed (30% missing values)')
fig.tight_layout()

Benchmarking

To evaluate the performance of the algorithm, we can use the adjusted mutual information between the true and the inferred cluster labels. A score of 1 is perfect performance and 0 represents chance:

from sklearn.metrics import adjusted_mutual_info_score

fraction = np.arange(0.0, 1.0, 0.05)
n_repeat = 10
scores = np.empty((2, fraction.shape[0], n_repeat))
for i, frac in enumerate(fraction):
    for j in range(n_repeat):
        X, true_labels, Xm = make_fake_data(fraction_missing=frac, n_clusters=5)
        labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)
        any_missing = np.any(~np.isfinite(Xm), 1)
        scores[0, i, j] = adjusted_mutual_info_score(labels, true_labels)
        scores[1, i, j] = adjusted_mutual_info_score(labels[any_missing],
                                                     true_labels[any_missing])

fig, ax = plt.subplots(1, 1)
scores_all, scores_missing = scores
ax.errorbar(fraction * 100, scores_all.mean(-1),
            yerr=scores_all.std(-1), label='All labels')
ax.errorbar(fraction * 100, scores_missing.mean(-1),
            yerr=scores_missing.std(-1),
            label='Labels with missing values')
ax.set_xlabel('% missing values')
ax.set_ylabel('Adjusted mutual information')
ax.legend(loc='best', frameon=False)
ax.set_ylim(0, 1)
ax.set_xlim(-5, 100)

Update:

In fact, after a quick Google search, I found that the above is pretty much the same as the k-POD algorithm for K-means clustering of missing data.

【Discussion】:

  • OK, this seems very close to what I had (somewhat confusedly) in mind. Thanks, I will try this. And thanks for the pointer to the k-POD algorithm.
  • Is there a reason the two groups have flipped colors in the plots? Or is that by chance?
  • @zelite The colors are determined by the cluster labels, which are set in an arbitrary order. It would actually probably be clearer to use the same set of labels for both the original and the imputed data. I may change it if I get time later today.
  • @Cupitor That would be cheating :-). If I colored the imputed points according to labels_hat, the colors of the points within each blob would be guaranteed to be homogeneous. Moreover, since the labels of the inferred clusters are randomly initialized, the mapping between the "true" and the imputed cluster labels is arbitrary. For example, the top cluster might have label 3 in the original data but label 1 in the imputed data. That would cause the colors of the blobs to be shuffled at random, making the figure harder to interpret.
  • @Cupitor 1) Yes, KMeans parallelizes over the cluster initializations. If we explicitly set the initial cluster centroids, the n_jobs parameter does nothing. 2) My guess is you are probably just running out of memory. I'd have to dig into sklearn's source to be sure, but most k-means implementations use O(n + kd) memory, where n is the number of samples, k is the number of clusters to find, and d is the dimensionality of the feature space, so the memory requirement scales with the number of features.
【Solution 2】:

Here is a different algorithm that I use. Instead of replacing the missing values, they are ignored, and to capture the difference between missing and non-missing values I include missing-value dummy variables.

Compared to Ali's algorithm, observations with missing values seem to jump from class to class more easily, since I do not fill in the missing values.

Unfortunately, I did not have the time to compare it against Ali's nice code, but feel free to do so (I might when I get the time) and contribute to the discussion of the best method.

import numpy as np
class kmeans_missing(object):
    def __init__(self,potential_centroids,n_clusters):
        #initialize with potential centroids
        self.n_clusters=n_clusters
        self.potential_centroids=potential_centroids
    def fit(self,data,max_iter=10,number_of_runs=1):
        n_clusters=self.n_clusters
        potential_centroids=self.potential_centroids

        dist_mat=np.zeros((data.shape[0],n_clusters))
        all_centroids=np.zeros((n_clusters,data.shape[1],number_of_runs))

        costs=np.zeros((number_of_runs,))
        for k in range(number_of_runs):
            idx=np.random.choice(range(potential_centroids.shape[0]), size=(n_clusters), replace=False)
            centroids=potential_centroids[idx]
            clusters=np.zeros(data.shape[0])
            old_clusters=np.zeros(data.shape[0])
            for i in range(max_iter):
                #Calc dist to centroids
                for j in range(n_clusters):
                    dist_mat[:,j]=np.nansum((data-centroids[j])**2,axis=1)
                #Assign to clusters
                clusters=np.argmin(dist_mat,axis=1)
                #Update clusters
                for j in range(n_clusters):
                    centroids[j]=np.nanmean(data[clusters==j],axis=0)
                if all(np.equal(clusters, old_clusters)):
                    break  # break when there is no change in the cluster assignments
                if i == max_iter - 1:
                    print('no convergence before maximal iterations are reached')
                else:
                    clusters, old_clusters = old_clusters, clusters

            all_centroids[:,:,k]=centroids
            costs[k]=np.mean(np.min(dist_mat,axis=1))
        self.costs=costs
        self.cost=np.min(costs)
        self.best_model=np.argmin(costs)
        self.centroids=all_centroids[:,:,self.best_model]
        self.all_centroids=all_centroids
    def predict(self,data):
        dist_mat=np.zeros((data.shape[0],self.n_clusters))
        for j in range(self.n_clusters):
            dist_mat[:,j]=np.nansum((data-self.centroids[j])**2,axis=1)
        prediction=np.argmin(dist_mat,axis=1)
        cost=np.min(dist_mat,axis=1)
        return prediction,cost

Here is an example of how it might be used.

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from kmeans_missing import *  # the kmeans_missing class above, saved as kmeans_missing.py

def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=2, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples=n_samples, n_features=n_features,
                                centers=n_clusters, random_state=gen)
    # with missing values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm
X, true_labels, X_hat = make_fake_data(fraction_missing=0.3, n_clusters=3, seed=0)
X_missing_dummies=np.isnan(X_hat)
n_clusters=3
X_hat = np.concatenate((X_hat,X_missing_dummies),axis=1)
kmeans_m=kmeans_missing(X_hat,n_clusters)
kmeans_m.fit(X_hat,max_iter=100,number_of_runs=10)
print(kmeans_m.costs)
prediction,cost=kmeans_m.predict(X_hat)

for i in range(n_clusters):
    print([np.mean((prediction==i)*(true_labels==j)) for j in range(3)],np.mean((prediction==i)))

--EDIT--

In this example, the missing values occur completely at random, and in that case it is better not to add the missing-value dummies, since they are then just noise. Leaving them out is also the right thing to do when comparing against Ali's algorithm.
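
To make this point checkable, here is a rough, self-contained sketch. It does not use the class above; it mean-imputes and runs plain scikit-learn KMeans purely to show how the dummy columns are appended and how the two variants can be scored. The 30% missing rate and the blob parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

gen = np.random.RandomState(0)
X, true_labels = make_blobs(n_samples=1500, n_features=2, centers=3,
                            random_state=gen)
missing = gen.rand(*X.shape) < 0.3            # missing completely at random
Xm = np.where(missing, np.nan, X)

# mean-impute so that plain KMeans can run at all
X_imp = np.where(missing, np.nanmean(Xm, 0, keepdims=True), Xm)
X_dum = np.concatenate((X_imp, missing.astype(float)), axis=1)

scores = {}
for name, data in [('without dummies', X_imp), ('with dummies', X_dum)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    scores[name] = adjusted_mutual_info_score(true_labels, labels)
    print(name, scores[name])
```

Under this MCAR setup, the dummy columns carry no information about the clusters, so the "with dummies" score should not be expected to beat the plain one.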

【Discussion】:
