【Title】: python scikit-learn clustering with missing data
【Posted】: 2016-06-07 07:01:19
【Question】:

I want to cluster data with missing columns. Doing this by hand, I would compute the distance without the missing column whenever a column is missing.

With scikit-learn, missing data is not possible. There is also no way to specify a user-defined distance function.

Is there any chance to cluster with missing data?

Example data:

import numpy as np
from sklearn.datasets import make_swiss_roll

n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)

# knock out roughly 10% of the entries at random
rnd = np.random.rand(X.shape[0], X.shape[1])
X[rnd < 0.1] = np.nan
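
The limitation the question describes can be verified directly: scikit-learn estimators validate their input and raise a `ValueError` when it contains NaN. A minimal sketch (the cluster count of 6 is arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

n_samples = 1500
X, _ = make_swiss_roll(n_samples, noise=0.05)
rnd = np.random.rand(*X.shape)
X[rnd < 0.1] = np.nan

# scikit-learn validates its input and rejects NaNs outright
try:
    KMeans(n_clusters=6, n_init=10).fit(X)
    fit_succeeded = True
except ValueError as err:
    fit_succeeded = False
    print('KMeans refused the data:', err)
```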

【Discussion】:

  • I suppose you could handle the missing values by assigning them a specific value. Typically the median or the mean is taken as the replacement. It may seem odd, but it is actually quite standard. Would that be an acceptable solution?
  • I would like to avoid assigning e.g. a global mean, because that could corrupt the proper class assignment. In fact, I would like to use the clustering for imputation, i.e. assign the cluster mean to the missing values rather than the global mean.
  • How do you compute a distance to a missing value? A missing value could be anything, so the distance could be far off. You should impute the missing values, using the mean or the correlation with other variables.
  • Hmm... good question. I was thinking of computing a kind of normalized Gaussian distance, i.e. the sum of absolute component distances divided by the number of components. That could be done over all columns or only over the available ones. Is that a bad idea? I was thinking of e.g. a naive Bayes classifier, where I could also "skip" the missing columns.
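
The "skip the missing columns" idea from the last comment can be made concrete as a partial distance: compare only the components observed in both vectors, then rescale. This sketch is not from the post; the rescaling convention (inflating by the fraction of unusable components) is one common choice among several:

```python
import numpy as np

def partial_distance(a, b):
    """Euclidean-style distance over the components observed in both
    vectors, rescaled so vectors sharing few components are not
    artificially close. The scaling convention is a judgment call."""
    mask = np.isfinite(a) & np.isfinite(b)
    if not mask.any():
        return np.nan  # no shared information at all
    d2 = np.sum((a[mask] - b[mask]) ** 2)
    return np.sqrt(d2 * a.size / mask.sum())

x = np.array([1.0, np.nan, 3.0])
y = np.array([1.0, 5.0, np.nan])
print(partial_distance(x, y))  # → 0.0 (the only shared component agrees)
```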

Tags: python scikit-learn cluster-analysis missing-data


【Solution 1】:

I think you can use an iterative EM-type algorithm:

Initialize the missing values to their column means

Repeat until convergence:

  • Perform K-means clustering on the filled-in data

  • Set the missing values to the centroid coordinates of the clusters they were assigned to

Implementation

import numpy as np
from sklearn.cluster import KMeans

def kmeans_missing(X, n_clusters, max_iter=10):
    """Perform K-Means clustering on data with missing values.

    Args:
      X: An [n_samples, n_features] array of data to cluster.
      n_clusters: Number of clusters to form.
      max_iter: Maximum number of EM iterations to perform.

    Returns:
      labels: An [n_samples] vector of integer labels.
      centroids: An [n_clusters, n_features] array of cluster centroids.
      X_hat: Copy of X with the missing values filled in.
    """

    # Initialize missing values to their column means
    missing = ~np.isfinite(X)
    mu = np.nanmean(X, 0, keepdims=True)
    X_hat = np.where(missing, mu, X)

    for i in range(max_iter):
        if i > 0:
            # initialize KMeans with the previous set of centroids. this is much
            # faster and makes it easier to check convergence (since labels
            # won't be permuted on every iteration), but might be more prone to
            # getting stuck in local minima.
            cls = KMeans(n_clusters, init=prev_centroids, n_init=1)
        else:
            # do multiple random initializations
            cls = KMeans(n_clusters, n_init=10)

        # perform clustering on the filled-in data
        labels = cls.fit_predict(X_hat)
        centroids = cls.cluster_centers_

        # fill in the missing values based on their cluster centroids
        X_hat[missing] = centroids[labels][missing]

        # when the labels have stopped changing then we have converged
        if i > 0 and np.all(labels == prev_labels):
            break

        prev_labels = labels
        prev_centroids = cls.cluster_centers_

    return labels, centroids, X_hat

Example with fake data

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=3, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples=n_samples, n_features=n_features,
                                centers=n_clusters, random_state=gen)
    # with missing values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm


X, true_labels, Xm = make_fake_data(fraction_missing=0.3, n_clusters=5, seed=0)
labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)

# plot the inferred points, color-coded according to the true cluster labels
fig, ax = plt.subplots(1, 2, subplot_kw={'projection':'3d', 'aspect':'equal'})
ax[0].scatter3D(X[:, 0], X[:, 1], X[:, 2], c=true_labels, cmap='gist_rainbow')
ax[1].scatter3D(X_hat[:, 0], X_hat[:, 1], X_hat[:, 2], c=true_labels,
                cmap='gist_rainbow')
ax[0].set_title('Original data')
ax[1].set_title('Imputed (30% missing values)')
fig.tight_layout()

Benchmarking

To evaluate the performance of the algorithm, we can use the adjusted mutual information between the true and the inferred cluster labels. A score of 1 is perfect performance and 0 represents chance:

from sklearn.metrics import adjusted_mutual_info_score

fraction = np.arange(0.0, 1.0, 0.05)
n_repeat = 10
scores = np.empty((2, fraction.shape[0], n_repeat))
for i, frac in enumerate(fraction):
    for j in range(n_repeat):
        X, true_labels, Xm = make_fake_data(fraction_missing=frac, n_clusters=5)
        labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)
        any_missing = np.any(~np.isfinite(Xm), 1)
        scores[0, i, j] = adjusted_mutual_info_score(labels, true_labels)
        scores[1, i, j] = adjusted_mutual_info_score(labels[any_missing],
                                                     true_labels[any_missing])

fig, ax = plt.subplots(1, 1)
scores_all, scores_missing = scores
ax.errorbar(fraction * 100, scores_all.mean(-1),
            yerr=scores_all.std(-1), label='All labels')
ax.errorbar(fraction * 100, scores_missing.mean(-1),
            yerr=scores_missing.std(-1),
            label='Labels with missing values')
ax.set_xlabel('% missing values')
ax.set_ylabel('Adjusted mutual information')
ax.legend(loc='best', frameon=False)
ax.set_ylim(0, 1)
ax.set_xlim(-5, 100)

Update:

In fact, after a quick Google search, I found that the above is pretty much the same as the k-POD algorithm for K-means clustering of missing data.

【Discussion】:

  • OK, this seems very close to what I had (somewhat confusedly) in mind. Thanks, I will try this. And thanks for the pointer to the k-POD algorithm.
  • Is there a reason the two groups have flipped colors in the plots? Or is that by chance?
  • @zelite The colors are determined by the cluster labels, which are set in an arbitrary order. It would actually probably be clearer to use the same set of labels for both the original and the imputed data. I may change it if I get time later today.
  • @Cupitor That would be cheating :-). If I colored the imputed points according to labels_hat, the colors of the points within each blob would be guaranteed to be homogeneous. Moreover, since the labels of the inferred clusters are randomly initialized, the mapping between the "true" and the imputed cluster labels is arbitrary. For example, the top cluster might have label 3 in the original data but label 1 in the imputed data. That would cause the colors of the blobs to be shuffled at random, making the figure harder to interpret.
  • @Cupitor 1) Yes, KMeans parallelizes over the cluster initializations. If we explicitly set the initial cluster centroids, the n_jobs parameter does nothing. 2) My guess is you are probably just running out of memory. I'd have to dig into sklearn's source to be sure, but most k-means implementations use O(n + kd) memory, where n is the number of samples, k is the number of clusters to find, and d is the dimensionality of the feature space, so the memory requirement scales with the number of features.
【Solution 2】:

Here is a different algorithm that I use. Instead of replacing the missing values, they are ignored, and to capture the difference between missing and non-missing values I include missing-value dummy variables.

Compared to Ali's algorithm, observations with missing values seem to jump from class to class more easily, since I do not fill in the missing values.

Unfortunately, I did not have the time to compare it against Ali's nice code, but feel free to do so (I might when I get the time) and contribute to the discussion of the best method.

import numpy as np
class kmeans_missing(object):
    def __init__(self,potential_centroids,n_clusters):
        #initialize with potential centroids
        self.n_clusters=n_clusters
        self.potential_centroids=potential_centroids
    def fit(self,data,max_iter=10,number_of_runs=1):
        n_clusters=self.n_clusters
        potential_centroids=self.potential_centroids

        dist_mat=np.zeros((data.shape[0],n_clusters))
        all_centroids=np.zeros((n_clusters,data.shape[1],number_of_runs))

        costs=np.zeros((number_of_runs,))
        for k in range(number_of_runs):
            idx=np.random.choice(range(potential_centroids.shape[0]), size=(n_clusters), replace=False)
            centroids=potential_centroids[idx]
            clusters=np.zeros(data.shape[0])
            old_clusters=np.zeros(data.shape[0])
            for i in range(max_iter):
                #Calc dist to centroids
                for j in range(n_clusters):
                    dist_mat[:,j]=np.nansum((data-centroids[j])**2,axis=1)
                #Assign to clusters
                clusters=np.argmin(dist_mat,axis=1)
                #Update clusters
                for j in range(n_clusters):
                    centroids[j]=np.nanmean(data[clusters==j],axis=0)
                if all(np.equal(clusters, old_clusters)):
                    break  # break when there is no change in the cluster assignments
                if i == max_iter - 1:
                    print('no convergence before maximal iterations are reached')
                else:
                    clusters, old_clusters = old_clusters, clusters

            all_centroids[:,:,k]=centroids
            costs[k]=np.mean(np.min(dist_mat,axis=1))
        self.costs=costs
        self.cost=np.min(costs)
        self.best_model=np.argmin(costs)
        self.centroids=all_centroids[:,:,self.best_model]
        self.all_centroids=all_centroids
    def predict(self,data):
        dist_mat=np.zeros((data.shape[0],self.n_clusters))
        for j in range(self.n_clusters):
            dist_mat[:,j]=np.nansum((data-self.centroids[j])**2,axis=1)
        prediction=np.argmin(dist_mat,axis=1)
        cost=np.min(dist_mat,axis=1)
        return prediction,cost

Here is an example of how it might be used.

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from kmeans_missing import *  # the kmeans_missing class above, saved as kmeans_missing.py

def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=2, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples=n_samples, n_features=n_features,
                                centers=n_clusters, random_state=gen)
    # with missing values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm
X, true_labels, X_hat = make_fake_data(fraction_missing=0.3, n_clusters=3, seed=0)
X_missing_dummies=np.isnan(X_hat)
n_clusters=3
X_hat = np.concatenate((X_hat,X_missing_dummies),axis=1)
kmeans_m=kmeans_missing(X_hat,n_clusters)
kmeans_m.fit(X_hat,max_iter=100,number_of_runs=10)
print(kmeans_m.costs)
prediction,cost=kmeans_m.predict(X_hat)

for i in range(n_clusters):
    print([np.mean((prediction==i)*(true_labels==j)) for j in range(3)],np.mean((prediction==i)))

--EDIT--

In this example, the missing values occur completely at random, and in that case it is better not to add the missing-value dummies, since they are then just noise. Leaving them out is also the right thing to do when comparing against Ali's algorithm.
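
To make this point checkable, here is a rough, self-contained sketch. It does not use the class above; it mean-imputes and runs plain scikit-learn KMeans purely to show how the dummy columns are appended and how the two variants can be scored. The 30% missing rate and the blob parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

gen = np.random.RandomState(0)
X, true_labels = make_blobs(n_samples=1500, n_features=2, centers=3,
                            random_state=gen)
missing = gen.rand(*X.shape) < 0.3            # missing completely at random
Xm = np.where(missing, np.nan, X)

# mean-impute so that plain KMeans can run at all
X_imp = np.where(missing, np.nanmean(Xm, 0, keepdims=True), Xm)
X_dum = np.concatenate((X_imp, missing.astype(float)), axis=1)

scores = {}
for name, data in [('without dummies', X_imp), ('with dummies', X_dum)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    scores[name] = adjusted_mutual_info_score(true_labels, labels)
    print(name, scores[name])
```

Under this MCAR setup, the dummy columns carry no information about the clusters, so the "with dummies" score should not be expected to beat the plain one.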

【Discussion】:
