【Title】: TSNE from sklearn with mahalanobis metric
【Posted】: 2019-01-16 21:58:39
【Description】:

Using sklearn's TSNE with the mahalanobis metric, I get the following error:

from sklearn.manifold import TSNE      
tsne = TSNE( verbose=1, perplexity=40, n_iter=250,learning_rate=50, random_state=0,metric='mahalanobis')
pt=data.sample(frac=0.1).values
tsne_results = tsne.fit_transform(pt)

ValueError: Must provide either V or VI for Mahalanobis distance

How can I provide the method_parameters for the Mahalanobis distance?
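For reference, the error comes from the mahalanobis metric itself, which needs the covariance matrix `V` or its inverse `VI`. Calling `pairwise_distances` directly with `VI` works, for example (a minimal sketch with random data, not the data from the question):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).randn(20, 3)
# mahalanobis needs the covariance matrix V or its inverse VI;
# pairwise_distances forwards extra keyword arguments to the metric
D = pairwise_distances(X, metric='mahalanobis',
                       VI=np.linalg.inv(np.cov(X.T)))
```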

【Comments】:

    Tags: python python-3.x scikit-learn


    【Solution 1】:

    There is indeed no option to define metric_params here, as there is in some other cases. Other classes based on pairwise distances provide a metric_params parameter to pass extra arguments on to the distance function, like this:

    metric_params : dict, optional (default = None)
    
        Additional keyword arguments for the metric function.
    

    This answer here shows how to use this parameter.
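For instance (a minimal sketch with random data, not from the original answer), NearestNeighbors forwards a metric_params dict to the distance function; BallTree's mahalanobis metric accepts V or VI there:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(50, 4)
# the mahalanobis covariance matrix V is passed via metric_params
nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree',
                      metric='mahalanobis',
                      metric_params={'V': np.cov(X.T)})
nn.fit(X)
dist, idx = nn.kneighbors(X[:1])  # query the first sample
```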

    But TSNE has no way to accept extra metric parameters. So for now, you need to extend the class and override __init__() to accept the parameters, and the _fit() method to actually use them.

    We can do it like this:

    from time import time
    import numpy as np
    import scipy.sparse as sp
    from sklearn.manifold import TSNE
    from sklearn.externals.six import string_types
    from sklearn.utils import check_array, check_random_state
    from sklearn.metrics.pairwise import pairwise_distances
    from sklearn.manifold.t_sne import _joint_probabilities, _joint_probabilities_nn
    from sklearn.neighbors import NearestNeighbors
    from sklearn.decomposition import PCA
    
    class MyTSNE(TSNE):
        def __init__(self, n_components=2, perplexity=30.0,
                     early_exaggeration=12.0, learning_rate=200.0, n_iter=1000,
                     n_iter_without_progress=300, min_grad_norm=1e-7,
                     metric="euclidean", metric_params=None, #<=ADDED
                     init="random", verbose=0,
                     random_state=None, method='barnes_hut', angle=0.5):
            self.n_components = n_components
            self.perplexity = perplexity
            self.early_exaggeration = early_exaggeration
            self.learning_rate = learning_rate
            self.n_iter = n_iter
            self.n_iter_without_progress = n_iter_without_progress
            self.min_grad_norm = min_grad_norm
            self.metric = metric
            self.metric_params = metric_params  #<=ADDED
            self.init = init
            self.verbose = verbose
            self.random_state = random_state
            self.method = method
            self.angle = angle
    
        def _fit(self, X, skip_num_points=0):
            if self.method not in ['barnes_hut', 'exact']:
                raise ValueError("'method' must be 'barnes_hut' or 'exact'")
            if self.angle < 0.0 or self.angle > 1.0:
                raise ValueError("'angle' must be between 0.0 - 1.0")
            if self.metric == "precomputed":
                if isinstance(self.init, string_types) and self.init == 'pca':
                    raise ValueError("The parameter init=\"pca\" cannot be "
                                     "used with metric=\"precomputed\".")
                if X.shape[0] != X.shape[1]:
                    raise ValueError("X should be a square distance matrix")
                if np.any(X < 0):
                    raise ValueError("All distances should be positive, the "
                                     "precomputed distances given as X is not "
                                     "correct")
            if self.method == 'barnes_hut' and sp.issparse(X):
                raise TypeError('A sparse matrix was passed, but dense '
                                'data is required for method="barnes_hut". Use '
                                'X.toarray() to convert to a dense numpy array if '
                                'the array is small enough for it to fit in '
                                'memory. Otherwise consider dimensionality '
                                'reduction techniques (e.g. TruncatedSVD)')
            else:
                X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
                                dtype=[np.float32, np.float64])
            if self.method == 'barnes_hut' and self.n_components > 3:
                raise ValueError("'n_components' should be inferior to 4 for the "
                                 "barnes_hut algorithm as it relies on "
                                 "quad-tree or oct-tree.")
            random_state = check_random_state(self.random_state)
    
            if self.early_exaggeration < 1.0:
                raise ValueError("early_exaggeration must be at least 1, but is {}"
                                 .format(self.early_exaggeration))
    
            if self.n_iter < 250:
                raise ValueError("n_iter should be at least 250")
    
            n_samples = X.shape[0]
    
            neighbors_nn = None
            if self.method == "exact":
                if self.metric == "precomputed":
                    distances = X
                else:
                    if self.verbose:
                        print("[t-SNE] Computing pairwise distances...")
    
                    if self.metric == "euclidean":
                        distances = pairwise_distances(X, metric=self.metric,
                                                       squared=True,
                                                       **(self.metric_params or {})) #<=ADDED (or {} guards against None)
                    else:
                        distances = pairwise_distances(X, metric=self.metric,
                                                       **(self.metric_params or {})) #<=ADDED (or {} guards against None)
    
                    if np.any(distances < 0):
                        raise ValueError("All distances should be positive, the "
                                         "metric given is not correct")
    
                P = _joint_probabilities(distances, self.perplexity, self.verbose)
                assert np.all(np.isfinite(P)), "All probabilities should be finite"
                assert np.all(P >= 0), "All probabilities should be non-negative"
                assert np.all(P <= 1), ("All probabilities should be less "
                                        "than or equal to one")
    
            else:
                k = min(n_samples - 1, int(3. * self.perplexity + 1))
    
                if self.verbose:
                    print("[t-SNE] Computing {} nearest neighbors...".format(k))
    
                knn = NearestNeighbors(algorithm='auto', n_neighbors=k,
                                       metric=self.metric, 
                                       metric_params = self.metric_params) #<=ADDED
                t0 = time()
                knn.fit(X)
                duration = time() - t0
                if self.verbose:
                    print("[t-SNE] Indexed {} samples in {:.3f}s...".format(
                        n_samples, duration))
    
                t0 = time()
                distances_nn, neighbors_nn = knn.kneighbors(
                    None, n_neighbors=k)
                duration = time() - t0
                if self.verbose:
                    print("[t-SNE] Computed neighbors for {} samples in {:.3f}s..."
                          .format(n_samples, duration))
    
                del knn
    
                if self.metric == "euclidean":
                    distances_nn **= 2
    
                P = _joint_probabilities_nn(distances_nn, neighbors_nn,
                                            self.perplexity, self.verbose)
    
            if isinstance(self.init, np.ndarray):
                X_embedded = self.init
            elif self.init == 'pca':
                pca = PCA(n_components=self.n_components, svd_solver='randomized',
                          random_state=random_state)
                X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
            elif self.init == 'random':
                X_embedded = 1e-4 * random_state.randn(
                    n_samples, self.n_components).astype(np.float32)
            else:
                raise ValueError("'init' must be 'pca', 'random', or "
                                 "a numpy array")
    
            degrees_of_freedom = max(self.n_components - 1.0, 1)
    
            return self._tsne(P, degrees_of_freedom, n_samples,
                              X_embedded=X_embedded,
                              neighbors=neighbors_nn,
                              skip_num_points=skip_num_points)
    

    I have marked my changes with (#<=ADDED) comments. Now you can use it like this:

    tsne = MyTSNE(verbose=1,perplexity=40,n_iter=250,learning_rate=50, random_state=0,
                  metric='mahalanobis', metric_params={'V': np.cov(X)})
    
    pt=data.sample(frac=0.1).values
    tsne_results = tsne.fit_transform(pt)
    

    Note: The other classes I mentioned at the top validate metric_params for valid arguments, which I have not done here, so make sure you pass the correct parameters in it, or it will throw an error.
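If subclassing feels too heavy, a common workaround (a sketch, not part of the original answer) is to precompute the distance matrix yourself and pass metric='precomputed'. As noted above, init='pca' is not allowed with precomputed distances, and recent sklearn versions require an explicit init:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
pt = rng.randn(100, 5)  # stand-in for data.sample(frac=0.1).values
# precompute the Mahalanobis distance matrix, then hand it to t-SNE
D = pairwise_distances(pt, metric='mahalanobis',
                       VI=np.linalg.inv(np.cov(pt.T)))
tsne = TSNE(metric='precomputed', init='random', random_state=0)
emb = tsne.fit_transform(D)  # shape (100, 2)
```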

    You should post this issue on the scikit-learn issues page on github.

    【Discussion】:

    • Awesome!!! Just one small comment: change the line to metric=self.metric, metric_params=self.metric_params) #
    • @Arman Ah, yes. I copy-pasted that from the pairwise_distances usage above, where the params are keyword arguments. I forgot that NearestNeighbors takes a dict, not kwargs. Thanks, changed it.
    • @Arman Yes, I saw it and commented with some information about the root cause of the issue.