【Title】: Python scikit-learn implementation of mutual information not working for partitions of different size
【Posted】: 2026-02-15 18:05:01
【Question】:

I want to compare two partitions/clusterings (P1 and P2) of a set S, where the partitions have different numbers of clusters. Example:

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]

From what I have read, mutual information could be a way to do this, and it is implemented in scikit-learn. Nothing in its definition says the partitions must have the same size (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).

However, when I try it in my code, I get an error because of the different sizes.

from sklearn import metrics
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
metrics.mutual_info_score(P1,P2)


ValueErrorTraceback (most recent call last)
<ipython-input-183-d5cb8d32ce7d> in <module>()
      2 P2 = [ [1,2,3,4], [5, 6]]
      3 
----> 4 metrics.mutual_info_score(P1,P2)

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency)
    556     """
    557     if contingency is None:
--> 558         labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    559         contingency = contingency_matrix(labels_true, labels_pred)
    560     contingency = np.array(contingency, dtype='float')

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred)
     34     if labels_true.ndim != 1:
     35         raise ValueError(
---> 36             "labels_true must be 1D: shape is %r" % (labels_true.shape,))
     37     if labels_pred.ndim != 1:
     38         raise ValueError(

ValueError: labels_true must be 1D: shape is (3, 2)

Is there a form in which I can use scikit-learn and mutual information to see how close these partitions are? If not, is there a way to do this without mutual information?

【Comments】:

    Tags: python scikit-learn


    【Solution 1】:

    The error comes from the form in which the information is passed to the function. The correct form is a list of labels, one per element of the global set being partitioned: in this case, one label for each element of S. Each label corresponds to the cluster that element belongs to, so elements with the same label are in the same cluster. Solving the example (a helper that builds these label lists automatically is sketched below):

    S = [1, 2, 3, 4, 5, 6]
    P1 = [[1, 2], [3,4], [5,6]]
    P2 = [ [1,2,3,4], [5, 6]]
    labs_1 = [ 1, 1, 2, 2, 3, 3]
    labs_2 = [1, 1, 1, 1, 2, 2]
    metrics.mutual_info_score(labs_1, labs_2)
    

    The answer is:

    0.636514168294813
    
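    Since mutual information only depends on which elements share a label (not on the label values themselves), the label lists can also be built automatically from the partition format. This is a minimal sketch, not part of scikit-learn; partition_to_labels is just an illustrative name:

    from sklearn import metrics

    def partition_to_labels(S, partition):
        '''For each element of S, return the index of the cluster that contains it.'''
        return [next(i for i, cluster in enumerate(partition) if x in cluster)
                for x in S]

    S = [1, 2, 3, 4, 5, 6]
    P1 = [[1, 2], [3, 4], [5, 6]]
    P2 = [[1, 2, 3, 4], [5, 6]]
    labs_1 = partition_to_labels(S, P1)  # [0, 0, 1, 1, 2, 2]
    labs_2 = partition_to_labels(S, P2)  # [0, 0, 0, 0, 1, 1]
    metrics.mutual_info_score(labs_1, labs_2)  # same value: only the grouping matters, not the label names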

    If we want to compute the mutual information directly from the partition format given originally, the following code can be used:

    from __future__ import division  # must come first: makes / true division on Python 2
    from sklearn import metrics
    import numpy as np
    
    S = [1, 2, 3, 4, 5, 6]
    P1 = [[1, 2], [3,4], [5,6]]
    P2 = [ [1,2,3,4], [5, 6]]
    set_partition1 = [set(p) for p in P1]
    set_partition2 = [set(p) for p in P2]
    
    def prob_dist(clustering, cluster, N):
        '''Probability that a random element of S falls in the given cluster.'''
        return len(clustering[cluster])/N
    
    def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N):
        '''
        N(int): total number of elements.
        clustering1(list): first partition
        clustering2(list): second partition
        cluster1(int): index of cluster of the first partition
        cluster2(int): index of cluster of second partition
        '''
        c1 = clustering1[cluster1]
        c2 = clustering2[cluster2]
        n_ij = len(set(c1).intersection(c2))
        return n_ij/N
    
    def mutual_info(clustering1, clustering2, N):
        '''
        clustering1(list): first partition
        clustering2(list): second partition
        Note for it to work division from  __future__ must be imported
        '''
        n_clas = len(clustering1)
        n_com = len(clustering2)
        mutual_info = 0
        for i in range(n_clas):
            for j in range(n_com):
                p_i = prob_dist(clustering1, i, N)
                p_j = prob_dist(clustering2, j, N)
                R_ij = prob_joint_dist(clustering1, clustering2, i, j, N)
                if R_ij > 0:
                    mutual_info += R_ij*np.log( R_ij / (p_i * p_j))
        return mutual_info
    
    mutual_info(set_partition1, set_partition2, len(S))
    

    which gives the same answer:

    0.63651416829481278
    

    Note that we are using the natural logarithm here, not log2; the code is easy to modify if another base is wanted.
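    For instance, to get the result in bits (log base 2), one simple adjustment (a sketch, assuming the mutual_info function and imports defined above) is to divide the natural-log result by np.log(2), or equivalently to replace np.log with np.log2 inside the loop:

    # Assumes mutual_info, set_partition1, set_partition2, S and np from the code above.
    mi_nats = mutual_info(set_partition1, set_partition2, len(S))  # ~0.6365 nats
    mi_bits = mi_nats / np.log(2)                                  # ~0.9183 bits

    # sklearn.metrics.mutual_info_score gave the same ~0.6365 above,
    # so it too reports the result in nats (natural logarithm).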

    【Discussion】:
