Pandas: compare index value to corresponding index value to find a percentage match (answers)

[Question title]: Pandas compare index value to corresponding index value to find a percentage match
[Posted]: 2020-01-05 22:28:47
[Question]:

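I am trying to compare the values associated with one index against the values associated with every other index, and to derive a percentage match.

I have the following table: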

 ColumnA    ColumnB
 TestA      A
 TestA      B
 TestA      C
 TestA      D
 TestB      D
 TestB      E
 TestC      C
 TestC      B
 TestC      E
 TestD      A


Index TestA has values A, B, C, D. Compared to index TestB, which has values D, E, only 1 value matches out of a possible 5 (A, B, C, D, E). Hence the match is 20%.

Index TestA has values A, B, C, D. Compared to index TestC, which has values C, B, E, only 2 values match out of a possible 5 (A, B, C, D, E). Hence the match is 40%.

Index TestA has values A, B, C, D. Compared to index TestD, which has value A, only 1 value matches out of a possible 4 (A, B, C, D). Hence the match is 25%.

Index TestB has values D, E. Compared to index TestA, which has values A, B, C, D, only 1 value matches out of a possible 5 (A, B, C, D, E). Hence the match is 20%.

Index TestB has values D, E. Compared to index TestC, which has values C, B, E, only 1 value matches out of a possible 4 (B, C, D, E). Hence the match is 25%.

...and so on.
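The percentage described above is the number of shared values divided by the number of distinct values across both groups (commonly called the Jaccard index). A minimal sketch with plain Python sets follows; the `groups` dict and the `match_pct` helper are hypothetical names introduced only for illustration:

```python
# Hypothetical mapping from each ColumnA label to its set of ColumnB values,
# copied from the sample table in the question.
groups = {
    "TestA": {"A", "B", "C", "D"},
    "TestB": {"D", "E"},
    "TestC": {"C", "B", "E"},
    "TestD": {"A"},
}


def match_pct(left, right):
    """Shared values divided by distinct values across both sets, as a percentage."""
    return 100 * len(left & right) / len(left | right)


print(match_pct(groups["TestA"], groups["TestB"]))  # 1 shared of 5 distinct
print(match_pct(groups["TestB"], groups["TestC"]))  # 1 shared of 4 distinct
```

Because intersection and union do not depend on argument order, the measure is symmetric, which is why the desired matrix mirrors across its diagonal.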

The idea is to display the data in matrix format:

       TestA    TestB   TestC   TestD
TestA   100     20      40       25
TestB   20      100     25       0
TestC   40      25      100      0
TestD   25      0       0       100

The basic code I wrote iterates over the values.

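import pandas as pd
from pyexcelerate import Workbook
import numpy as np
import time
start = time.process_time()
excel_file = 'Test.xlsx'
# ColumnA becomes the index; ColumnB is the only remaining column
df = pd.read_excel(excel_file, sheet_name=1, index_col=0)
# Index.get_values() was removed in pandas 1.0, so iterate the index directly
mylist = sorted(set(df.index))
for i in mylist:
    for j in mylist:
        # df.loc[[i]] always returns a DataFrame, even for a single row
        L1 = df.loc[[i]].iloc[:, 0].to_numpy()
        L2 = df.loc[[j]].iloc[:, 0].to_numpy()
        print(i, j)
        # L3 collects the distinct values seen in either group (the union)
        L3 = []
        for m in L1:
            if m not in L3:
                L3.append(m)
        for n in L2:
            if n not in L3:
                L3.append(n)
        L3.sort()
        if i == j:
            print(len(L1) / len(L3) * 100)
        else:
            # count the values that appear in both groups
            n = 0
            for k in L1:
                for l in L2:
                    if k == l:
                        n = n + 1
            print(n / len(L3) * 100)
print(time.process_time() - start)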

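How do I go from here to calculating the percentages and displaying the data in the matrix format I want?

EDIT1: Updated the code, since I can now calculate the percentage. I am looking for a way to print this data in matrix format.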

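EDIT2: The original dataset has roughly 10k-odd unique entries in column A and 15k-odd unique entries in column B. The total number of rows in the sheet is about 40k. Not sure whether that makes a difference; I just thought it would give some context.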
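To give a sense of scale for the figures in EDIT2, here is a rough sketch of how many group pairs a full matrix requires. The count of 10,000 groups is an approximation taken from the edit, and `n_groups` is a name introduced here for illustration:

```python
# Approximate number of unique ColumnA entries, per EDIT2 (an assumption).
n_groups = 10_000

# A full matrix has one cell per ordered pair of groups.
ordered_pairs = n_groups * n_groups

# The measure is symmetric, so the diagonal plus one triangle is enough.
unordered_pairs = n_groups * (n_groups + 1) // 2

print(ordered_pairs)    # 100000000
print(unordered_pairs)  # 50005000
```

At that size, any approach that does per-pair work in a Python loop is likely to be slow, whichever way the individual comparison is written.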

[Question comments]:

Tags: python python-3.x pandas numpy matrix

[Solution 1]:

Here is the solution I found.

I have named the DataFrame df:

  df
      ColumnA ColumnB
    0   TestA       A
    1   TestA       B
    2   TestA       C
    3   TestA       D
    4   TestB       D
    5   TestB       E
    6   TestC       C
    7   TestC       B
    8   TestC       E
    9   TestD       A

Code:

import pandas as pd

# unique ColumnA labels, in order of first appearance
names = df['ColumnA'].unique().tolist()
M = pd.DataFrame(columns=names, index=names)
# only the diagonal and upper triangle are computed; each result is mirrored
for i in range(len(names)):
    t1 = df.loc[df['ColumnA'].eq(names[i]), 'ColumnB']
    for k in range(i, len(names)):
        t2 = df.loc[df['ColumnA'].eq(names[k]), 'ColumnB']
        # shared values divided by the number of distinct values in both groups
        pct = 100 * t1.isin(t2).sum() / len(pd.concat([t1, t2]).drop_duplicates())
        M.iloc[i, k] = pct
        M.iloc[k, i] = pct

Output M:

       TestA  TestB  TestC  TestD
TestA    100     20     40   25.0
TestB     20    100     25    0.0
TestC     40     25    100    0.0
TestD     25      0      0  100.0

[Comments]:

The code definitely works for small datasets. I am trying to run it on a larger dataset to check for problems.

[Solution 2]:

You can use itertools to compute the product of all unique ColumnA groups, then calculate the percentage for each pair and build a new DataFrame:

from itertools import product

import numpy as np
import pandas as pd

# for each unique element in ColumnA, build the list of its ColumnB values
g = (
    df.groupby('ColumnA').ColumnB
    .apply(lambda x: x.values.tolist())
)

# generate every ordered pair of lists (Cartesian product of g with itself)
prod = list(product(g, repeat=2))

data = (
    # for each pair of lists, count the common elements, then divide by
    # the size of the union of the two lists. This gives the fraction.
    np.array([len(set(e[0]).intersection(e[1])) / len(set(e[0]).union(e[1])) for e in prod])
    .reshape(len(g), -1)
)

pd.DataFrame(data * 100, index=g.index.tolist(), columns=g.index.tolist())

        TestA   TestB   TestC   TestD
TestA   100.0   20.0    40.0    25.0
TestB   20.0    100.0   25.0    0.0
TestC   40.0    25.0    100.0   0.0
TestD   25.0    0.0     0.0     100.0
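If the per-pair Python loop becomes a bottleneck on larger inputs, one possible alternative (not part of the original answer) is to express the same calculation with matrix operations. The sketch below assumes the sample data from the question; `indicator`, `shared`, `sizes`, `union`, and `result` are names introduced here for illustration:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ColumnA": ["TestA"] * 4 + ["TestB"] * 2 + ["TestC"] * 3 + ["TestD"],
    "ColumnB": list("ABCD") + list("DE") + list("CBE") + ["A"],
})

# Rows are ColumnA groups, columns are ColumnB values. A cell is 1 when the
# group contains the value and 0 otherwise.
indicator = pd.crosstab(df["ColumnA"], df["ColumnB"]).gt(0).astype(int)

# The matrix product counts shared values for every pair of groups.
shared = indicator.dot(indicator.T)

# |A union B| = |A| + |B| - |A intersect B|, broadcast over all pairs.
sizes = indicator.sum(axis=1).to_numpy()
union = sizes[:, None] + sizes[None, :] - shared.to_numpy()

result = 100 * shared / union
print(result)
```

This builds a dense indicator matrix, so memory use grows with the number of groups times the number of distinct values. At the scale mentioned in the question, a sparse representation might be needed instead, which this sketch does not cover.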

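[Comments]:

The code definitely works for small datasets. I am trying to run it on a larger dataset to check for problems.
This will certainly be much faster than the nested for-loop solution. Let me know how it performs on your full dataset.
Agreed that this is faster than the loop solution. It still has not finished reading the 40k-odd records in my original Excel file (about 1 hour has passed). I will let you know how it goes.
This is great. I will accept the answer. Could I get a few lines of explanation of some of the more complex things done in the code, for my own learning?
No worries. I have just added some comments to the code. It will be easier to understand if you run the code line by line.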