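[Posted]: 2020-01-05 22:28:47
[Question]:
I am trying to compare the values associated with one index against the values associated with each of the other indexes, and compute a percentage match.
I have the following table: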
ColumnA ColumnB
TestA A
TestA B
TestA C
TestA D
TestB D
TestB E
TestC C
TestC B
TestC E
TestD A
Index TestA has values A,B,C,D. Compared to index TestB, which has values D,E, only 1 value matches out of a possible 5 (A,B,C,D,E). Hence the match is 20%.
Index TestA has values A,B,C,D. Compared to index TestC, which has values C,B,E, only 2 values match out of a possible 5 (A,B,C,D,E). Hence the match is 40%.
Index TestA has values A,B,C,D. Compared to index TestD, which has value A, only 1 value matches out of a possible 4 (A,B,C,D). Hence the match is 25%.
Index TestB has values D,E. Compared to index TestA, which has values A,B,C,D, only 1 value matches out of a possible 5 (A,B,C,D,E). Hence the match is 20%.
Index TestB has values D,E. Compared to index TestC, which has values C,B,E, only 1 value matches out of a possible 4 (B,C,D,E). Hence the match is 25%.
... and so on ...
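The rule used in these examples (values in common divided by distinct values across both indexes) can be sketched with Python sets. This is only a minimal illustration: the `groups` dictionary and the `match_percent` helper are hypothetical names, and the data is copied by hand from the table above.

```python
# Values per index, copied by hand from the table in the question.
groups = {
    "TestA": {"A", "B", "C", "D"},
    "TestB": {"D", "E"},
    "TestC": {"C", "B", "E"},
    "TestD": {"A"},
}


def match_percent(left, right):
    """Shared values divided by distinct values in both sets, as a percentage."""
    union = left | right
    if not union:
        # Two empty sets have nothing to compare; treat that as no match.
        return 0.0
    return 100 * len(left & right) / len(union)


print(match_percent(groups["TestA"], groups["TestB"]))  # 1 shared of 5 distinct
print(match_percent(groups["TestB"], groups["TestC"]))  # 1 shared of 4 distinct
```

This ratio of intersection size to union size is commonly called the Jaccard index, which may be a useful search term.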
The idea is to display the data in matrix format:
       TestA  TestB  TestC  TestD
TestA    100     20     40     25
TestB     20    100     25      0
TestC     40     25    100      0
TestD     25      0      0    100
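One possible way to get this layout, sketched under the assumption that the table is loaded as a DataFrame with columns named `ColumnA` and `ColumnB` (the sample rows are typed in by hand here rather than read from the Excel file), is to collect one set per index and fill a square DataFrame:

```python
import pandas as pd

# Sample rows typed in from the question; the real data comes from Test.xlsx.
df = pd.DataFrame({
    "ColumnA": ["TestA"] * 4 + ["TestB"] * 2 + ["TestC"] * 3 + ["TestD"],
    "ColumnB": ["A", "B", "C", "D", "D", "E", "C", "B", "E", "A"],
})

# One set of ColumnB values per ColumnA label.
groups = df.groupby("ColumnA")["ColumnB"].apply(set)
labels = sorted(groups.index)

# Percentage match for every ordered pair of labels.
rows = [
    [100 * len(groups[i] & groups[j]) / len(groups[i] | groups[j]) for j in labels]
    for i in labels
]
matrix = pd.DataFrame(rows, index=labels, columns=labels)

# Rounded to whole numbers only for display.
print(matrix.round().astype(int))
```

Because the result is an ordinary DataFrame, it could also be written back out with `matrix.to_excel(...)` if that is the desired destination.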
The basic code I wrote iterates over the values:
import time

import pandas as pd

start = time.process_time()
excel_file = 'Test.xlsx'
df = pd.read_excel(excel_file, sheet_name=1, index_col=0)

# Unique, sorted index labels (TestA, TestB, ...).
mylist = sorted(set(df.index))

for i in mylist:
    for j in mylist:
        # df.loc[[label]] always returns a DataFrame, even for a single row.
        # get_values() was removed in pandas 1.0, so to_numpy() is used instead.
        L1 = df.loc[[i]].to_numpy().ravel()
        L2 = df.loc[[j]].to_numpy().ravel()
        print(i, j)
        # L3 is the sorted union of the values from both indexes.
        L3 = []
        for m in L1:
            if m not in L3:
                L3.append(m)
        for n in L2:
            if n not in L3:
                L3.append(n)
        L3.sort()
        if i == j:
            print(len(L1) / len(L3) * 100)
        else:
            # Count the values that appear under both indexes.
            n = 0
            for k in L1:
                for l in L2:
                    if k == l:
                        n = n + 1
            print(n / len(L3) * 100)

print(time.process_time() - start)
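How do I calculate the percentage from here and display the data in the matrix format I want?
EDIT1: Updated the code, as I can now calculate the percentage. I am looking for a way to print this data in matrix format.
EDIT2: The original dataset has roughly 10k unique entries in column A and roughly 15k unique entries in column B. The total number of rows in the sheet is about 40. I am not sure whether this makes a difference; I just thought it would provide some context.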
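With that many unique entries, nested Python loops over every pair may become slow. One alternative, sketched here on the small sample table and not measured against the real data, is to build a 0/1 indicator matrix with `pd.crosstab` and get all intersection counts from a single matrix product. The column names `ColumnA` and `ColumnB` are assumed.

```python
import pandas as pd

# Sample rows typed in from the question; the real data comes from Test.xlsx.
df = pd.DataFrame({
    "ColumnA": ["TestA"] * 4 + ["TestB"] * 2 + ["TestC"] * 3 + ["TestD"],
    "ColumnB": ["A", "B", "C", "D", "D", "E", "C", "B", "E", "A"],
})

# Rows: ColumnA labels; columns: ColumnB values; 1 where the pair occurs.
indicator = (pd.crosstab(df["ColumnA"], df["ColumnB"]) > 0).astype(int)
m = indicator.to_numpy()

# shared[i, j] is the number of values that labels i and j have in common.
shared = m @ m.T

# Union size = size of i + size of j - shared count.
sizes = m.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - shared

percent = pd.DataFrame(
    100 * shared / union,
    index=indicator.index,
    columns=indicator.index,
)
print(percent.round().astype(int))
```

Every label that appears in the crosstab has at least one value, so the union is never zero. A dense indicator matrix grows with (unique A entries) x (unique B entries), so for very large inputs a sparse representation might be needed instead.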
[Discussion]:
Tags: python python-3.x pandas numpy matrix