Pandas distance matrix performance with vector data: answers

【Question title】: Pandas distance matrix performance with vector data
【Posted】: 2016-03-11 10:12:37
【Question】:

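I have found several threads about computing distance matrices efficiently, but they all work on int or float matrices. In my case I have to deal with vectors (an OrderedDict of frequencies), and so far I have only ended up with a very slow method that is not viable for a large DataFrame (300,000 x 300,000).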

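How can I make the process more efficient?

Thanks a lot for your help, this problem has been bugging me for a while :)

Consider a DataFrame df such as: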

>>> df
    vectors
id
1   {dict1}
2   {dict2}
3   {dict3}
4   {dict4}

where each {dict#} is

orderedDict{event1: 1,
            event2: 5,
            event3: 0,
            ...}

A function that returns the distance between two vectors:

def vectorDistance(a, b, df_vector):
    # Calculate distance between a & b
    # based on the vector from df_vector.
    return distance

[in]: vectorDistance({dict1}, {dict2})

[out]: distance

Desired output:

      1     2      3      4 
id
1     0   1<->2  1<->3  1<->4
2   1<->2   0     ...    ...
3   1<->3  ...     0     ...
4   1<->4  ...    ...     0

(where 1<->2 is the float distance between vectors 1 and 2)

Method used:

import pandas as pd

matrix = pd.concat([df, df.T], axis=1)

for index in matrix.index:
    for col in matrix.columns:
        matrix.loc[col, index] = vectorDistance(col, index, df)  # .ix was removed in pandas 1.0

>>> matrix
          5072142538    5072134420  4716823618   ...
udid            
5072142538  0.00000      0.01501       0.06002   ...
5072134420  0.01501      0.00000       0.09037   ...
4716823618  0.06002      0.09037       0.00000   ...
    ...        ...          ...          ...

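Edit:

Minimal example

Note: the events can differ from one {dict} to another, but that does not matter once they are passed to the function. My question is more about how to fill the cells quickly with the right a & b.

I'm working with cosine distance, since it works quite well with vectors like mine.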

from collections import Counter
from math import sqrt

import numpy as np  # needed by np.round in vectorDistance
import pandas as pd


raw_data = {'counters_': {4716823618: Counter({51811: 1, 51820: 1, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 8, 51853: 5, 51854: 4, 51856: 24, 51903: 11, 51904: 12, 51905: 3, 51906: 19, 51908: 230, 51922: 24, 51927: 19, 51931: 2, 106282: 9, 112830: 1, 119453: 1, 165062: 80, 168904: 3, 180354: 19, 180437: 33, 185824: 117, 186171: 14, 187101: 1, 190827: 7, 201629: 1, 209318: 37}), 5072134420: Counter({51811: 1, 51812: 1, 51820: 1, 51833: 56, 51835: 9, 51843: 49, 51848: 2, 51852: 11, 51853: 4, 51854: 4, 51856: 28, 51885: 1, 51903: 17, 51904: 17, 51905: 9, 51906: 14, 51908: 225, 51927: 29, 51931: 2, 106282: 19, 112830: 2, 168904: 9, 180354: 14, 185824: 219, 186171: 7, 187101: 1, 190827: 6, 201629: 2, 209318: 41}), 5072142538: Counter({51811: 4, 51812: 4, 51820: 4, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 6, 51853: 3, 51854: 3, 51856: 18, 51885: 1, 51903: 17, 51904: 16, 51905: 3, 51906: 24, 51908: 258, 51927: 20, 51931: 8, 106282: 16, 112830: 2, 168904: 3, 180354: 24, 185824: 180, 186171: 10, 187101: 1, 190827: 7, 201629: 2, 209318: 52})}}


def vectorDistance(index, col):
    a = dict(df[df.index == index]["counters_"].values[0])
    b = dict(df[df.index == col]["counters_"].values[0])
    return abs(np.round(1-(similarity(a,b)),5))

def scalar(collection): 
  total = 0 
  for coin, count in collection.items(): 
    total += count * count 
  return sqrt(total) 

def similarity(A,B): 
  total = 0 
  for kind in A:
    if kind in B: 
      total += A[kind] * B[kind] 
  return float(total) / (scalar(A) * scalar(B))

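df = pd.DataFrame(raw_data)
matrix = pd.concat([df, df.T], axis=1)
# Drop the helper "counters_" row and column so only the ids remain.
matrix.drop(index="counters_", inplace=True)
matrix.drop(columns="counters_", inplace=True)

for index in matrix.index:
    for col in matrix.columns:
        # .loc replaces .ix, which was removed in pandas 1.0.
        matrix.loc[col, index] = vectorDistance(col, index)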


matrix

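【Comments】:

Does every dictionary have the same events, or can they differ? You may also want to provide details of the vectorDistance function so that others can reproduce the results.
Hi @Alexander, sorry, I accidentally hit Enter. I've added the details you asked for to the question, since they're quite long :)
Roughly how many unique events are there? Just wondering whether it would be feasible to compute the distance between every pair and then look it up.
Around 1000, but let me build a minimal example file from real data; I'll add the link to the question, it shouldn't take too long.
@Alexander, there you go, I've added a minimal example with real values. I look forward to hearing from you if you find a solution.

Tags: performance pandas matrix vector distance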

【Solution 1】:

You don't want to store dictionaries in a dataframe. Use the from_dict method to read your data into a dataframe:

df = pd.DataFrame.from_dict(raw_data['counters_'],orient='index')

You can then apply numpy/scipy vectorized methods to compute the cosine similarity, as in What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
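
A minimal sketch of what that could look like (this is not the linked answer's code): assuming events missing from a vector count as 0 and no vector is all zeros, scale each row to unit length, and a single matrix product then gives every pairwise cosine similarity. The toy `counters` dict and the function name `cosine_distance_matrix` are made up for illustration.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for raw_data['counters_'].
counters = {
    "a": Counter({1: 1, 2: 2}),
    "b": Counter({1: 2, 2: 4}),  # same direction as "a"
    "c": Counter({3: 5}),        # shares no event with "a" or "b"
}


def cosine_distance_matrix(counters):
    # One row per id, one column per event; missing events become 0 (assumption).
    df = pd.DataFrame.from_dict(counters, orient="index").fillna(0)
    x = df.to_numpy(dtype=float)
    # Scale every row to unit length (assumes no row is all zeros).
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities.
    dist = 1.0 - unit @ unit.T
    # Remove floating-point noise: no negative distances, exact zeros on the diagonal.
    dist = np.clip(dist, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    return pd.DataFrame(dist, index=df.index, columns=df.index)


matrix = cosine_distance_matrix(counters)
```

Here `matrix.loc["a", "b"]` comes out at (numerically) zero because the two vectors point the same way, and `matrix.loc["a", "c"]` is 1 because they share no events. This dense approach only works while the whole n x n result fits in memory.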

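【Comments】:

Hi @maxymoo, I see what you're suggesting and it's a good idea, but the thread you pointed to is unresolved, or at least unclear to me. Would you mind adding an illustration of how to do it based on the minimal example? I'd really appreciate it :)

【Solution 2】:

This is definitely more efficient and easier to read than using for loops.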

df = pd.DataFrame([v for v in raw_data['counters_'].values()], 
                  index=raw_data['counters_'].keys()).T

>>> df.head()
       4716823618  5072134420  5072142538
51811           1           1           4
51812         NaN           1           4
51820           1           1           4
51833          56          56          56
51835           8           9           8

# raw_data no longer needed.  Delete to reduce memory footprint.
del raw_data  

# Create scalars.
scalars = ((df ** 2).sum()) ** .5

>>> scalars
4716823618    289.679133
5072134420    330.548030
5072142538    331.957829
dtype: float64

def v_dist(col_1, col_2):
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() / 
                (scalars.iloc[col_1] * scalars.iloc[col_2]))

>>> v_dist(0, 1)
0.09036665882900885

>>> v_dist(0, 2)
0.060016436804916085

>>> v_dist(1, 2)
0.015009898476505357

import numpy as np

# Start from an all-NaN square frame labelled by id on both axes.
m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)

>>> m
            4716823618  5072134420  5072142538
4716823618         NaN         NaN         NaN
5072134420         NaN         NaN         NaN
5072142538         NaN         NaN         NaN

for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1]
        if row == col:
            # No need to calculate value for diagonal.
            m.iat[row, col] = 0
        else:
            # Fill both triangles with one calculation, thanks to symmetry.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)

>>> m
            4716823618  5072134420  5072142538
4716823618    0.000000    0.090367    0.060016
5072134420    0.090367    0.000000    0.015010
5072142538    0.060016    0.015010    0.000000

Wrapping all of this into one function:

def calc_matrix(raw_data):
    # One column per id, one row per event (NaN where an event is missing).
    df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                      index=raw_data['counters_'].keys()).T
    # Euclidean norm of each column; sum() skips NaN by default.
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                # Fill both triangles at once: the distance is symmetric.
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
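
For the real data, note that a full 300,000 x 300,000 float64 matrix would need roughly 720 GB (300,000² x 8 bytes), so it cannot be held in memory in one piece however fast each cell is computed. One possible direction, sketched here under the assumption that the result can be consumed (or written to disk) one block of rows at a time, is to replace the Python double loop with a matrix product per block. `iter_distance_blocks` and `block_size` are hypothetical names, and the small array is made-up data, not the data from the question.

```python
import numpy as np


def iter_distance_blocks(x, block_size=1000):
    """Yield (start, block) pairs; block holds the cosine distances from
    rows start .. start + len(block) - 1 of x to every row of x."""
    x = np.asarray(x, dtype=float)
    # Normalize once up front (assumes no all-zero rows).
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    for start in range(0, unit.shape[0], block_size):
        # Similarities of this block of rows against all rows.
        sims = unit[start:start + block_size] @ unit.T
        # Clip tiny negative values caused by rounding.
        yield start, np.clip(1.0 - sims, 0.0, None)


# Made-up data: 4 vectors (rows) over 3 events (columns), missing events as 0.
x = np.array([[1, 2, 0], [2, 4, 0], [0, 0, 5], [1, 0, 1]])
blocks = [block for _, block in iter_distance_blocks(x, block_size=3)]
full = np.vstack(blocks)  # only sensible for small inputs
```

Stacking the blocks as in the last line is just for checking a small case. For the real size you would handle each block as it is yielded, for example keeping only the nearest neighbours per row.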

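【Comments】:

Great contribution, thanks @Alexander. It beats my implementation and is about 2x faster :)
Haha, sorry. :) Well, if you have a more optimized version, I'd gladly take it. This is clearly faster, but it still takes around 7 to 10 minutes to build the real-size dataframe.
I can't squeeze anything more out of it. That's the best I can do.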