[Question title]: Python/Pandas is there a way to vectorize a comparison to all other points in an opposing category?
[Posted]: 2020-02-18 04:07:23
[Question]:

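I have a dataset of x,y points in two different categories. I want to group (or split) the points, about ten at a time, rather than iterate over the many "frames". I want to compare each point in category A with every point in category B. Specifically, I want the distance between them. I haven't found the right combination of groupby operations to vectorize this.

Here is a sample df: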


   frame_id point_id      x      y cat
0         1        1  1.769  2.491   A
1         1        2  1.024  0.981   A
2         1        3  4.327   9.81   A
3         1        4  5.407   4.33   A
4         1        5  0.936  0.019   B
5         1        6    5.1  7.639   B
6         1        7  9.139  6.721   B
7         1        8  1.954  5.424   B
8         2        1  5.835  9.702   A
9         2        2  1.784  1.374   A
10        2        3   0.23  1.921   A
11        2        4  9.328  5.836   A
12        2        5  5.516  8.971   B
13        2        6  9.108  8.917   B
14        2        7  4.412  1.033   B
15        2        8   1.33  5.898   B

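Ideally, in this example, I would add four columns, one for each distance to a point in the other category. I imagine there is some way to do df.groupby(['frame_id']) or df.groupby(['frame_id','cat']) and compare them that way; I just haven't figured it out yet.

I have been able to do this by iterating: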

import numpy as np
import scipy.spatial

# assumption: iterate over every distinct frame in df
frame_ids = df['frame_id'].unique()

for idx, fid in enumerate(frame_ids):

    # print progress every 1000 frames
    if idx % 1000 == 0:
        print(idx)

    # separate categories (column names match the sample df: x, y, cat)
    cat_a = df.loc[(df['frame_id'] == fid) & (df['cat'] == "A")]
    cat_b = df.loc[(df['frame_id'] == fid) & (df['cat'] == "B")]

    # get distance to every opposing category point
    a_mat = scipy.spatial.distance.cdist(cat_a[['x','y']], cat_b[['x','y']], metric='euclidean')
    b_mat = scipy.spatial.distance.cdist(cat_b[['x','y']], cat_a[['x','y']], metric='euclidean')

    a_ids = cat_a[['frame_id','point_id']].values
    b_ids = cat_b[['frame_id','point_id']].values

    a_dist = np.concatenate((a_ids, a_mat),axis=1)
    b_dist = np.concatenate((b_ids, b_mat),axis=1)


    ### then concat one by one w/ larger dataframe (takes forever) ###

Output (a few columns removed for clarity):

   frame_id point_id Dist_Opp1 Dist_Opp2 Dist_Opp3 Dist_Opp4
0         1        1   2.60858   6.13168   8.49763   2.93883
1         1        2  0.966017   7.80658   9.93986   4.53929
2         1        3   10.3616   2.30451   5.71815    4.9868
3         1        4   6.21084   3.32321   4.43223   3.62216
4         1        5   2.60858  0.966017   10.3616   6.21084
5         1        6   6.13168   7.80658   2.30451   3.32321
6         1        7   8.49763   9.93986   5.71815   4.43223
7         1        8   2.93883   4.53929    4.9868   3.62216
8         2        1  0.797573   3.36582   8.78502   5.89622
9         2        2   8.46417   10.5137   2.65003   4.54672
10        2        3    8.8116   11.3032   4.27524   4.12632
11        2        4   4.93554   3.08884   6.87284   7.99824
12        2        5  0.797573   8.46417    8.8116   4.93554
13        2        6   3.36582   10.5137   11.3032   3.08884
14        2        7   8.78502   2.65003   4.27524   6.87284
15        2        8   5.89622   4.54672   4.12632   7.99824

Points in the same category do not need to be compared.
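
As a point of comparison, here is a minimal sketch of the loop above that collects the per-frame results in a list and concatenates them once at the end, rather than growing a DataFrame frame by frame. It assumes the lowercase column names of the sample df. The names `chunks`, `dist`, and `result` are illustrative and not part of the original code, and only the first sample frame is included to keep it short.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# First frame of the sample data (assumption: columns are x, y, cat).
df = pd.DataFrame({
    "frame_id": [1] * 8,
    "point_id": list(range(1, 9)),
    "x": [1.769, 1.024, 4.327, 5.407, 0.936, 5.1, 9.139, 1.954],
    "y": [2.491, 0.981, 9.81, 4.33, 0.019, 7.639, 6.721, 5.424],
    "cat": ["A"] * 4 + ["B"] * 4,
})

chunks = []  # hypothetical name: one small DataFrame per category per frame
for fid, frame in df.groupby("frame_id"):
    cat_a = frame[frame["cat"] == "A"]
    cat_b = frame[frame["cat"] == "B"]

    # a_mat[i, j] is the distance from the i-th A point to the j-th B point.
    a_mat = cdist(cat_a[["x", "y"]].to_numpy(), cat_b[["x", "y"]].to_numpy())

    # The B-to-A distances are the transpose, so cdist is called only once.
    chunks.append(pd.DataFrame(a_mat, index=cat_a.index))
    chunks.append(pd.DataFrame(a_mat.T, index=cat_b.index))

# Concatenate once, then line the rows back up with the original index.
dist = pd.concat(chunks).sort_index()
dist.columns = [f"Dist_Opp{i + 1}" for i in dist.columns]
result = df[["frame_id", "point_id"]].join(dist)
print(result.round(5))
```

This still loops over frames in Python, so it is not vectorized; it only avoids the repeated concatenation that the last comment in the loop describes as slow.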

[Comments]:

Tags: python pandas numpy

[Solution 1]:

Finally figured it out. It just took some creative reshaping/repeating with numpy arrays.


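    import numpy as np

    # store each point as an (x, y) tuple
    df['loc'] = list(zip(df['x'],df['y']))
    # category labels match the sample df ('A' / 'B')
    groupA = df.loc[df['cat'] == 'A']
    groupB = df.loc[df['cat'] == 'B']

    groupA = groupA[['frame_id','point_id','loc']]
    groupB = groupB[['frame_id','point_id','loc']]

    acol = groupA['loc'].values
    bcol = groupB['loc'].values

    # assumes rows are sorted by frame and every frame has exactly
    # group_size points in each category
    group_size = 4
    # repeat each A point once per opposing B point
    acol = np.repeat(acol,group_size,axis=0)

    # repeat each frame's block of B points once per A point
    bcol = bcol.reshape(-1,group_size)
    bcol = np.repeat(bcol,group_size,axis=0)
    bcol = bcol.reshape(-1)

    # numpy requires replacing tuple with 2d point
    acol = np.array([*acol])
    bcol = np.array([*bcol])

    # distance calc: one value per (A point, B point) pair, ordered by A point
    desired_matrix = np.linalg.norm(acol - bcol, axis=-1)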
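
The same idea can probably be expressed with NumPy broadcasting instead of explicit repeats. The following is a sketch under the same assumption as above (every frame has exactly `group_size` points in each category); the names `out`, `values`, and `cols` are illustrative, and the data is the two-frame sample from the question.

```python
import numpy as np
import pandas as pd

# The two sample frames from the question.
df = pd.DataFrame({
    "frame_id": [1] * 8 + [2] * 8,
    "point_id": list(range(1, 9)) * 2,
    "x": [1.769, 1.024, 4.327, 5.407, 0.936, 5.1, 9.139, 1.954,
          5.835, 1.784, 0.23, 9.328, 5.516, 9.108, 4.412, 1.33],
    "y": [2.491, 0.981, 9.81, 4.33, 0.019, 7.639, 6.721, 5.424,
          9.702, 1.374, 1.921, 5.836, 8.971, 8.917, 1.033, 5.898],
    "cat": (["A"] * 4 + ["B"] * 4) * 2,
})

group_size = 4  # assumption: exactly 4 points per category in every frame

# Sort so each frame's points sit together, A before B, in point order.
out = df.sort_values(["frame_id", "cat", "point_id"]).reset_index(drop=True)
is_a = (out["cat"] == "A").to_numpy()

# One array per category, shaped (n_frames, group_size, 2).
a = out.loc[is_a, ["x", "y"]].to_numpy().reshape(-1, group_size, 2)
b = out.loc[~is_a, ["x", "y"]].to_numpy().reshape(-1, group_size, 2)

# Broadcasting: dist[f, i, j] = ||a[f, i] - b[f, j]|| within frame f.
dist = np.linalg.norm(a[:, :, None, :] - b[:, None, :, :], axis=-1)

# Rows for A points read across B; rows for B points read across A.
values = np.empty((len(out), group_size))
values[is_a] = dist.reshape(-1, group_size)
values[~is_a] = dist.transpose(0, 2, 1).reshape(-1, group_size)

cols = [f"Dist_Opp{i + 1}" for i in range(group_size)]
out = out.join(pd.DataFrame(values, columns=cols, index=out.index))
print(out[["frame_id", "point_id"] + cols].round(5))
```

Frames with unequal or varying group sizes would break the reshape, so in that case a per-frame approach is likely the safer choice.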

[Comments]: