Joining Two Tables in Python Based on Condition [duplicate]: Answers

【Question title】: Joining Two Tables in Python Based on Condition [duplicate]
【Posted】: 2018-09-24 08:48:56
【Question】:

I have two tables in pandas:

df1: contains the User_ID and IP_Address of 150K users.

|---------------|---------------|  
|    User_ID    |   IP_Address  |
|---------------|---------------|  
|      U1       |   732758368.8 |
|      U2       |   350311387.9 |
|      U3       |   2621473820  |
|---------------|---------------|

df2: contains IP address ranges and the country each range belongs to, 139K records.

|---------------|-----------------|------------------|  
|    Country    | Lower_Bound_IP  |  Upper_Bound_IP  |
|---------------|-----------------|------------------|  
|   Australia   |   1023787008    |    1023791103    |
|   USA         |   3638734848    |    3638738943    |
|   Australia   |   3224798976    |    3224799231    |
|   Poland      |   1539721728    |    1539721983    |
|---------------|-----------------|------------------|

My goal is to create a Country column in df1 such that the IP_Address in df1 falls between the Lower_Bound_IP and Upper_Bound_IP of that country in df2.

|---------------|---------------|---------------|   
|    User_ID    |   IP_Address  |    Country    |
|---------------|---------------|---------------|   
|      U1       |   732758368.8 |   Indonesia   |
|      U2       |   350311387.9 |   Australia   |
|      U3       |   2621473820  |   Albania     |
|---------------|---------------|---------------|

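My first approach was to cross join the two tables (a Cartesian product) and then filter down to the relevant records. However, a cross join with pandas.merge() is not feasible, because it would create about 21 billion records (150K × 139K). The code crashes every time. Could you suggest a feasible alternative?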

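【Comments】:

Are the IP_Address ranges comprehensive? That is, are there IP_Address values in df1 for which you would expect Country to be empty?
@cmaher For now I am assuming the ranges are comprehensive, so no user will end up with an empty country.

Tags: python python-3.x pandas join merge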

【Solution 1】:

I'm not sure how to handle this with pandas.where, but with numpy.where you can do

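import numpy as np
ips = df1["IP_Address"].to_numpy()[:, None]  # column vector, one row per user
idx = np.where((ips >= df2["Lower_Bound_IP"].to_numpy()[None, :])
    & (ips <= df2["Upper_Bound_IP"].to_numpy()[None, :]))[1]
df1["Country"] = df2["Country"].to_numpy()[idx]  # positional, no index alignment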

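numpy.where returns the indices at which the given condition is true. & is the element-wise 'and', and the [:, None] part adds a dummy axis at the position of the None. This ensures that every User_ID is compared against every row of df2, so you find the df2 index whose range contains that user's IP_Address. The [1] selects the df2 indices where the condition is True. This will break if the ranges in df2 overlap, because an IP would then match more than one range and idx would no longer contain exactly one entry per user.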
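To make the broadcasting step concrete, here is a small sketch on made-up data. The three users, the three ranges, and the country names A, B, C are invented for illustration and do not come from the question. It assumes the ranges do not overlap and that every IP falls in exactly one range.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df1 = pd.DataFrame({"User_ID": ["U1", "U2", "U3"],
                    "IP_Address": [15.0, 250.0, 120.0]})
df2 = pd.DataFrame({"Country": ["A", "B", "C"],
                    "Lower_Bound_IP": [0, 100, 200],
                    "Upper_Bound_IP": [99, 199, 299]})

ips = df1["IP_Address"].to_numpy()[:, None]        # shape (3, 1)
lower = df2["Lower_Bound_IP"].to_numpy()[None, :]  # shape (1, 3)
upper = df2["Upper_Bound_IP"].to_numpy()[None, :]  # shape (1, 3)

# Broadcasting compares every IP with every range: shape (3, 3).
mask = (ips >= lower) & (ips <= upper)

# np.where returns (row indices, column indices) of the True cells,
# ordered by row, so cols[k] is the matching df2 row for the k-th user.
rows, cols = np.where(mask)

df1["Country"] = df2["Country"].to_numpy()[cols]
print(df1)
```

The mask has one row per user and one column per range, which is why its size grows with the product of the two table lengths.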

This may still run into memory problems, but you can add a loop so that the comparison is done in batches. For example

import numpy as np

batch_size = 1000
n_batches = df1.shape[0] // batch_size
# Integer division rounds down, so if the number
# of User_IDs is not divisible by batch_size,
# we need to add 1 to n_batches
if n_batches * batch_size < df1.shape[0]:
    n_batches += 1
ips = df1["IP_Address"].to_numpy()
lower = df2["Lower_Bound_IP"].to_numpy()[None, :]
upper = df2["Upper_Bound_IP"].to_numpy()[None, :]
indices = []
for i in range(n_batches):
    # Compare one slice of users against all ranges at a time
    batch = ips[i*batch_size:(i+1)*batch_size, None]
    idx = np.where((batch >= lower) & (batch <= upper))[1]
    indices.extend(idx.tolist())

df1["Country"] = df2["Country"].to_numpy()[np.asarray(indices)]
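The batched loop relies on every IP matching exactly one range. As a hedged variant, the sketch below wraps the same comparison in a helper that leaves None where no range matches. The function name lookup_country_batched and the toy data are hypothetical, and the ranges are still assumed not to overlap.

```python
import numpy as np
import pandas as pd


def lookup_country_batched(df1, df2, batch_size=1000):
    """Hypothetical helper: one country per df1 row, None if no range matches.

    Assumes the ranges in df2 do not overlap.
    """
    ips = df1["IP_Address"].to_numpy()
    lower = df2["Lower_Bound_IP"].to_numpy()[None, :]
    upper = df2["Upper_Bound_IP"].to_numpy()[None, :]
    countries = df2["Country"].to_numpy()

    result = np.full(len(ips), None, dtype=object)
    # Stepping by batch_size also covers a final, shorter batch.
    for start in range(0, len(ips), batch_size):
        batch = ips[start:start + batch_size, None]
        rows, cols = np.where((batch >= lower) & (batch <= upper))
        # rows are positions inside the batch; shift them to positions in df1.
        result[start + rows] = countries[cols]
    return result


# Toy data, invented for illustration; 999.0 lies outside every range.
users = pd.DataFrame({"User_ID": ["U1", "U2", "U3"],
                      "IP_Address": [15.0, 999.0, 120.0]})
ranges = pd.DataFrame({"Country": ["A", "B"],
                       "Lower_Bound_IP": [0, 100],
                       "Upper_Bound_IP": [99, 199]})
out = lookup_country_batched(users, ranges, batch_size=2)
users["Country"] = out
```

For a sense of scale: with batch_size = 1000 and the 139K ranges from the question, each intermediate boolean mask holds 1000 × 139,000 one-byte elements, roughly 139 MB, whereas a single mask for all 150K users would hold about 21 billion.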

【Comments】:

It works like a charm. Thank you so much. Computing in batches really helped me; it greatly reduced the memory load.