Improve efficiency of iloc - merging two tables with startswith()

【Question title】: Improve efficiency of iloc - merging two tables with startswith()
【Posted】: 2021-03-21 12:24:48
【Question description】:

I am trying to merge two tables, for example: df1:

ID
A1A1A1
A1A1A2
B1B1B1

df2:

ID	Country
A1A1A1	France
B	Egypt
C1C	Egypt

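In table 2, ID = B means that every ID starting with B has the same country. The same applies to ID = C1C. Therefore I can't use pd.merge, because I can't merge on an exact match. I wrote some code that seems to work (checked with a debugger), but it is very slow, so I'm looking for a faster solution. My df1 has ~80K rows and df2 has ~7K rows.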

Expected output: df1:

ID	Country
A1A1A1	France
A1A1A2	nan
B1B1B1	Egypt

Here is what I did:

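# Slow: len(df1) * len(df2) Python-level string comparisons
df1['Country'] = None  # the target column must exist before assigning into it
country_pos = df1.columns.get_loc('Country')
for i in range(len(df2)):
    for j in range(len(df1)):
        if df1['ID'].iloc[j].startswith(df2['ID'].iloc[i]):
            # one positional assignment instead of chained df1['Country'].iloc[j]
            df1.iloc[j, country_pos] = df2['Country'].iloc[i]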

Thanks!

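【Comments】:

Please post your expected output.
Can you build a complete set of key/value pairs in your df2? If that is possible, you can simply use merge(). If it is not, do your ID strings have a maximum length? Also, what happens if ID "B1" maps to "USA"? Is the expected output for ID "B12" "Egypt" because it starts with "B", or "USA" because it starts with "B1"?
Can you describe the ID column a bit more? Is there a fixed length/composition? For example, do the IDs always start with a letter and end with a digit?
@JasonCook In df1, all IDs have 6 characters. In df2, an ID can be anywhere from 1 to 6 characters long. Apart from the length, the ID format is the same in both dfs. The format is: letter-digit-letter-digit-letter-digit
@above_c_level The data is clean: if df2 has ID = 'B' and Country = 'Egypt', then all IDs starting with B map to Egypt. There won't be any ambiguity, but to make the code robust I would want an error message. However, the ID columns are not identical, so as far as I know I cannot use the merge() function.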
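
The ambiguity raised in the comments ("B" vs. "B1") can be checked up front. The following is only a minimal sketch, not something posted in the thread: the helper name `find_conflicting_prefixes` is hypothetical, and it assumes that any df2 key being a proper prefix of another df2 key should count as an error.

```python
import pandas as pd


def find_conflicting_prefixes(prefixes):
    """Return (shorter, longer) pairs where one key is a proper prefix of another."""
    keys = set(prefixes)
    conflicts = []
    for key in sorted(keys):
        # test every proper prefix of this key against the full key set
        for length in range(1, len(key)):
            if key[:length] in keys:
                conflicts.append((key[:length], key))
    return conflicts


df2 = pd.DataFrame({'ID': ['A1A1A1', 'B', 'C1C'],
                    'Country': ['France', 'Egypt', 'Egypt']})

conflicts = find_conflicting_prefixes(df2['ID'])
if conflicts:
    # hypothetical policy: refuse to continue when the mapping is ambiguous
    raise ValueError(f'Ambiguous prefixes in df2: {conflicts}')
```

Because each key has at most 6 characters, the check does at most 5 set lookups per key, so it stays cheap for the ~7K rows of df2.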

Tags: python pandas merge

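【Solution 1】:

One solution to your problem is to build temporary columns, one for each possible prefix length. In your case that is 6 columns. You can then convert df2 into a dictionary and look up each prefix in that dictionary, and finally merge the columns with combine_first. Note that the order of your column list matters for combine_first: it keeps the first non-null value, so going from the shortest prefix to the longest means the shortest matching prefix wins.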

import pandas as pd

ids = ['A1A1A1', 'A1A1A2', 'B1B1B1']
df1 = pd.DataFrame(ids, columns=['ID'], index=range(3))
df2 = pd.DataFrame.from_dict({'ID': {0: 'A1A1A1', 1: 'B', 2: 'C1C'},
                              'Country': {0: 'France', 1: 'Egypt', 2: 'Egypt'}})

# build dictionary from df2 (dictionary is probably faster than .loc). Also it is cleaner
map_id_dict = df2.set_index('ID')['Country'].to_dict()

# Define target column
df1['Country'] = None
# Build temporary columns
cols = [f'ID_{i}' for i in range(1, 7)]
for i, col in enumerate(cols):
    # lookup ids in dictionary from df2
    df1[col] = df1['ID'].str[:i + 1].apply(lambda x: map_id_dict.get(x))
    df1['Country'] = df1['Country'].combine_first(df1[col])
# drop temporary columns
df1 = df1.drop(columns=cols)

Output:

       ID Country
0  A1A1A1  France
1  A1A1A2    None
2  B1B1B1   Egypt
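
A possible variant of the same idea, sketched here under assumptions and not part of the original answer: it skips the temporary columns and only tries prefix lengths that actually occur in df2. The names `lookup`, `lengths` and `country` are illustrative. As in the answer, a shorter prefix takes precedence over a longer one.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['A1A1A1', 'A1A1A2', 'B1B1B1']})
df2 = pd.DataFrame({'ID': ['A1A1A1', 'B', 'C1C'],
                    'Country': ['France', 'Egypt', 'Egypt']})

# prefix -> country dictionary built from df2
lookup = dict(zip(df2['ID'], df2['Country']))

# only the prefix lengths that really occur in df2, shortest first
lengths = sorted(int(n) for n in df2['ID'].str.len().unique())

# start with an all-missing result and fill it one prefix length at a time
country = pd.Series([None] * len(df1), index=df1.index, dtype='object')
for length in lengths:
    # vectorised dictionary lookup of the first `length` characters
    matched = df1['ID'].str[:length].map(lookup)
    # keep values already found, take the new match only where still missing
    country = country.where(country.notna(), matched)

df1['Country'] = country
print(df1)
```

The loop runs once per distinct prefix length (at most 6 here), and each pass is a single dictionary lookup over the whole column, so the work grows with the size of df1 rather than with len(df1) * len(df2). IDs without a matching prefix end up with a missing value.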

【Discussion】: