Adding new columns to a dataframe from another dataframe according to a list of indices in the other dataframe: answers

【Question title】: Adding new columns to dataframe from other dataframe according to list of indices in other dataframe
【Posted】: 2019-07-22 11:12:05
【Question description】:

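I have two dataframes. Each row of dataframe A holds a list of indices that correspond to entries in dataframe B, plus a set of other values. I want to join the two dataframes so that each entry in B gets the other values from the row of A whose list of indices contains that entry's index.

So far I have found a way to extract the rows of B for the list of indices in each row of A, but only row by row (from this answer), and I'm not sure where to go from here. I'm also not sure whether there is a better way to handle this dynamically in pandas, since the lists of indices may vary in size.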

import pandas as pd
import numpy as np

# Inputs
A = pd.DataFrame.from_dict({
    "indices": [[0,1],[2,3],[4,5]],
    "a1": ["a","b","c"],
    "a2": [100,200,300]
})

print(A)
>>    indices a1   a2
>> 0  [0, 1]  a  100
>> 1  [2, 3]  b  200
>> 2  [4, 5]  c  300

B = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60]
})

print(B)
>>     b
>> 0  10
>> 1  20
>> 2  30
>> 3  40
>> 4  50
>> 5  60

# This is the desired output
out = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60],
    "a1": ["a","a", "b", "b", "c", "c"],
    "a2": [100,100,200,200,300,300]
})

print(out)
>>      b a1   a2
>> 0  10  a  100
>> 1  20  a  100
>> 2  30  b  200
>> 3  40  b  200
>> 4  50  c  300
>> 5  60  c  300

【Question discussion】:

Tags: python pandas dataframe

【Solution 1】:

If you have pandas >= 0.25, you can use explode:

C = A.explode('indices')

This gives:

  indices a1   a2
0       0  a  100
0       1  a  100
1       2  b  200
1       3  b  200
2       4  c  300
2       5  c  300

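Then do:

# Match each row of B (by its index) to the exploded row of C that lists it
output = pd.merge(B, C, left_index=True, right_on='indices')
# The merged frame carries C's index (0, 0, 1, 1, 2, 2); restore B's labels from 'indices'
output.index = output.indices.values
output.drop('indices', axis=1, inplace=True)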

Final output:

    b a1   a2
0  10  a  100
1  20  a  100
2  30  b  200
3  40  b  200
4  50  c  300
5  60  c  300
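
The explode-then-combine steps above can also be written as a single self-contained script. The sketch below is a variation rather than the answer's exact code: it swaps `pd.merge` for `DataFrame.join` so that B's own index is kept without a separate reassignment step, and it casts the exploded column to `int64` on the assumption that every list entry is an integer.

```python
import pandas as pd

A = pd.DataFrame({
    "indices": [[0, 1], [2, 3], [4, 5]],
    "a1": ["a", "b", "c"],
    "a2": [100, 200, 300],
})
B = pd.DataFrame({"b": [10, 20, 30, 40, 50, 60]})

# One row per list element (needs pandas >= 0.25). explode leaves the
# column as object dtype, so cast it before using it as a join key.
C = A.explode("indices").astype({"indices": "int64"})

# Left join on B's index: B keeps its own labels and row order.
out = B.join(C.set_index("indices"))
print(out)
```

Because this is a left join, any row of B whose index appears in none of the lists would be kept with NaN in a1 and a2 instead of being dropped.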

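【Discussion】:

IMO this is the best answer for anyone on >= 0.25. explode is how you would do this in other frameworks (e.g. Spark). I think it's generally better to use drop rather than del to remove a column, right?
Thanks Jeff! Yes, you're right. Changed it to drop :)
I'll accept this one because of @Jeff's comment that it resembles how other frameworks handle this kind of operation.
@AnkurSinha Just realized the index is wrong: it should be [0,1,2,3,4,5] rather than [0,0,1,1,2,2]. It looks like the output dataframe uses C's index rather than B's.
Alternatively, you can use output.index = output.indices.values before dropping the output.indices column.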

【Solution 2】:

Using pd.merge:

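# Move a1/a2 into the index so they survive the reshape
df2 = pd.DataFrame(A.set_index(['a1','a2']).indices)

# One column per list position, then stacked into one row per index value
df = pd.DataFrame(df2.indices.values.tolist(), index=df2.index).stack().reset_index().drop('level_2', axis=1).set_index(0)

pd.merge(B,df,left_index=True, right_index=True)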

Output:

    b a1   a2
0  10  a  100
1  20  a  100
2  30  b  200
3  40  b  200
4  50  c  300
5  60  c  300

【Discussion】:

【Solution 3】:

Here you go:

helper = A.indices.apply(pd.Series).stack().reset_index(level=1, drop=True)
A = A.reindex(helper.index).drop(columns=['indices'])
A['indices'] = helper
B = B.merge(A, left_index=True, right_on='indices').drop(columns=['indices']).reset_index(drop=True)

Result:

    b   a1  a2
0   10  a   100
1   20  a   100
2   30  b   200
3   40  b   200
4   50  c   300
5   60  c   300

【Discussion】:

【Solution 4】:

You can also use melt instead of stack, but it is more complicated because you have to drop the columns you don't need:

import pandas as pd
import numpy as np

# Inputs
A = pd.DataFrame.from_dict({
    "indices": [[0,1],[2,3],[4,5]],
    "a1": ["a","b","c"],
    "a2": [100,200,300]
})

B = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60]
})

AA = pd.concat([A.indices.apply(pd.Series), A], axis=1)
AA.drop(['indices'], axis=1, inplace=True)
print(AA)

   0  1 a1   a2
0  0  1  a  100
1  2  3  b  200
2  4  5  c  300

AA = AA.melt(id_vars=['a1', 'a2'], value_name='val').drop(['variable'], axis=1)
print(AA)

  a1   a2  val
0  a  100    0
1  b  200    2
2  c  300    4
3  a  100    1
4  b  200    3
5  c  300    5

pd.merge(AA.set_index(['val']), B, left_index=True, right_index=True)

Out[8]: 
  a1   a2   b
0  a  100  10
2  b  200  30
4  c  300  50
1  a  100  20
3  b  200  40
5  c  300  60
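
The merged frame above follows the row order produced by melt and puts a1 and a2 before b, so it does not yet match the layout requested in the question. A minimal sketch of one way to tidy it, assuming the same A and B as above (the names `merged` and `tidy` are introduced here only for illustration):

```python
import pandas as pd

A = pd.DataFrame({
    "indices": [[0, 1], [2, 3], [4, 5]],
    "a1": ["a", "b", "c"],
    "a2": [100, 200, 300],
})
B = pd.DataFrame({"b": [10, 20, 30, 40, 50, 60]})

# Same reshape as above: one column per list position, then melt to long form.
AA = pd.concat([A["indices"].apply(pd.Series), A.drop(columns="indices")], axis=1)
AA = AA.melt(id_vars=["a1", "a2"], value_name="val").drop(columns="variable")

merged = pd.merge(AA.set_index("val"), B, left_index=True, right_index=True)

# Sort by B's labels, clear any leftover index name, and put b first.
tidy = merged.sort_index().rename_axis(None)[["b", "a1", "a2"]]
print(tidy)
```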

【Discussion】:

【Solution 5】:

This solution handles index lists of different lengths.

A = pd.DataFrame.from_dict({
    "indices": [[0,1],[2,3],[4,5]],
    "a1": ["a","b","c"],
    "a2": [100,200,300]
})
A = A.indices.apply(pd.Series) \
    .merge(A, left_index = True, right_index = True) \
    .drop(["indices"], axis = 1)\
    .melt(id_vars = ['a1', 'a2'], value_name = "index")\
    .drop("variable", axis = 1)\
    .dropna()
A = A.set_index('index')
B = pd.DataFrame.from_dict({
    "b": [10,20,30,40,50,60]
})
B
B.merge(A,left_index=True,right_index=True)

Final output:

    b   a1  a2
0   10  a   100
1   20  a   100
2   30  b   200
3   40  b   200
4   50  c   300
5   60  c   300
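
The example above uses lists of equal length, so it does not show the variable-length case. The sketch below reruns the same melt-and-dropna idea on hypothetical ragged input (the lists `[0]`, `[1, 2, 3]` and `[4, 5]` are made up for illustration). It adds two steps that are not in the answer: a cast back to `int64`, because padding the shorter lists with NaN turns the melted values into floats, and a final `sort_index` to make the row order explicit.

```python
import pandas as pd

# Hypothetical input with index lists of different lengths
A = pd.DataFrame({
    "indices": [[0], [1, 2, 3], [4, 5]],
    "a1": ["a", "b", "c"],
    "a2": [100, 200, 300],
})
B = pd.DataFrame({"b": [10, 20, 30, 40, 50, 60]})

# One column per list position; shorter lists are padded with NaN.
wide = A["indices"].apply(pd.Series)

long = (
    wide.join(A[["a1", "a2"]])
    .melt(id_vars=["a1", "a2"], value_name="index")
    .drop(columns="variable")
    .dropna(subset=["index"])  # discard the NaN padding
)
# The padding made the values float; cast back so they match B's index.
long["index"] = long["index"].astype("int64")

result = B.merge(long.set_index("index"), left_index=True, right_index=True).sort_index()
print(result)
```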

【Discussion】: