Using multiple dictionaries to populate a pandas dataframe: answers

【Question title】: Using multiple dictionaries to populate a pandas dataframe
【Posted】: 2018-09-27 14:37:13
【Question description】:

I currently have a dataframe and want to fill in its values using pairs of dictionaries.

import pandas

# create count dataframe
range_of_years = range(2012, 2017)
topics = ['ecology','evolution','mathematics','biogeography','neutral theory']
topic_count_timeline = pandas.DataFrame(index = topics, columns = range_of_years)


# dictionary pair
count_dict = {2012: 10, 2013: 20, 2014: 12, 2015: 8, 2016: 9}
paper_topics_dict = {'ecology': 0.7, 'neutral theory': 0.3}

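I want to iterate over the dictionary keys, select the dataframe cell whose column and index match those keys, and add the product of the two dictionary values to that cell. That would give me this dataframe: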

               2012 2013 2014 2015 2016
ecology           7   14  8.4  5.6  6.3
evolution       NaN  NaN  NaN  NaN  NaN
mathematics     NaN  NaN  NaN  NaN  NaN
biogeography    NaN  NaN  NaN  NaN  NaN
neutral theory    3    6  3.6  2.4  2.7

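I intend to update the topic_count_timeline dataframe using many pairs of dictionaries like count_dict and paper_topics_dict, so that each new input is added to the cell's previous value rather than overwriting it.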

For example, if the dataframe is updated with another pair...

# Additional dictionaries
count_dict2 = {2012: 3, 2013: 2, 2014: 15, 2015: 16, 2016: 13}
paper_topics_dict2 = {'mathematics': 0.6, 'neutral theory': 0.4}

...then the dataframe would look like this:

               2012 2013 2014 2015 2016
ecology           7   14  8.4  5.6  6.3
evolution       NaN  NaN  NaN  NaN  NaN
mathematics     1.8  1.2    9  9.6  7.8
biogeography    NaN  NaN  NaN  NaN  NaN
neutral theory  4.2  6.8  9.6  8.8  7.9

Thanks.

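【Question comments】:

Are the dictionary pairs in a list? For example, L = [(count_dict, paper_topics_dict), (count_dict2, paper_topics_dict2)]?
Please remove everything from "I intend to update the topic_count_timeline dataframe using many pairs" to the end, because the accepted answer does not handle that case. Then everything is fine and I can also remove the update to my answer. Thanks.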

Tags: python python-3.x pandas dictionary dataframe

【Solution 1】:

I think you need:

for k, v in paper_topics_dict.items():
    topic_count_timeline.loc[k] = v

for k, v in count_dict.items():
    topic_count_timeline[k] *= v

print (topic_count_timeline)
               2012 2013 2014 2015 2016
ecology           7   14  8.4  5.6  6.3
evolution       NaN  NaN  NaN  NaN  NaN
mathematics     NaN  NaN  NaN  NaN  NaN
biogeography    NaN  NaN  NaN  NaN  NaN
neutral theory    3    6  3.6  2.4  2.7

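But if you are working with several pairs of dictionaries, it is better to process each pair by updating a defaultdict, then convert it to a Series, and to get a DataFrame add unstack, plus reindex to restore the missing columns and index values: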

from collections import defaultdict

import pandas as pd

count_dict = {2012: 10, 2013: 20, 2014: 12, 2015: 8, 2016: 9}
paper_topics_dict = {'ecology': 0.7, 'neutral theory': 0.3}

count_dict2 = {2012: 3, 2013: 2, 2014: 15, 2015: 16, 2016: 13}
paper_topics_dict2 = {'mathematics': 0.6, 'neutral theory': 0.4}

L = [(count_dict, paper_topics_dict), (count_dict2, paper_topics_dict2)]

d = defaultdict(float)
for a, b in L:
    for k, v in b.items():
        for k2, v2 in a.items():
            d[(k, k2)] += v*v2

df = pd.Series(d).unstack().reindex(index=topics, columns=range_of_years)
print (df)
                2012  2013  2014  2015  2016
ecology          7.0  14.0   8.4   5.6   6.3
evolution        NaN   NaN   NaN   NaN   NaN
mathematics      1.8   1.2   9.0   9.6   7.8
biogeography     NaN   NaN   NaN   NaN   NaN
neutral theory   4.2   6.8   9.6   8.8   7.9

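【Comments】:

Can I keep adding more dictionary-pair tuples to the list L for the dataframe? The rate at which I can create dictionaries is limited by the API I use to pull the data, and I would like to be able to keep adding to the dataframe. Thanks
@Lorcán - Sure, adding them is no problem.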
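
For pairs that arrive one at a time, rebuilding the whole list L on every update may be unnecessary. A minimal sketch, assuming a running-total frame is acceptable: the helper name add_pair and the variable totals are hypothetical and not part of the answer above. It relies on DataFrame.add with fill_value=0, which treats a cell missing on one side as 0 and leaves it NaN only when it is missing on both sides.

```python
import pandas as pd

topics = ['ecology', 'evolution', 'mathematics', 'biogeography', 'neutral theory']
range_of_years = range(2012, 2017)


def add_pair(totals, counts, weights):
    """Return totals plus the weight * count products of one dictionary pair.

    add_pair is a hypothetical helper used only for this illustration.
    """
    # Dict of dicts: outer keys become columns (years), inner keys rows (topics).
    contribution = pd.DataFrame(
        {year: {topic: w * n for topic, w in weights.items()}
         for year, n in counts.items()}
    )
    # Alignment may reorder the labels, so restore the original order afterwards.
    summed = totals.add(contribution, fill_value=0)
    return summed.reindex(index=topics, columns=range_of_years)


# Start from an all-NaN float frame and fold in pairs as they become available.
totals = pd.DataFrame(index=topics, columns=range_of_years, dtype=float)
totals = add_pair(totals,
                  {2012: 10, 2013: 20, 2014: 12, 2015: 8, 2016: 9},
                  {'ecology': 0.7, 'neutral theory': 0.3})
totals = add_pair(totals,
                  {2012: 3, 2013: 2, 2014: 15, 2015: 16, 2016: 13},
                  {'mathematics': 0.6, 'neutral theory': 0.4})
print(totals)
```

Topics that never appear in any pair stay NaN, which matches the layout the question asks for.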

【Solution 2】:

You can use combine_first and build a new df from your dicts

topic_count_timeline.combine_first(pd.DataFrame(data=np.array(list(count_dict.values()))*np.array(list(paper_topics_dict.values()))[:,None],columns=count_dict.keys(),index=paper_topics_dict.keys()))
Out[683]: 
                2012  2013  2014  2015  2016
biogeography     NaN   NaN   NaN   NaN   NaN
ecology          7.0  14.0   8.4   5.6   6.3
evolution        NaN   NaN   NaN   NaN   NaN
mathematics      NaN   NaN   NaN   NaN   NaN
neutral theory   3.0   6.0   3.6   2.4   2.7

More info:

pd.DataFrame(data=np.array(list(count_dict.values()))*np.array(list(paper_topics_dict.values()))[:,None],columns=count_dict.keys(),index=paper_topics_dict.keys())
Out[684]: 
                2012  2013  2014  2015  2016
ecology          7.0  14.0   8.4   5.6   6.3
neutral theory   3.0   6.0   3.6   2.4   2.7
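
The `[:, None]` multiplication above is an outer product, which np.outer states more directly. Note also that combine_first only fills cells that are NaN in the calling frame, so it does not sum overlapping cells from a second pair. The sketch below contrasts the two behaviours; pair_frame is a hypothetical helper name, and using DataFrame.add for the summing case is my assumption about what the question wants, not something this answer proposes.

```python
import numpy as np
import pandas as pd

count_dict = {2012: 10, 2013: 20, 2014: 12, 2015: 8, 2016: 9}
paper_topics_dict = {'ecology': 0.7, 'neutral theory': 0.3}
count_dict2 = {2012: 3, 2013: 2, 2014: 15, 2015: 16, 2016: 13}
paper_topics_dict2 = {'mathematics': 0.6, 'neutral theory': 0.4}


def pair_frame(counts, weights):
    """Build a topics x years frame of weight * count (hypothetical helper)."""
    values = np.outer(list(weights.values()), list(counts.values()))
    return pd.DataFrame(values, index=list(weights), columns=list(counts))


first = pair_frame(count_dict, paper_topics_dict)
second = pair_frame(count_dict2, paper_topics_dict2)

# combine_first keeps the value from `first` wherever it is not NaN,
# so the shared 'neutral theory' row is NOT summed.
filled = first.combine_first(second)

# add with fill_value=0 sums the shared row and keeps the non-shared rows.
summed = first.add(second, fill_value=0)

print(filled.loc['neutral theory'])
print(summed.loc['neutral theory'])
```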

【Comments】:

【Solution 3】:

I would use a function and pd.DataFrame.pipe for this.

You can then use the pipe syntax for subsequent dictionaries.

def update_data(df, counts, topics):
    for k, v in topics.items():
        for k2, v2 in counts.items():
            df.loc[k, k2] = v*v2
    return df

count_dict = {2012: 10, 2013: 20, 2014: 12, 2015: 8, 2016: 9}
paper_topics_dict = {'ecology': 0.7, 'neutral theory': 0.3}

df = topic_count_timeline.pipe(update_data, count_dict, paper_topics_dict)

print(df)

#                2012 2013 2014 2015 2016
# ecology           7   14  8.4  5.6  6.3
# evolution       NaN  NaN  NaN  NaN  NaN
# mathematics     NaN  NaN  NaN  NaN  NaN
# biogeography    NaN  NaN  NaN  NaN  NaN
# neutral theory    3    6  3.6  2.4  2.7
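
Because update_data assigns with `=`, piping a second pair through it would overwrite cells that both pairs touch. If accumulation is wanted, as the question describes, one possible variant is sketched below. The names add_data, weights and topic_names are hypothetical, and treating a NaN cell as 0 before adding is an assumption made for this example.

```python
import pandas as pd


def add_data(df, counts, weights):
    """Add weight * count to each matching cell, treating NaN as 0.

    Hypothetical accumulating variant of update_data; modifies df in place.
    """
    for topic, weight in weights.items():
        for year, count in counts.items():
            current = df.loc[topic, year]
            base = 0.0 if pd.isna(current) else current
            df.loc[topic, year] = base + weight * count
    return df


topic_names = ['ecology', 'evolution', 'mathematics', 'biogeography', 'neutral theory']
count_dict = {2012: 10, 2013: 20, 2014: 12, 2015: 8, 2016: 9}
paper_topics_dict = {'ecology': 0.7, 'neutral theory': 0.3}
count_dict2 = {2012: 3, 2013: 2, 2014: 15, 2015: 16, 2016: 13}
paper_topics_dict2 = {'mathematics': 0.6, 'neutral theory': 0.4}

# Chain one pipe call per dictionary pair.
df = (
    pd.DataFrame(index=topic_names, columns=range(2012, 2017), dtype=float)
    .pipe(add_data, count_dict, paper_topics_dict)
    .pipe(add_data, count_dict2, paper_topics_dict2)
)
print(df)
```

The cell-by-cell loop is easy to follow but slow for large frames; a vectorised addition would likely scale better.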

【Comments】: