Python/pandas: data frame from series of dict: optimization

【Question title】: Python/pandas: data frame from series of dict: optimization
【Posted】: 2016-06-06 14:52:14
【Question】:

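I have a pandas Series of dicts, and I want to convert it to a DataFrame with the same index.

The only way I have found is to go through the Series' to_dict method, which is not very efficient because it falls back to pure Python instead of staying in numpy/pandas/cython.

Do you have any suggestions for a better approach?

Thanks a lot.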

>>> import pandas as pd
>>> flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
>>> flagInfoSeries
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object
>>> pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
0   1   2
1  10  20

【Comments】:

Tags: python pandas python-3.4

【Solution 1】:

I think you can use a list comprehension:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
print(flagInfoSeries)
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object

print(pd.DataFrame(flagInfoSeries.to_dict()).T)
    a   b
0   1   2
1  10  20

print(pd.DataFrame([x for x in flagInfoSeries]))
    a   b
0   1   2
1  10  20

Timings:

In [203]: %timeit pd.DataFrame(flagInfoSeries.to_dict()).T
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 554 µs per loop

In [204]: %timeit pd.DataFrame([x for x in flagInfoSeries])
The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 361 µs per loop

In [209]: %timeit flagInfoSeries.apply(lambda dict: pd.Series(dict))
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 751 µs per loop
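The timings above come from a two-element Series, where fixed overhead dominates. As a rough sketch, the same three approaches can be compared outside IPython with the standard `timeit` module on a larger input. The 1,000-row Series and the names `series`, `candidates` and `timings` are assumptions made for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical larger input: 1,000 small dicts instead of the two in the example.
series = pd.Series([{'a': i, 'b': 2 * i} for i in range(1000)])

# The three approaches timed in this answer.
candidates = {
    'to_dict + T': lambda: pd.DataFrame(series.to_dict()).T,
    'list comprehension': lambda: pd.DataFrame([x for x in series]),
    'apply(pd.Series)': lambda: series.apply(pd.Series),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    timings[name] = best * 1000
    print('{:<20s} {:8.2f} ms per call'.format(name, timings[name]))
```

Taking the minimum over repeats mirrors the "best of 3" figure that `%timeit` reports.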

Edit:

If you need to keep the index, try adding index=flagInfoSeries.index to the DataFrame constructor:

print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))

Timing:

In [257]: %timeit pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
1000 loops, best of 3: 350 µs per loop

Example:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
flagInfoSeries.index = [2,8]
print(flagInfoSeries)
2      {'a': 1, 'b': 2}
8    {'a': 10, 'b': 20}

print(pd.DataFrame(flagInfoSeries.to_dict()).T)
    a   b
2   1   2
8  10  20

print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))
    a   b
2   1   2
8  10  20
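The index-preserving variant can be wrapped in a small helper and checked against the `to_dict` approach from the question. This is only a sketch: `dicts_to_frame` is a hypothetical name, and `Series.tolist()` is used here in place of the explicit list comprehension, which should be equivalent for this purpose.

```python
import pandas as pd


def dicts_to_frame(series):
    """Expand a Series of dicts into a DataFrame that keeps the Series index."""
    # tolist() yields the dicts in order; passing the index restores the labels.
    return pd.DataFrame(series.tolist(), index=series.index)


flag_info = pd.Series([{'a': 1, 'b': 2}, {'a': 10, 'b': 20}], index=[2, 8])

frame = dicts_to_frame(flag_info)
reference = pd.DataFrame(flag_info.to_dict()).T  # approach from the question

print(frame)
print(list(frame.index) == list(reference.index))
```

If the dicts do not all share the same keys, the missing entries should come out as NaN, so the resulting column dtypes may differ from the all-integer case shown here.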

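【Comments】:

Yes, so your computer is faster, but your code still wins :)
Yes, you are right. I wanted to add a comparison on my PC. :)
Thanks for the suggestion. There is indeed a performance improvement... but the index is not kept: the list comprehension gives a list [{mydict}, ...] with no index, whereas to_dict gives a dict {index: {mydict}, ...}. I think I will leave it as it is for now.
The solution has been modified, please check.
It is even faster with the index!

【Solution 2】: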

This avoids to_dict, but apply can also be slow:

flagInfoSeries.apply(lambda dict: pd.Series(dict))

Edit: I see jezrael added a timing comparison. Here is mine:

%timeit flagInfoSeries.apply(lambda dict: pd.Series(dict))
1000 loops, best of 3: 935 µs per loop
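As a side note, the lambda above names its argument `dict`, which shadows the built-in inside the lambda body. A minimal sketch of the same idea passes the `pd.Series` constructor directly to `apply`. The `flag_info` and `expanded` names are illustrative, and this is not expected to be faster than the lambda version.

```python
import pandas as pd

flag_info = pd.Series([{'a': 1, 'b': 2}, {'a': 10, 'b': 20}], index=[2, 8])

# Each dict is turned into a Series; apply stacks those rows into a DataFrame
# that keeps the original index of flag_info.
expanded = flag_info.apply(pd.Series)
print(expanded)
```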

【Comments】:

Thanks. I had already tried that, but indeed, apply is slow.