【Question title】: Pandas: Most efficient way to make dictionary of dictionaries from DataFrame columns
【Posted】: 2016-01-15 14:35:05
【Question】:

import pandas as pd
import numpy as np
import random

labels = ["c1","c2","c3"]
c1 = ["one","one","one","two","two","three","three","three","three"]
c2 = [random.random() for i in range(len(c1))]
c3 = ["alpha","beta","gamma","alpha","gamma","alpha","beta","gamma","zeta"]
DF = pd.DataFrame(np.array([c1,c2,c3])).T
DF.columns = labels

The DataFrame looks like this:

      c1               c2     c3
0    one   0.440958516531  alpha
1    one   0.476439953723   beta
2    one   0.254235673552  gamma
3    two   0.882724336464  alpha
4    two    0.79817899139  gamma
5  three   0.677464637887  alpha
6  three   0.292927670096   beta
7  three  0.0971956881825  gamma
8  three   0.993934915508   zeta

The only way I can think of to build the dictionary is:

D_greek_value = {}

for greek in set(DF["c3"]):
    D_c1_c2 = {}
    for i in range(DF.shape[0]):
        row = DF.iloc[i,:]
        if row[2] == greek:
            D_c1_c2[row[0]] = row[1]
    D_greek_value[greek] = D_c1_c2
D_greek_value

The resulting dictionary looks like this:

{'alpha': {'one': '0.67919712421',
  'three': '0.67171020684',
  'two': '0.571150669821'},
 'beta': {'one': '0.895090207979', 'three': '0.489490074662'},
 'gamma': {'one': '0.964777504708',
  'three': '0.134397632659',
  'two': '0.10302290374'},
 'zeta': {'three': '0.0204226923557'}}

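I don't want to assume that c1 comes in blocks (with all the "one" rows next to each other). I'm doing this on a CSV of a few hundred MB, and I feel like I'm going about it the wrong way. If you have any ideas, please help!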

【Question comments】:

    Tags: python pandas hash machine-learning dataframe


    【Solution 1】:

    IIUC, you can let groupby handle most of the work:

    >>> result = df.groupby("c3")[["c1","c2"]].apply(lambda x: dict(x.values)).to_dict()
    >>> pprint.pprint(result)
    {'alpha': {'one': 0.440958516531,
               'three': 0.677464637887,
               'two': 0.8827243364640001},
     'beta': {'one': 0.47643995372299996, 'three': 0.29292767009599996},
     'gamma': {'one': 0.254235673552,
               'three': 0.0971956881825,
               'two': 0.79817899139},
     'zeta': {'three': 0.993934915508}}
    

    Some explanation: first we group by c3 and select the columns c1 and c2. That gives us the groups we want to turn into dictionaries:

    >>> grouped = df.groupby("c3")[["c1", "c2"]]
    >>> grouped.apply(lambda x: print(x,"\n","--")) # just for display purposes
          c1                   c2
    0    one    0.679926178687387
    3    two  0.11495090934413166
    5  three   0.7458197179794177 
     --
          c1                   c2
    0    one    0.679926178687387
    3    two  0.11495090934413166
    5  three   0.7458197179794177 
     --
          c1                   c2
    1    one  0.12943266757277916
    6  three  0.28944292691097817 
     --
          c1                   c2
    2    one  0.36642834809341274
    4    two   0.5690944224514624
    7  three   0.7018221838129789 
     --
          c1                  c2
    8  three  0.7195852795555373 
     --
    

    Given any one of these sub-frames, say the second-to-last one, we need a way to turn it into a dictionary. For example:

    >>> d3
          c1        c2
    2    one  0.366428
    4    two  0.569094
    7  three  0.701822
    

    If we try dict or to_dict, we don't get what we want, because the index and the column labels get in the way:

    >>> dict(d3)
    {'c1': 2      one
    4      two
    7    three
    Name: c1, dtype: object, 'c2': 2    0.366428
    4    0.569094
    7    0.701822
    Name: c2, dtype: float64}
    >>> d3.to_dict()
    {'c1': {2: 'one', 4: 'two', 7: 'three'}, 'c2': {2: 0.36642834809341279, 4: 0.56909442245146236, 7: 0.70182218381297889}}
    

    But we can sidestep this by dropping down to the underlying data with .values, which can then be passed to dict:

    >>> d3.values
    array([['one', 0.3664283480934128],
           ['two', 0.5690944224514624],
           ['three', 0.7018221838129789]], dtype=object)
    >>> dict(d3.values)
    {'three': 0.7018221838129789, 'one': 0.3664283480934128, 'two': 0.5690944224514624}
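
As a side note, the same mapping can be reached without leaving pandas. The following is a minimal sketch, not part of the original answer: it rebuilds a stand-in for d3 (values rounded, constructed by hand for illustration) and compares dict(d3.values) with the set_index route, which makes c1 the index so that to_dict on the c2 column yields the c1-to-c2 mapping directly.

```python
import pandas as pd

# Hypothetical stand-in for the sub-frame d3 shown above (values rounded).
d3 = pd.DataFrame(
    {"c1": ["one", "two", "three"], "c2": [0.366428, 0.569094, 0.701822]},
    index=[2, 4, 7],
)

# Route 1: drop down to the raw array of (key, value) rows and feed it to dict.
via_values = dict(d3.values)

# Route 2: make c1 the index, then convert the c2 Series to a dict.
via_index = d3.set_index("c1")["c2"].to_dict()

print(via_values == via_index)
```

Either route ignores the original row index (2, 4, 7); the set_index version has the advantage of naming the key and value columns explicitly.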
    

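    So if we apply this to every group, we get a Series whose index holds the c3 keys we want and whose values are the dictionaries, and we can turn that Series into a dictionary with .to_dict():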

    >>> result = df.groupby("c3")[["c1", "c2"]].apply(lambda x: dict(x.values))
    >>> result
    c3
    alpha    {'three': '0.7458197179794177', 'one': '0.6799...
    beta     {'one': '0.12943266757277916', 'three': '0.289...
    gamma    {'three': '0.7018221838129789', 'one': '0.3664...
    zeta                       {'three': '0.7195852795555373'}
    dtype: object
    >>> result.to_dict()
    {'zeta': {'three': '0.7195852795555373'}, 'gamma': {'three': '0.7018221838129789', 'one': '0.36642834809341274', 'two': '0.5690944224514624'}, 'beta': {'one': '0.12943266757277916', 'three': '0.28944292691097817'}, 'alpha': {'three': '0.7458197179794177', 'one': '0.679926178687387', 'two': '0.11495090934413166'}}
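
Putting the steps together, here is a self-contained sketch of the groupby approach. It is an illustration under assumptions rather than the answer's exact session: the c2 values are rounded from the question's table, and the frame is built from a dict of columns so that c2 stays numeric (building it through np.array([c1, c2, c3]) coerces every value to a string, which is why some outputs above show quoted numbers). A dict comprehension over the groups is included as an equivalent formulation that avoids apply.

```python
import pandas as pd

# Illustrative data: rounded values from the question's table.
df = pd.DataFrame({
    "c1": ["one", "one", "one", "two", "two", "three", "three", "three", "three"],
    "c2": [0.44, 0.48, 0.25, 0.88, 0.80, 0.68, 0.29, 0.10, 0.99],
    "c3": ["alpha", "beta", "gamma", "alpha", "gamma", "alpha", "beta", "gamma", "zeta"],
})

# Group by c3, turn each group's (c1, c2) rows into a dict,
# then convert the resulting Series of dicts into a dict of dicts.
result = df.groupby("c3")[["c1", "c2"]].apply(lambda g: dict(g.values)).to_dict()

# Equivalent formulation without apply: iterate over the groups directly.
alt = {greek: dict(zip(g["c1"], g["c2"])) for greek, g in df.groupby("c3")}

print(result == alt)
```

Both versions visit each row once within its group, so they do not depend on the c1 values being contiguous.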
    

    【Comments】:

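    • Very nice. I wonder whether this is faster than what I posted. I'd expect groupby to be very fast, but the lambda might slow it down. I'm too lazy to time it, though.
    • @StevenRumbalski: Me too. :-) I tried to see whether I could get the same result using only vectorized operations but bounced off; someone else might have something cleverer. But I think you've put your finger on the big problem (too many iterations), and everything beyond that is minor by comparison.
    • @DSM I know how to use lambda functions for sorting, but what exactly is happening from ".apply" to ".to_dict()"?
    • @O.rka: I've added some explanation that breaks it down step by step.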
    【Solution 2】:

    You are iterating over the whole dataframe once for every unique Greek letter. It is better to iterate over it only once.

    Since you want a dictionary of dictionaries, you can use collections.defaultdict with dict as the default constructor for the nested dictionaries:

    from collections import defaultdict
    
    result = defaultdict(dict)
    for dx, num_word, val, greek in DF.itertuples():
        result[greek][num_word] = val
    

    Or use a regular dictionary and a call to setdefault to create the nested dictionaries:

    result = {}
    for dx, num_word, val, greek in DF.itertuples():
        result.setdefault(greek, {})[num_word] = val
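
For readers who want to run the single-pass idea end to end, here is a small self-contained sketch. It is a variation on the loops above rather than the answer's own code: it zips the three columns by name instead of unpacking DF.itertuples(), so it does not depend on the column order, and it uses rounded values from the question's table as illustrative data.

```python
from collections import defaultdict

import pandas as pd

# Illustrative data: rounded values from the question's table.
DF = pd.DataFrame({
    "c1": ["one", "one", "one", "two", "two", "three", "three", "three", "three"],
    "c2": [0.44, 0.48, 0.25, 0.88, 0.80, 0.68, 0.29, 0.10, 0.99],
    "c3": ["alpha", "beta", "gamma", "alpha", "gamma", "alpha", "beta", "gamma", "zeta"],
})

# One pass over the rows; missing outer keys get an empty dict automatically.
nested = defaultdict(dict)
for num_word, val, greek in zip(DF["c1"], DF["c2"], DF["c3"]):
    nested[greek][num_word] = val

# Convert to a plain dict so later lookups of unknown keys raise KeyError.
result = dict(nested)
print(sorted(result))
```

If the same (c3, c1) pair appears more than once, the last row wins, which matches the behavior of the loops above.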
    

    【Comments】:
