【Question title】: Pivot irregular dictionary of lists into pandas DataFrame
【Posted】: 2016-01-11 17:37:15
【Question】:

(or list of lists... I just edited)

Is there an existing python/pandas method to pivot a structure like this

food2 = {}
food2["apple"]   = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"]  = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

into a pivot table like this?

+---------+-------+-----+-------+------+--------+--------+-----+
|         | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple   | 1     |     | 1     |      |        |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1     |     |       | 1    | 1      |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot  |       | 1   |       | 1    |        | 1      |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish |       | 1   |       |      |        |        | 1   |
+---------+-------+-----+-------+------+--------+--------+-----+

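Naively, I would probably just loop through the dictionary. I can see how to use map on each inner list, but I don't know how to join/stack the results across the dictionary. Once they are joined, I could just use pandas.pivot_table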

import pandas as pd

for key in food2:
    attrlist = food2[key]
    # pair each attribute with its key: [key, attribute]
    onefruit_pairs = [[key, x] for x in attrlist]
    one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
    print(one_fruit_frame)

     fruit    attr
0  bananna   fruit
1  bananna  yellow
2  bananna    long
    fruit    attr
0  carrot     veg
1  carrot  orange
2  carrot    long
   fruit   attr
0  apple  fruit
1  apple  round
     fruit attr
0  raddish  veg
1  raddish  red
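
For reference, the join/stack step described above could be sketched as follows. This is one possible way to finish the naive approach, not necessarily the fastest: all (item, attribute) pairs go into a single long frame, and pandas.pivot_table then counts them. The column names "item" and "attr" and the helper column "present" are made up for this illustration.

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# One (item, attribute) pair per row, stacked across the whole dictionary.
pairs = [(item, attr) for item, attrs in food2.items() for attr in attrs]
long_frame = pd.DataFrame(pairs, columns=["item", "attr"])

# "present" is a hypothetical helper column: summing it counts each pair,
# and fill_value=0 marks the combinations that never occur.
table = long_frame.assign(present=1).pivot_table(
    index="item", columns="attr", values="present", aggfunc="sum", fill_value=0
)
print(table)
```

The result has one row per item and one column per attribute, with columns in alphabetical order rather than the order shown in the desired table.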

【Comments】:

    Tags: python pandas pivot-table


    【Solution 1】:

    An answer using pandas.

    import pandas as pd

    # Test data
    food2 = {}
    food2["apple"]   = ["fruit", "round"]
    food2["bananna"] = ["fruit", "yellow", "long"]
    food2["carrot"]  = ["veg", "orange", "long"]
    food2["raddish"] = ["veg", "red"]
    
    # one column per item; wrapping each list in a Series pads shorter lists with NaN
    df = pd.DataFrame({k: pd.Series(v) for k, v in food2.items()})
    # pivoting to long format
    df = pd.melt(df, var_name='item', value_name='categ')
    # cross-tabulation
    df = pd.crosstab(df['item'], df['categ'])
    # sorting index, maybe not necessary    
    df.sort_index(inplace=True)
    df
    
    # categ    fruit  long  orange  red  round  veg  yellow
    # item                                                 
    # apple        1     0       0    0      1    0       0
    # bananna      1     1       0    0      0    0       1
    # carrot       0     1       1    0      0    1       0
    # raddish      0     0       0    1      0    1       0
    

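    【Comments】:

    • Tested with the same input used in the other answer. Surprisingly, performance on that input is not great (279936 rows by 1000 columns, very sparse).
    【Solution 2】:

    Pure Python: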

    from itertools import chain
    
    def count(d):
        cols = set(chain(*d.values()))
        yield ['name'] + list(cols)
        for row, values in d.items():
            yield [row] + [(col in values) for col in cols]
    

    Test:

    >>> food2 = {           
        "apple": ["fruit", "round"],
        "bananna": ["fruit", "yellow", "long"],
        "carrot": ["veg", "orange", "long"],
        "raddish": ["veg", "red"]
    }
    
    >>> list(count(food2))
    [['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
     ['bananna', True, False, True, True, False, False, False],
     ['carrot', True, True, False, False, True, False, False],
     ['apple', False, False, True, False, False, True, False],
     ['raddish', False, True, False, False, False, False, True]]
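
Since the question asks for a pandas DataFrame, here is a minimal sketch of how the rows yielded by count could be loaded into one. It assumes the first yielded row is the header, as in the output above. The names rows, header and frame are only illustrative.

```python
from itertools import chain

import pandas as pd


def count(d):
    # Same generator as in the answer: header row first, then one row per key.
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

rows = count(food2)
header = next(rows)  # the first yielded row holds the column names
frame = pd.DataFrame(list(rows), columns=header).set_index('name')

# True/False -> 1/0, and sort the columns because set order is arbitrary.
frame = frame.astype(int).sort_index(axis=1)
print(frame)
```

Sorting the columns is optional. It only makes the layout reproducible between runs, because the column order coming out of a set is not guaranteed.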
    

    [Update]

    Performance test:

    >>> from itertools import product
    >>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
    >>> attrs = labels[:1000]
    >>> import random
    >>> sample = {}
    >>> for k in labels:
    ...     sample[k] = random.sample(attrs, 5)
    >>> import time
    >>> n = time.time(); rows = list(count(sample)); print(time.time() - n)
    62.0367980003
    

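    It took less than 2 minutes on my busy machine (lots of Chrome tabs open) for 279936 rows by 1000 columns. Let me know if the performance is not acceptable.

    [Update]

    Testing the performance of the other answer: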

    >>> n = time.time(); \
    ...     df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
    ...     print(time.time() - n)
    72.0512290001
    

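    The next line (df = pd.melt(...)) was taking too long, so I canceled the test. Take this result with a grain of salt, because it was run on a busy machine.

    【Comments】:

    • Excellent. Do you have any intuition for how this would perform on hundreds of thousands of "fruits" and thousands of attributes (compared with some as-yet-unspecified pandas magic)?
    • I "had" to import itertools
    • This solution is optimized for simplicity, not performance. There is a lot of room for improvement, especially if you know the attributes beforehand. Updated with the missing "import".
    • Could you compare the performance of the answers when applied to the data?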
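
As a rough illustration of the "room for improvement" mentioned in the comments, the generator could take the attribute list as an argument when it is known beforehand, and could turn each value list into a set so that every membership test is a constant-time lookup. This is a sketch that has not been benchmarked. The name count_fast and its attrs parameter are hypothetical and not part of the original answer.

```python
from itertools import chain


def count_fast(d, attrs=None):
    # attrs: optional list of attributes known in advance (hypothetical parameter).
    # If it is not given, collect the attributes from the data and sort them.
    if attrs is None:
        attrs = sorted(set(chain.from_iterable(d.values())))
    yield ['name'] + list(attrs)
    for row, values in d.items():
        present = set(values)  # set lookup instead of scanning the list per column
        yield [row] + [col in present for col in attrs]


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

result = list(count_fast(food2))
for line in result:
    print(line)
```

With only a handful of attributes per item the gain from the set is likely small. Passing attrs mainly skips the extra pass over the data and fixes the column order.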