【Question title】: Python pandas explode (one-to-many relationship)
【Posted】: 2020-02-14 20:24:55
【Question description】:

Suppose I have the following dataframe with the columns name, preference, and fruits:

name   preference   fruits
adam    likes       apples
mike   dislikes     orange

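Suppose the dataframe above represents a one-to-many relationship, in which each value in the name column is paired with every combination of the preference and fruits values. For example, the output dataframe I am looking for is: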

name   preference   fruits
adam    likes       apples
adam    likes       orange
adam    dislikes    apples
adam    dislikes    orange
mike    likes       apples
mike    likes       orange
mike    dislikes    apples
mike    dislikes    orange

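I would like to know whether this is feasible. From what I have learned about pandas so far, I believe I will have to use groupby? Any help is appreciated. Thanks!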

【Question comments】:

    Tags: python python-3.x pandas dataframe pandas-groupby


    【Solution 1】:

    Isn't this just a cross product of the column values?

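    import pandas as pd

    # df is the two-row frame from the question (columns: name, preference, fruits).
    # Build every combination of the column values, then turn the index levels into columns.
    (pd.MultiIndex.from_product([df[col] for col in df],
                                names=df.columns)
       .to_frame().reset_index(drop=True)
    )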
    

    Output:

       name preference  fruits
    0  adam      likes  apples
    1  adam      likes  orange
    2  adam   dislikes  apples
    3  adam   dislikes  orange
    4  mike      likes  apples
    5  mike      likes  orange
    6  mike   dislikes  apples
    7  mike   dislikes  orange
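
For comparison, the same cross product can also be sketched with chained cross joins. This is not the answerer's method: it assumes pandas 1.2 or later, where `DataFrame.merge` accepts `how="cross"`, and the name `parts` and the `drop_duplicates()` step are illustrative additions.

```python
from functools import reduce

import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "name": ["adam", "mike"],
    "preference": ["likes", "dislikes"],
    "fruits": ["apples", "orange"],
})

# One single-column frame per column. drop_duplicates() is an added
# assumption: it keeps repeated values from multiplying the result.
parts = [df[[col]].drop_duplicates() for col in df.columns]

# Cross-join the single-column frames one after another
# (how="cross" requires pandas >= 1.2).
crossed = reduce(lambda left, right: left.merge(right, how="cross"), parts)

print(crossed)
```

With the two-row sample this yields 2 x 2 x 2 = 8 rows, one per combination.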
    

    【Comments】:

      【Solution 2】:

      I would use itertools.product:

      import pandas as pd
      from itertools import product
      
      
      df = pd.DataFrame({
          'name': ['adam', 'mike'],
          'preference': ['likes', 'dislikes'],
          'fruits': ['apples', 'oranges']
      })
      
      ndf = pd.DataFrame(
          product(*[df[c] for c in df.columns]),
          columns=df.columns
      )
      
      print(ndf)
      #    name preference   fruits
      # 0  adam      likes   apples
      # 1  adam      likes  oranges
      # 2  adam   dislikes   apples
      # 3  adam   dislikes  oranges
      # 4  mike      likes   apples
      # 5  mike      likes  oranges
      # 6  mike   dislikes   apples
      # 7  mike   dislikes  oranges
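
One caveat worth sketching: `product` repeats every value it is given, so a column with duplicated entries inflates the result. The following is a minimal sketch, assuming that only distinct values should be combined. The helper name `unique_product` and the three-row sample frame are hypothetical and not part of the original answer.

```python
from itertools import product

import pandas as pd

# Hypothetical input in which "name" and "preference" contain repeated values
df = pd.DataFrame({
    "name": ["adam", "mike", "adam"],
    "preference": ["likes", "dislikes", "likes"],
    "fruits": ["apples", "oranges", "pears"],
})


def unique_product(frame):
    """Return the cross product of the distinct values of every column."""
    # Series.unique() keeps values in order of first appearance
    levels = [frame[col].unique() for col in frame.columns]
    return pd.DataFrame(list(product(*levels)), columns=frame.columns)


# Naive product over the raw columns: 3 * 3 * 3 rows, including duplicates
naive = pd.DataFrame(
    list(product(*[df[col] for col in df.columns])),
    columns=df.columns,
)
ndf = unique_product(df)

print(len(naive))  # 27
print(len(ndf))    # 12 (2 names * 2 preferences * 3 fruits)
```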
      

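      As for speed, the itertools.product version also seems to be faster on this two-row frame (about 624 µs versus 3.51 ms per loop):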

      %%timeit
      ndf = pd.DataFrame(
          product(*[df[c] for c in df.columns]),
          columns=df.columns
      )
      # 624 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      
      
      %%timeit
      (pd.MultiIndex.from_product([df[col] for col in df],
                                 names=df.columns)
         .to_frame().reset_index(drop=True)
      )
      # 3.51 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
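
`%%timeit` only works inside IPython or Jupyter. Outside of them, a rough equivalent can be sketched with the standard-library `timeit` module. The function names and the `number` and `repeat` values below are arbitrary choices for illustration, and the absolute timings will differ by machine and pandas version.

```python
import timeit
from itertools import product

import pandas as pd

df = pd.DataFrame({
    "name": ["adam", "mike"],
    "preference": ["likes", "dislikes"],
    "fruits": ["apples", "oranges"],
})


def with_itertools(frame):
    """Cross product via itertools.product (Solution 2)."""
    return pd.DataFrame(
        list(product(*[frame[c] for c in frame.columns])),
        columns=frame.columns,
    )


def with_multiindex(frame):
    """Cross product via MultiIndex.from_product (Solution 1)."""
    return (
        pd.MultiIndex.from_product(
            [frame[c] for c in frame.columns], names=list(frame.columns)
        )
        .to_frame()
        .reset_index(drop=True)
    )


number = 100  # calls per timing run (arbitrary)
results = {}
for func in (with_itertools, with_multiindex):
    # repeat() returns the total seconds for `number` calls, once per run
    totals = timeit.repeat(lambda: func(df), number=number, repeat=3)
    results[func.__name__] = min(totals) / number
    print(f"{func.__name__}: {results[func.__name__] * 1e6:.0f} µs per call (best of 3)")
```

Taking the minimum over the runs is a common way to reduce noise from other processes. A two-row frame mostly measures fixed overhead, so the ranking may change on larger inputs.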
      

      【Comments】:
