【Question title】: Split pandas dataframe column list values to duplicate rows [duplicate]
【Posted】: 2019-12-28 06:10:33
【Question description】:

I have a dataframe that looks like this:

publication_title    authors                             type ...
title 1              ['author1', 'author2', 'author3']   proceedings
title 2              ['author4', 'author5']              collections
title 3              ['author6', 'author7']              books
.
.
. 

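What I want to do is take the column 'authors' and split the list inside it into several rows, duplicating all the other columns. I also want to store the result in a new column named 'author' and keep the original column.

The following is exactly what I want to achieve: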

publication_title    authors                             author          type ...
title 1              ['author1', 'author2', 'author3']   author1         proceedings
title 1              ['author1', 'author2', 'author3']   author2         proceedings
title 1              ['author1', 'author2', 'author3']   author3         proceedings
title 2              ['author4', 'author5']              author4         collections
title 2              ['author4', 'author5']              author5         collections
title 3              ['author6', 'author7']              author6         books
title 3              ['author6', 'author7']              author7         books
.
.
. 

I tried to do this with the explode method of the pandas DataFrame, but I could not find a way to store the result in a new column.

Thanks for your help.

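【Question discussion】:

    Tags: python-3.x pandas dataframe


    【Solution 1】:

    Since pandas 0.25.0 we have the explode method. First we copy the authors column and rename the copy at the same time with assign, then we explode this new column into rows, which duplicates the other columns: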

    df.assign(author=df['authors']).explode('author')
    

    Output

      publication_title                      authors         type   author
    0           title_1  [author1, author2, author3]  proceedings  author1
    0           title_1  [author1, author2, author3]  proceedings  author2
    0           title_1  [author1, author2, author3]  proceedings  author3
    1           title_2           [author4, author5]  collections  author4
    1           title_2           [author4, author5]  collections  author5
    2           title_3           [author6, author7]        books  author6
    2           title_3           [author6, author7]        books  author7
    

    If you want to get rid of the duplicated index, use reset_index:

    df.assign(author=df['authors']).explode('author').reset_index(drop=True)
    

    Output

      publication_title                      authors         type   author
    0           title_1  [author1, author2, author3]  proceedings  author1
    1           title_1  [author1, author2, author3]  proceedings  author2
    2           title_1  [author1, author2, author3]  proceedings  author3
    3           title_2           [author4, author5]  collections  author4
    4           title_2           [author4, author5]  collections  author5
    5           title_3           [author6, author7]        books  author6
    6           title_3           [author6, author7]        books  author7
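For readers who want to run the steps above end to end, here is a minimal self-contained sketch. The sample frame is an assumption reconstructed from the question's table; only assign, explode and reset_index come from the answer.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption, not the asker's real data).
df = pd.DataFrame({
    'publication_title': ['title 1', 'title 2', 'title 3'],
    'authors': [['author1', 'author2', 'author3'],
                ['author4', 'author5'],
                ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

# Copy the list column under a new name, explode only the copy,
# then renumber the rows so the index is unique again.
result = (
    df.assign(author=df['authors'])
      .explode('author')
      .reset_index(drop=True)
)

print(result)
```

Note that explode only splits real list-like values. If the authors column was loaded from a CSV file and holds strings such as "['author1', 'author2']", the values would need to be parsed into lists first, for example with ast.literal_eval.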
    

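    【Discussion】:

    • Thanks @Erfan, your solution is exactly what I wanted.
    【Solution 2】:

    You can first create a new DataFrame from the author lists (in this answer's example the list column is named author and the title column is named title), then stack it into a single Series: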

    df2 = pd.DataFrame(df['author'].tolist(), index=df.index).stack()
    

    Next we drop the second index level:

    df2.index = df2.index.droplevel(1)
    

    Next we can concatenate along the second axis (axis=1):

    >>> pd.concat([df, df2], axis=1)
         title                       author         type        0
    0  title 1  [author1, author2, author3]  proceedings  author1
    0  title 1  [author1, author2, author3]  proceedings  author2
    0  title 1  [author1, author2, author3]  proceedings  author3
    1  title 2           [author4, author5]  collections  author4
    1  title 2           [author4, author5]  collections  author5
    2  title 3           [author6, author7]        books  author6
    2  title 3           [author6, author7]        books  author7
    

    Or as a one-liner:

    >>> pd.concat([df, pd.DataFrame(df['author'].tolist(), index=df.index).stack().reset_index(level=1, drop=True)], axis=1)
         title                       author         type        0
    0  title 1  [author1, author2, author3]  proceedings  author1
    0  title 1  [author1, author2, author3]  proceedings  author2
    0  title 1  [author1, author2, author3]  proceedings  author3
    1  title 2           [author4, author5]  collections  author4
    1  title 2           [author4, author5]  collections  author5
    2  title 3           [author6, author7]        books  author6
    2  title 3           [author6, author7]        books  author7
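A variant of the same idea, sketched with DataFrame.join instead of pd.concat so that the new column gets a name rather than the default 0. The sample frame uses this answer's column names (title, author, type), and the new column name author_name is a hypothetical choice made for illustration.

```python
import pandas as pd

# Sample data using this answer's column names (an assumption for illustration).
df = pd.DataFrame({
    'title': ['title 1', 'title 2', 'title 3'],
    'author': [['author1', 'author2', 'author3'],
               ['author4', 'author5'],
               ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

stacked = (
    pd.DataFrame(df['author'].tolist(), index=df.index)  # one column per list position
      .stack()                                           # long format: (row, position) -> author
      .dropna()                                          # guard: shorter lists are padded with missing values
      .reset_index(level=1, drop=True)                   # keep only the original row label
      .rename('author_name')                             # hypothetical name for the new column
)

# Joining on the index repeats each original row once per author.
result = df.join(stacked)
print(result)
```

Naming the Series before joining is what lets it become a labelled column. The original index is repeated per author, so reset_index(drop=True) can be appended if unique row labels are needed.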
    

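    【Discussion】:

      【Solution 3】:

      You have already found explode, which means you are almost there! Just merge the exploded data back with the original data; see the code below: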

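      import pandas as pd

      # data
      df = pd.DataFrame({'publication_title':['title_1','title_2','title_3'],
                    'authors':[['author1', 'author2', 'author3'],['author4', 'author5'],['author6', 'author7']],
                    'type':['proceedings','collections','books']})
      
      # explode the lists, rename the exploded column to 'author',
      # then merge with the original frame to bring the 'authors' lists back
      (df.explode(column='authors')
         .rename(columns={'authors':'author'})
         .merge(df))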
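When merge is called without on, it joins on every column name the two frames share, which here is publication_title and type once authors has been renamed. The sketch below spells those keys out; it assumes that the pair (publication_title, type) identifies each original row uniquely, since repeated pairs would multiply rows in the result.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'publication_title': ['title_1', 'title_2', 'title_3'],
    'authors': [['author1', 'author2', 'author3'],
                ['author4', 'author5'],
                ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

# One row per author, with the exploded column renamed to 'author'.
exploded = df.explode('authors').rename(columns={'authors': 'author'})

# Explicit join keys: the columns shared by both frames.
# Assumption: these two columns together are unique per original row.
result = exploded.merge(df, on=['publication_title', 'type'])

print(result)
```

The merged frame carries both the single author value and the original authors list for each row.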
      

      【Discussion】:
