Python Pandas: Group a column together by duplicates and join strings within a corresponding column

【Question title】: Python Pandas: Group a column together by duplicates and join strings within a corresponding column
【Posted】: 2021-12-01 09:20:51
【Question】:

I want to group (groupby) on PO Header Id and then join the strings (XML) for all rows that share the same PO Header ID. I came across a few code examples, but ran into errors with them.

Ultimately, the Combined_XML column is what I am trying to achieve.

PO Header ID   XML   Combined_XML
123           <test1> 
123           <test2> 
456           <test3> 
567           <test4> 
567           <test5> 
567           <test6> 

Desired output
PO Header ID   Combined_XML
123            <test1><test2>
456            <test3>
567            <test4><test5><test6>

Here is what I have tried so far:

    combineXML = df.groupby(['PO Header Id']).agg(['Combined_XML']).apply(list).reset_index()
    print(combineXML)

This throws the error KeyError: 'PO Header Id'. There are no spaces in the column name, so I am not sure why it is not working. I also tried:

    df = df.groupby(['PO Header Id','XML'])['Combined_XML'].apply(''.join).reset_index()

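【Question comments】:

Let me know whether my answer works for you or needs any fine-tuning. Thanks!
Hi Scott, please let me know whether my answer works for you or needs any fine-tuning. Thanks!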

Tags: python pandas csv pandas-groupby data-science

【Solution 1】:

You can use GroupBy.agg() with named aggregation, as follows:

combineXML = df.groupby('PO Header ID', as_index=False).agg(Combined_XML=('XML', ''.join))

If your column name is actually PO Header Id, use the following instead:

combineXML = df.groupby('PO Header Id', as_index=False).agg(Combined_XML=('XML', ''.join))

Result:

print(combineXML)


   PO Header ID           Combined_XML
0           123         <test1><test2>
1           456                <test3>
2           567  <test4><test5><test6>
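
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, which is an assumption; the real data presumably comes from a file. Named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed, not the asker's real file)
df = pd.DataFrame({
    "PO Header ID": [123, 123, 456, 567, 567, 567],
    "XML": ["<test1>", "<test2>", "<test3>", "<test4>", "<test5>", "<test6>"],
})

# Named aggregation: the keyword is the output column name,
# the tuple is (source column, function applied to each group's values)
combineXML = df.groupby("PO Header ID", as_index=False).agg(
    Combined_XML=("XML", "".join)
)
print(combineXML)
```

The keyword argument name (Combined_XML) becomes the label of the output column, so no rename step is needed afterwards.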

【Comments】:

【Solution 2】:

You can try it like this:

df.groupby(['PO Header ID'])['XML'].apply(''.join).reset_index()
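
A small sketch of this approach on the question's sample data, which is rebuilt by hand here as an assumption. Passing name= to reset_index is an addition that is not part of the original answer; it is one way to label the joined column Combined_XML instead of leaving it as XML.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed)
df = pd.DataFrame({
    "PO Header ID": [123, 123, 456, 567, 567, 567],
    "XML": ["<test1>", "<test2>", "<test3>", "<test4>", "<test5>", "<test6>"],
})

# Select the XML column per group, join its strings, then turn the
# resulting Series back into a DataFrame with a chosen column label
combined = (
    df.groupby(["PO Header ID"])["XML"]
    .apply("".join)
    .reset_index(name="Combined_XML")
)
print(combined)
```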

【Comments】:

File ".\Prepare-Data.py", line 58, in groupPOsAndMergeXML
    combineXML = df.groupby(['PO Header ID'])['XML'].apply(''.join).reset_index()
File "C:\Program Files\Python37\lib\site-packages\pandas\core\frame.py", line 7636, in groupby
    dropna=dropna,
File "C:\Program Files\Python37\lib\site-packages\pandas\core\groupby\groupby.py", line 896, in __init__
    dropna=self.dropna,
File "C:\Program Files\Python37\lib\site-packages\pandas\core\groupby\grouper.py", line 860, in get_grouper
    raise KeyError(gpr)
KeyError: 'PO Header ID'
@Scott It seems the error occurs because the column label in your data is actually 'PO Header Id' rather than the 'PO Header ID' you posted as sample data (Id rather than ID). Once that is fixed, this solution should work, except that the column label will not be Combined_XML as in your expected result. If you want that exact column label, see my answer.
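
Since both spellings raised a KeyError at different points in this thread, it may help to look at the exact labels pandas sees before grouping. The sketch below is only an illustration: the stray whitespace in the header is a hypothetical cause that has not been confirmed for the asker's file, and the frame is made up for the demonstration.

```python
import pandas as pd

# Hypothetical frame whose header differs from what the code expects
# (surrounding whitespace and "Id" capitalization are assumptions)
df = pd.DataFrame({
    " PO Header Id ": [123, 123, 456],
    "XML": ["<test1>", "<test2>", "<test3>"],
})

# Printing the labels as a list shows quotes around each name,
# which exposes hidden whitespace and the exact capitalization
print(df.columns.tolist())

# Normalize: strip surrounding whitespace, then settle on one spelling
df.columns = df.columns.str.strip()
df = df.rename(columns={"PO Header Id": "PO Header ID"})

combined = df.groupby("PO Header ID", as_index=False).agg(
    Combined_XML=("XML", "".join)
)
print(combined)
```

A KeyError raised by groupby means the label passed in does not match any column exactly, so comparing it against df.columns.tolist() is usually the quickest check.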