Grouping strings in a pandas DataFrame: answers

【Question title】: Grouping strings on the pandas dataframe
【Posted】: 2020-05-22 12:28:56
【Question】:

I have the following DataFrame, which contains information from weather stations:

      import pandas as pd
      import numpy as np

      df = pd.DataFrame({'Code Weather Station': ['1024', '1024', '1024', '2089', 
                                                  '2089', '2089', '8974'], 
                         'Instrumentation': ['Pluviometer-Analog', 'speedometer', 'incidence-sun',
                                             'speedometer', 'Pluviometer', 'speedometer', 
                                             'Pluviometer']})

I want to group the instruments by weather station.

I tried using groupby together with the sum() function, as follows:

      df_New = df.groupby('Code Weather Station', as_index=False)['Instrumentation'].sum()

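The result is almost what I expected. However, sum() concatenates the strings directly, and I would like a space between the instruments.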

      print(df_New)

      Code Weather Station  Instrumentation
            1024             Pluviometer-Analogspeedometerincidence-sun
            2089             speedometerPluviometerspeedometer
            8974             Pluviometer

I would like the output to be:

      Code Weather Station  Instrumentation
            1024             Pluviometer-Analog speedometer incidence-sun
            2089             speedometer Pluviometer speedometer
            8974             Pluviometer

Thanks.

【Question comments】:

Try df.groupby('Code Weather Station')['Instrumentation'].apply(lambda x: ' '.join(x))
Does this answer your question? Concatenate strings from several rows using Pandas groupby
I tried: df_New = df.groupby('Code Weather Station', as_index=False)['Instrumentation'].apply(lambda x: ' '.join(x)) . But the result is not a DataFrame. Do you have any suggestions?
I also tried: df_New = pd.DataFrame(df.groupby('Code Weather Station')['Instrumentation'].apply(lambda x: ' '.join(x))) . But then indexing by column name is awkward.

Tags: python string pandas group-by

【Solution 1】:

Oh! Just add a reset_index(), like this:

df.groupby('Code Weather Station')['Instrumentation'].apply(lambda x: ' '.join(x)).reset_index()
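For reference, here is a self-contained sketch of the one-liner above applied to the question's data. The variable name df_new is only illustrative, and ' '.join is passed directly in place of the equivalent lambda. The grouped result is a Series indexed by station code; reset_index() turns that index back into an ordinary column, so the result is a DataFrame that can be indexed by column name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Code Weather Station': ['1024', '1024', '1024', '2089',
                             '2089', '2089', '8974'],
    'Instrumentation': ['Pluviometer-Analog', 'speedometer', 'incidence-sun',
                        'speedometer', 'Pluviometer', 'speedometer',
                        'Pluviometer'],
})

# Join each station's instruments with a single space, then move the
# station code out of the index and back into a regular column.
df_new = (
    df.groupby('Code Weather Station')['Instrumentation']
      .apply(' '.join)
      .reset_index()
)

print(df_new)
```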

【Comments】:

【Solution 2】:

You should avoid apply because it is inefficient. You can try this instead:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Code Weather Station': ['1024', '1024', '1024', '2089', 
                                          '2089', '2089', '8974'], 
                 'Instrumentation': ['Pluviometer-Analog', 'speedometer', 'incidence-sun',
                                     'speedometer', 'Pluviometer', 'speedometer', 
                                     'Pluviometer']})

def process(x):
    return " ".join(x)

df_new = df.groupby('Code Weather Station').agg({
        'Instrumentation': [('Instrumentation', process)]
    })
df_new.columns = df_new.columns.droplevel()
df_new
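As a possible simplification, the same aggregation can be sketched with pandas named aggregation, which avoids the droplevel() step. This is an alternative, not what the answer itself proposes. The name df_named is illustrative, and reset_index() is added here so that the station code ends up as a regular column, as the question asked.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Code Weather Station': ['1024', '1024', '1024', '2089',
                             '2089', '2089', '8974'],
    'Instrumentation': ['Pluviometer-Analog', 'speedometer', 'incidence-sun',
                        'speedometer', 'Pluviometer', 'speedometer',
                        'Pluviometer'],
})

# Named aggregation: output column name = (input column, function).
# This yields flat column labels, so no droplevel() is needed.
df_named = (
    df.groupby('Code Weather Station')
      .agg(Instrumentation=('Instrumentation', ' '.join))
      .reset_index()
)

print(df_named)
```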

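【Comments】:

.agg is more efficient when you use a built-in, Cython-optimized function, AFAIK. How would a custom function be more efficient? Do you have a link you can share?
Yes. It is always recommended to avoid apply, since it is just a Python for loop; use map instead, which is a vectorized implementation and much faster than apply. agg uses map internally (you can check the pandas GitHub). In some cases apply cannot be avoided (for example, when processing several columns at once), but for a single column there is no point in using apply. Hope this helps.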
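The efficiency claims in this thread are easiest to settle by measuring on your own data and pandas version. The following is a minimal timing sketch, not a benchmark from the original posts. The synthetic data, the row count, and the helper names with_apply and with_agg are all assumptions made for illustration. It first confirms that both approaches produce the same strings and then times each one.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not from the question).
rng = np.random.default_rng(0)
n_rows = 20_000
df = pd.DataFrame({
    'Code Weather Station': rng.integers(0, 500, size=n_rows).astype(str),
    'Instrumentation': rng.choice(
        ['Pluviometer', 'speedometer', 'incidence-sun'], size=n_rows),
})

grouped = df.groupby('Code Weather Station')['Instrumentation']


def with_apply():
    # Approach from Solution 1: apply a Python callable per group.
    return grouped.apply(' '.join)


def with_agg():
    # Approach from Solution 2: aggregate with the same Python callable.
    return grouped.agg(' '.join)


# Both approaches should yield identical joined strings per station.
same_result = with_apply().tolist() == with_agg().tolist()
print('same result:', same_result)

# Time each approach; the numbers depend on machine and pandas version.
t_apply = timeit.timeit(with_apply, number=5)
t_agg = timeit.timeit(with_agg, number=5)
print(f'apply: {t_apply:.4f} s, agg: {t_agg:.4f} s (5 runs each)')
```

Because ' '.join is an ordinary Python callable in both cases, any difference comes from how pandas dispatches it, so the measured gap may be small.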