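[Question title]: Python Pandas groupby and mutate a new column with group-wise calculations à la dplyr
[Posted]: 2017-06-02 06:20:03
[Question]:

I am fairly familiar with R's dplyr for data analysis, and I am trying to convert some code I wrote with dplyr to pandas. My data contains people, identified by an ID column, and the dates on which each person used a particular product. I am trying to find the best way to translate the following R dplyr code into Python pandas code. Basically, I group by the ID column, filter for a specific product, and then add columns that hold (for every row in the group) the minimum (first-use) date and the maximum (last-use) date for that person and product. Finally, I also add a column with the number of days between the last-use date and the first-use date. Here is the data: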

ID  PRODUCT DATE
A   ITEM1   1/30/15
B   ITEM1   2/23/14
A   ITEM2   3/22/15
C   ITEM1   1/23/12
B   ITEM1   4/12/15
A   ITEM3   2/2/14
C   ITEM1   1/1/17
A   ITEM1   2/20/15
A   ITEM1   5/18/15

With dplyr I can do

library(dplyr)
library(lubridate)

df <- df %>% 
mutate(DATE = mdy(DATE)) %>% 
group_by(ID) %>% 
filter(PRODUCT == "ITEM1") %>% 
mutate(FIRST = min(DATE), LAST = max(DATE), DAYS = LAST - FIRST)

which gives me

      ID PRODUCT       DATE      FIRST       LAST      DAYS
  (fctr)  (fctr)     (time)     (time)     (time)    (dfft)
1      A   ITEM1 2015-01-30 2015-01-30 2015-05-18  108 days
2      B   ITEM1 2014-02-23 2014-02-23 2015-04-12  413 days
3      C   ITEM1 2012-01-23 2012-01-23 2017-01-01 1805 days
4      B   ITEM1 2015-04-12 2014-02-23 2015-04-12  413 days
5      C   ITEM1 2017-01-01 2012-01-23 2017-01-01 1805 days
6      A   ITEM1 2015-02-20 2015-01-30 2015-05-18  108 days
7      A   ITEM1 2015-05-18 2015-01-30 2015-05-18  108 days

Data:

df <- structure(list(ID = structure(c(1L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"), 
               PRODUCT = structure(c(1L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 1L), .Label = c("ITEM1", "ITEM2", "ITEM3"), class = "factor"), 
               DATE = structure(c(3L, 6L, 7L, 2L, 8L, 4L, 1L, 5L, 9L), 
                                .Label = c("1/1/17", "1/23/12", "1/30/15", "2/2/14", "2/20/15", "2/23/14", "3/22/15", "4/12/15", "5/18/15"), 
                                class = "factor")), 
               .Names = c("ID", "PRODUCT", "DATE"), class = "data.frame", row.names = c(NA, -9L))

How can I do the same thing in pandas?

[Comments]:

    Tags: python r pandas dplyr


    [Solution 1]:

    Use agg + groupby

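    # Assumes DATE is already datetime64, e.g. df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%y').
    # Named aggregation replaces the dict-renaming form, which newer pandas versions reject.
    stats = df.groupby(['ID', 'PRODUCT']).DATE.agg(FIRST='min', LAST='max', DAYS=lambda s: s.max() - s.min())
    d1 = df.join(stats, on=['ID', 'PRODUCT'])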
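
For readers who want to run the agg + groupby idea end to end, here is a minimal sketch on the sample data from the question. It makes some assumptions that are not part of the answer above: the dates are parsed with `pd.to_datetime`, the `ITEM1` filter from the R code is applied before grouping, and the names `item1` and `stats` are illustrative only. It also relies on named aggregation, which needs pandas 0.25 or later.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": list("ABACBACAA"),
    "PRODUCT": ["ITEM1", "ITEM1", "ITEM2", "ITEM1", "ITEM1",
                "ITEM3", "ITEM1", "ITEM1", "ITEM1"],
    "DATE": ["1/30/15", "2/23/14", "3/22/15", "1/23/12", "4/12/15",
             "2/2/14", "1/1/17", "2/20/15", "5/18/15"],
})

# Parse month/day/two-digit-year strings (the mdy() step in the R code).
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%y")

# Keep only ITEM1 rows (the filter() step in the R code).
item1 = df[df["PRODUCT"] == "ITEM1"]

# One row per ID holding the first and last date (named aggregation).
stats = item1.groupby("ID")["DATE"].agg(FIRST="min", LAST="max")
stats["DAYS"] = stats["LAST"] - stats["FIRST"]

# Broadcast the per-group values back onto every row of the group.
d1 = item1.join(stats, on="ID")
print(d1)
```

Joining the aggregated frame back on `ID` is what turns the one-row-per-group summary into the per-row columns that dplyr's `mutate` produces.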
    

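    [Comments]:

    • What about the "transform" function?
    • @Wen What about the transform function?
    • @piRSquared Something like df['SUM']=df.groupby('A')['B'].transform(sum), but I don't know whether that function can add multiple results at once.
    [Solution 2]:

    With datar, translating your R code to Python is very easy: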

    >>> from datar.all import f, tribble, as_date, group_by, mutate, min, max, filter
    [2021-06-24 13:44:46][datar][WARNING] Builtin name "min" has been overriden by datar.
    [2021-06-24 13:44:46][datar][WARNING] Builtin name "max" has been overriden by datar.
    [2021-06-24 13:44:46][datar][WARNING] Builtin name "filter" has been overriden by datar.
    >>> 
    >>> df = tribble(
    ...     f.ID,  f.PRODUCT, f.DATE,
    ...     "A",   "ITEM1",   "1/30/15",
    ...     "B",   "ITEM1",   "2/23/14",
    ...     "A",   "ITEM2",   "3/22/15",
    ...     "C",   "ITEM1",   "1/23/12",
    ...     "B",   "ITEM1",   "4/12/15",
    ...     "A",   "ITEM3",   "2/2/14",
    ...     "C",   "ITEM1",   "1/1/17",
    ...     "A",   "ITEM1",   "2/20/15",
    ...     "A",   "ITEM1",   "5/18/15",
    ... )
    >>> df
            ID  PRODUCT     DATE
      <object> <object> <object>
    0        A    ITEM1  1/30/15
    1        B    ITEM1  2/23/14
    2        A    ITEM2  3/22/15
    3        C    ITEM1  1/23/12
    4        B    ITEM1  4/12/15
    5        A    ITEM3   2/2/14
    6        C    ITEM1   1/1/17
    7        A    ITEM1  2/20/15
    8        A    ITEM1  5/18/15
    >>> df >> mutate(
    ...     DATE=as_date(f.DATE, "%m/%d/%y")
    ... ) >> group_by(
    ...     f.ID
    ... ) >> filter(
    ...     f.PRODUCT == "ITEM1"
    ... ) >> mutate(
    ...     FIRST=min(f.DATE), 
    ...     LAST=max(f.DATE), 
    ...     DAYS=f.LAST - f.FIRST
    ... )
            ID  PRODUCT        DATE       FIRST        LAST              DAYS
      <object> <object>    <object>    <object>    <object> <timedelta64[ns]>
    0        A    ITEM1  2015-01-30  2015-01-30  2015-05-18          108 days
    1        B    ITEM1  2014-02-23  2014-02-23  2015-04-12          413 days
    2        C    ITEM1  2012-01-23  2012-01-23  2017-01-01         1805 days
    3        B    ITEM1  2015-04-12  2014-02-23  2015-04-12          413 days
    4        C    ITEM1  2017-01-01  2012-01-23  2017-01-01         1805 days
    5        A    ITEM1  2015-02-20  2015-01-30  2015-05-18          108 days
    6        A    ITEM1  2015-05-18  2015-01-30  2015-05-18          108 days
    
    [Groups: ID (n=3)]
    

    Disclaimer: I am the author of the datar package.

    [Comments]:

      [Solution 3]:

      Another option, using the transform function combined with assign

      (df.loc[df.PRODUCT == 'ITEM1']
         .assign(first = lambda df: df.groupby('ID').DATE.transform('min'), 
                 last  = lambda df: df.groupby('ID').DATE.transform('max'), 
                 days  = lambda df: df['last'] - df['first'])
      ) 
        ID PRODUCT       DATE      first       last      days
      0  A   ITEM1 2015-01-30 2015-01-30 2015-05-18  108 days
      1  B   ITEM1 2014-02-23 2014-02-23 2015-04-12  413 days
      3  C   ITEM1 2012-01-23 2012-01-23 2017-01-01 1805 days
      4  B   ITEM1 2015-04-12 2014-02-23 2015-04-12  413 days
      6  C   ITEM1 2017-01-01 2012-01-23 2017-01-01 1805 days
      7  A   ITEM1 2015-02-20 2015-01-30 2015-05-18  108 days
      8  A   ITEM1 2015-05-18 2015-01-30 2015-05-18  108 days
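
The snippet above appears to assume that DATE already holds datetime values. As a hedged, self-contained sketch of the same transform + assign chain, the version below rebuilds the sample data from the question and parses the dates inside the chain; the variable name `result` and the lambda parameter `d` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": list("ABACBACAA"),
    "PRODUCT": ["ITEM1", "ITEM1", "ITEM2", "ITEM1", "ITEM1",
                "ITEM3", "ITEM1", "ITEM1", "ITEM1"],
    "DATE": ["1/30/15", "2/23/14", "3/22/15", "1/23/12", "4/12/15",
             "2/2/14", "1/1/17", "2/20/15", "5/18/15"],
})

result = (
    # Parse the date strings first (mdy() in the R code).
    df.assign(DATE=lambda d: pd.to_datetime(d["DATE"], format="%m/%d/%y"))
      # Keep only ITEM1 rows (filter() in the R code).
      .loc[lambda d: d["PRODUCT"] == "ITEM1"]
      # transform returns one value per row, so each result aligns with the
      # filtered frame; later keyword arguments can use earlier ones.
      .assign(
          first=lambda d: d.groupby("ID")["DATE"].transform("min"),
          last=lambda d: d.groupby("ID")["DATE"].transform("max"),
          days=lambda d: d["last"] - d["first"],
      )
)
print(result)
```

Each `transform` call yields a single column, so several new columns are added here by passing several keyword arguments to one `assign` call.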
      

      [Comments]:
