【Question title】: Pandas equivalent to SQL window functions
【Posted】: 2017-05-25 04:51:31
【Question】:

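Is there an idiomatic equivalent of SQL's window functions in Pandas? For example, what is the most compact way to write the equivalent of this query in Pandas?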

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Or of this one?

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

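【Comments】:

  • Can you provide a sample dataset and the desired output?
  • @JackManey, AFAIK it's not quite the same, at least not for the SQL shown here...
  • @JackManey The window functions in the Pandas documentation cover only a subset of what SQL window functions can do. Basically, what I want is to compute aggregates without reducing the DataFrame.

Tags: python sql pandas window-functions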


【Answer 1】:

For the first SQL query:

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Pandas:

df.assign(national_population=df.state_population.sum()).sort_values('state_name')

For the second SQL query:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Pandas:

df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
  .sort_values('state_name')
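
The reason `transform('sum')` fits here rather than a plain `.sum()` is that it returns one value per original row instead of one per group, which is what `OVER(PARTITION BY ...)` does in SQL. The following is a minimal sketch of that difference, assuming the same sample data as the demo in this answer; the variable names `reduced`, `windowed` and `joined` are illustrative only.

```python
import pandas as pd

# Sample data copied from the demo in this answer.
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

# A plain aggregation collapses the frame to one row per region (like GROUP BY).
reduced = df.groupby("region")["state_population"].sum()

# transform('sum') computes the same per-region totals but returns a Series
# aligned with df's index, so it can be assigned as a column directly.
windowed = df.groupby("region")["state_population"].transform("sum")

# The longer route: aggregate first, then join the totals back on the key.
joined = df.join(reduced.rename("regional_population"), on="region")

print(len(reduced), len(windowed))  # groups vs. rows
```

Both routes should give the same column; `transform` simply skips the explicit join.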

Demo:

In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55

national_population:

In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665

regional_population:

In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
     ...:   .sort_values('state_name')
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55
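
To reproduce the demo outside an IPython session, the sketch below rebuilds the sample frame and computes both window-style columns in a single `assign` call. Combining the two columns into one call is a convenience added here, not something the answer itself shows, and `result` is an illustrative name.

```python
import pandas as pd

# Sample data copied from the demo above.
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

result = (
    df.assign(
        # SUM(state_population) OVER (): a scalar broadcast to every row.
        national_population=df["state_population"].sum(),
        # SUM(state_population) OVER (PARTITION BY region): per-group total per row.
        regional_population=df.groupby("region")["state_population"].transform("sum"),
    )
    .sort_values("state_name")  # ORDER BY state_name
)

print(result)
```

Note that `assign` returns a new DataFrame and leaves `df` itself unchanged.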
