Pandas - SQL case statement equivalent

【Question title】: Pandas - SQL case statement equivalent
【Posted】: 2016-04-19 15:59:16
【Question】:

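Note: I am looking for an efficient way to do this, other than one big join followed by computing the difference between dates.

I have table1 with a country ID and a date (no duplicates of these values). I want to summarize the information in table2 (which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3) so that table1 gets, for each cluster ID, the summed count from table2 over the rows whose table2 date falls within the 30 days before the table1 date.

I believe this is simple in SQL. How do I do this in Pandas?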

select a.date,a.country, 
sum(case when a.date - b.date between  1 and 30 then b.cluster_1 else 0 end) as cluster1,
sum(case when a.date - b.date between  1 and 30 then b.cluster_2 else 0 end) as cluster2,
sum(case when a.date - b.date between  1 and 30 then b.cluster_3 else 0 end) as cluster3

from  table1 a
left outer join table2 b
on a.country=b.country

group by a.date,a.country

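Edit:

Here is a somewhat altered example. Say this is table1, an aggregated dataset with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster1, cluster2, cluster3 (there are actually 100 of them) for the matching country id, as long as the date field in table1 falls within the 30 days before the query date.

For example, the first row of the query dataset has date 2/2/2015 and country 1. In table1 there is only one row within the prior 30 days, and it is for cluster 2 with count 2.

Here is a dump of the two tables in CSV: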

date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3

And table2:

date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2

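【Comments】:

Could you please post sample input datasets (5-7 rows in CSV/dict/JSON/Python code format, as text, so that we can use them when coding)? See "How to create a Minimal, Complete, and Verifiable example".
It's much better now, but you have changed the algorithm: do you want to sum cluster_X from table2, or count from table1? Could you also post the desired output?
Here is an SQLFiddle where you can develop the desired result using SQL and then post the link (containing the desired SQL) here. PS: I used this service to generate SQL from CSV.
The desired output is the bottom result with the table's (cluster_1 ... cluster_3) values. It is the sum of the count variable. I think I actually have an approach that works, but it is slow.
What about overlapping date ranges? For example, if you add [2015-02-02, 1] to table2, what would your result set look like?

Tags: python pandas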

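【Solution 1】:

Edit: oops, I wish I had seen the edit about the join before submitting. No problem, I'll leave this up since it was a fun exercise. Critiques welcome.

This should work if table1 and table2 are stored as "table1.csv" and "table2.csv" in the same directory as the script.

With 30 days I did not get the same result as your example and had to bump the window to 31 days (the two 2015-01-21 rows for country 2 fall exactly 31 days before 2015-02-21), but I think the spirit is here: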

import pandas as pd
import numpy as np

table1_path = './table1.csv'
table2_path = './table2.csv'

# Load both tables and parse the date column as datetimes
table1 = pd.read_csv(table1_path, parse_dates=['date'])
table2 = pd.read_csv(table2_path, parse_dates=['date'])

# Pair every table2 row with every table1 row for the same country;
# the clashing date columns become date_x (table2) and date_y (table1)
joined = pd.merge(table2, table1, how='outer', on=['country'])

# How long before the table2 date the table1 date occurred
joined['datediff'] = joined.date_x - joined.date_y

# Keep only pairs where the table1 date is 1 to 31 days earlier
filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]

# Sum the counts per table2 row and cluster
gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])

summed = pd.DataFrame(gb_date_x['count'].sum())

# One column per cluster; combinations without a match become 0
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)

My test output:

ipdb> table1
                 date  country  cluster  count
0 2014-01-30 00:00:00        1        1      1
1 2015-02-03 00:00:00        1        1      3
2 2015-01-30 00:00:00        1        2      2
3 2015-04-15 00:00:00        1        2      5
4 2015-03-01 00:00:00        2        1      6
5 2015-07-01 00:00:00        2        2      4
6 2015-01-31 00:00:00        2        3      8
7 2015-01-21 00:00:00        2        1      2
8 2015-01-21 00:00:00        2        1      3
ipdb> table2
                 date  country
0 2015-02-01 00:00:00        1
1 2015-04-21 00:00:00        1
2 2015-02-21 00:00:00        2

...

ipdb> result
                     date_x  country  count
cluster                                   1  2  3
0       2015-02-01 00:00:00        1      0  2  0
1       2015-02-21 00:00:00        2      5  0  8
2       2015-04-21 00:00:00        1      0  5  0
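To make the 30-versus-31-day difference easy to reproduce, here is a minimal sketch that embeds the question's CSV dump and mirrors the SQL `case when ... then count else 0 end` with `Series.where`. The helper name `window_sums` and its `max_days` parameter are made up for this illustration and are not part of the answer above.

```python
import io

import pandas as pd

TABLE1_CSV = """date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
"""

TABLE2_CSV = """date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
"""

table1 = pd.read_csv(io.StringIO(TABLE1_CSV), parse_dates=["date"])
table2 = pd.read_csv(io.StringIO(TABLE2_CSV), parse_dates=["date"])


def window_sums(data, queries, max_days=30):
    """Hypothetical helper: per query row, sum `count` per cluster over
    the data rows dated 1..max_days days before the query date."""
    # Left join keeps the query side; the data date becomes `date_t1`.
    joined = queries.merge(data, on="country", how="left", suffixes=("", "_t1"))
    days = (joined["date"] - joined["date_t1"]).dt.days
    in_window = days.between(1, max_days)  # inclusive on both ends
    # SQL: case when <in window> then count else 0 end
    joined["count"] = joined["count"].where(in_window, 0)
    out = joined.pivot_table(index=["date", "country"], columns="cluster",
                             values="count", aggfunc="sum", fill_value=0)
    out.columns = [f"cluster_{int(c)}" for c in out.columns]
    # Caveat: a query row whose country never occurs in `data` has no
    # cluster value and is dropped by pivot_table.
    return out.reset_index()


print(window_sums(table1, table2, max_days=30))
print(window_sums(table1, table2, max_days=31))
```

On this data the two calls differ only in `cluster_1` for the 2015-02-21 row: it is 0 with a 30-day window and 5 with a 31-day window. Because out-of-window rows are zeroed rather than filtered out, every query row whose country appears in table1 stays in the result.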

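【Comments】:

【Solution 2】:

UPDATE:

I don't think it makes much sense to use pandas for data that does not fit into memory. There are tricks for dealing with that, of course, but they are painful.

If you want to process your data efficiently, you should use a proper tool for it.

I would recommend taking a closer look at Apache Spark SQL, where you can process distributed data on multiple cluster nodes, with far more memory, processing power and IO available than with the pandas approach of one computer, one IO subsystem and one CPU.

Alternatively, you can try an RDBMS such as Oracle DB (very expensive, especially the software licences, and the free version is full of limitations), or a free alternative such as PostgreSQL (I can't say much about it, for lack of experience) or MySQL (not as powerful as Oracle; for example, there is no native or clean solution for dynamic pivoting, which you will most probably want to use).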
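As a rough illustration of the database route, the sketch below runs the question's CASE-based query against the sample data. It uses SQLite through Python's standard `sqlite3` module purely because that needs no server; SQLite is not one of the systems named above, and `julianday()` is SQLite-specific date arithmetic that would look different in Oracle, PostgreSQL or MySQL. Table and column names follow the edited question (table1 holds the counts, table2 the query rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 ("date" TEXT, country INTEGER, cluster INTEGER, "count" INTEGER);
CREATE TABLE table2 ("date" TEXT, country INTEGER);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", [
    ("2014-01-30", 1, 1, 1), ("2015-02-03", 1, 1, 3), ("2015-01-30", 1, 2, 2),
    ("2015-04-15", 1, 2, 5), ("2015-03-01", 2, 1, 6), ("2015-07-01", 2, 2, 4),
    ("2015-01-31", 2, 3, 8), ("2015-01-21", 2, 1, 2), ("2015-01-21", 2, 1, 3),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [
    ("2015-02-01", 1), ("2015-04-21", 1), ("2015-02-21", 2),
])

# One SUM(CASE ...) per cluster; julianday() turns ISO dates into day numbers,
# so the subtraction yields the gap in days.
QUERY = """
SELECT q."date", q.country,
       SUM(CASE WHEN d.cluster = 1
                 AND julianday(q."date") - julianday(d."date") BETWEEN 1 AND 30
                THEN d."count" ELSE 0 END) AS cluster_1,
       SUM(CASE WHEN d.cluster = 2
                 AND julianday(q."date") - julianday(d."date") BETWEEN 1 AND 30
                THEN d."count" ELSE 0 END) AS cluster_2,
       SUM(CASE WHEN d.cluster = 3
                 AND julianday(q."date") - julianday(d."date") BETWEEN 1 AND 30
                THEN d."count" ELSE 0 END) AS cluster_3
FROM table2 AS q
LEFT JOIN table1 AS d ON d.country = q.country
GROUP BY q."date", q.country
ORDER BY q."date", q.country
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With around 100 clusters the repeated CASE expressions would normally be generated programmatically, which is the dynamic-pivoting pain point mentioned above.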

OLD answer:

You can do it this way (the explanations are in the code comments):

import numpy as np
import pandas as pd

#
# <setup>
#
dates1 = pd.date_range('2016-03-15','2016-04-15')
dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
dates2 = [pd.to_datetime(d) for d in dates2]

countries = ['c1', 'c2', 'c3']

t1 = pd.DataFrame({
    'date': dates1,
    'country': np.random.choice(countries, len(dates1)),
    'cluster': np.random.randint(1, 4, len(dates1)),
    'count': np.random.randint(1, 10, len(dates1))
})
t2 = pd.DataFrame({'date': np.random.choice(dates2, 10), 'country': np.random.choice(countries, 10)})
#
# </setup>
#

# merge two DFs by `country`
merged = pd.merge(t1.rename(columns={'date':'date1'}), t2, on='country')

# filter dates and drop 'date1' column
merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))\
                & \
                (merged.date >= merged.date1)
               ].drop(['date1'], axis=1)

# group `merged` DF by ['country', 'date', 'cluster'],
# sum up `counts` for overlapping dates, 
# reset the index,
# pivot: convert `cluster` values to columns,
#        taking sum's of `count` as values,
#        NaN's will be replaced with zeroes
# and finally reset the index 
r = merged.groupby(['country', 'date', 'cluster'])\
          .sum()\
          .reset_index()\
          .pivot_table(index=['country','date'],
                       columns='cluster',
                       values='count',
                       aggfunc='sum',
                       fill_value=0)\
          .reset_index()

# rename numeric columns to: 'cluster_N'
rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
r = r.rename(columns=rename_cluster_cols)

Output (for my dataset):

In [124]: r
Out[124]:
cluster country       date  cluster_1  cluster_2  cluster_3
0            c1 2016-04-01          8          0         11
1            c2 2016-04-01          0         34         22
2            c3 2016-05-01          4         18         36

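【Comments】:

Thanks maxU, but this is no different from a large merge. My actual data does not fit in memory like this. I guess I could iterate over it in batches...
If your data does not fit in memory, you could use dask.dataframe. It copies the pandas syntax but gives you chunked algorithms to handle this kind of case. Under the hood it still uses pandas, but it takes care of the chunking, execution and merging for you. It is a newer library, but it is already very useful in the "medium data" realm (data too large to fit in memory, but not yet large enough to need a Hadoop cluster).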
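The batching idea raised in the comments can be sketched in plain pandas by handling one country at a time, so that only that country's pairs are materialized at once. This is a sketch under assumptions: the function name `batched_window_sums` is invented here, the 1 to 30 day window follows the question's SQL, and it presumes table2 is small and that each per-country slice of table1 fits in memory (for truly out-of-core data the slices would have to come from disk or a database instead).

```python
import pandas as pd

table1 = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-01-30", "2015-02-03", "2015-01-30", "2015-04-15", "2015-03-01",
        "2015-07-01", "2015-01-31", "2015-01-21", "2015-01-21"]),
    "country": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "cluster": [1, 1, 2, 2, 1, 2, 3, 1, 1],
    "count":   [1, 3, 2, 5, 6, 4, 8, 2, 3],
})
table2 = pd.DataFrame({
    "date": pd.to_datetime(["2015-02-01", "2015-04-21", "2015-02-21"]),
    "country": [1, 1, 2],
})


def batched_window_sums(data, queries, max_days=30):
    """Hypothetical helper: same windowed sum, one country per batch."""
    pieces = []
    for country, batch in queries.groupby("country"):
        # Only this country's rows take part in the merge.
        subset = data[data["country"] == country]
        pairs = batch.merge(subset, on="country", suffixes=("", "_t1"))
        days = (pairs["date"] - pairs["date_t1"]).dt.days
        hits = pairs[days.between(1, max_days)]
        if not hits.empty:
            pieces.append(
                hits.groupby(["date", "country", "cluster"])["count"].sum())
    if not pieces:
        return pd.DataFrame()
    # Stitch the per-country partial sums together, one column per cluster.
    totals = pd.concat(pieces)
    return totals.unstack("cluster", fill_value=0).sort_index()


result = batched_window_sums(table1, table2)
print(result)
```

Unlike a left join, this variant only reports query rows and clusters that have at least one hit in the window, so missing combinations would need to be reindexed in afterwards if zeros are required.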