[Posted]: 2020-12-14 18:31:25
[Question]:
I have the following DataFrame:
import pandas as pd
from datetime import datetime
df_dict = {
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'start_time': [
        datetime.strptime('Jun 1 2020 1:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 2:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 3:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 4:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 1:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 2:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 3:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 4:30PM', '%b %d %Y %I:%M%p'),
    ],
    'end_time': [
        datetime.strptime('Jun 1 2020 2:45PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 3:00PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 4:50PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 4:30PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 3:45PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 5:00PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 5:50PM', '%b %d %Y %I:%M%p'),
        datetime.strptime('Jun 1 2020 6:30PM', '%b %d %Y %I:%M%p'),
    ]
}
df = pd.DataFrame.from_dict(df_dict)
#    id          start_time            end_time
# 0   1 2020-06-01 13:30:00 2020-06-01 14:45:00
# 1   1 2020-06-01 14:30:00 2020-06-01 15:00:00
# 2   1 2020-06-01 15:30:00 2020-06-01 16:50:00
# 3   1 2020-06-01 16:30:00 2020-06-01 16:30:00
# 4   2 2020-06-01 13:30:00 2020-06-01 15:45:00
# 5   2 2020-06-01 14:30:00 2020-06-01 17:00:00
# 6   2 2020-06-01 15:30:00 2020-06-01 17:50:00
# 7   2 2020-06-01 16:30:00 2020-06-01 18:30:00
I want to calculate the total number of hours for each id, without double-counting overlapping intervals.
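To make "without double-counting" concrete, here is a minimal pure-Python sketch of the idea applied to the four intervals of id 1: sort by start time, merge anything that overlaps, and sum the merged lengths. The helper name `total_covered` is hypothetical and only for illustration.

```python
from datetime import datetime, timedelta

def total_covered(intervals):
    """Sum the length of the union of (start, end) intervals (hypothetical helper)."""
    total = timedelta(0)
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap: close the previous merged interval and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap (or touch): extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

day = datetime(2020, 6, 1)
id1 = [
    (day.replace(hour=13, minute=30), day.replace(hour=14, minute=45)),
    (day.replace(hour=14, minute=30), day.replace(hour=15, minute=0)),
    (day.replace(hour=15, minute=30), day.replace(hour=16, minute=50)),
    (day.replace(hour=16, minute=30), day.replace(hour=16, minute=30)),
]
# Merged blocks are 13:30-15:00 (1.5 h) and 15:30-16:50 (1 h 20 min).
hours_id1 = total_covered(id1).total_seconds() / 3600
print(round(hours_id1, 6))  # 2.833333
```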
I have the code below, which gives the correct result:
import sqlite3
conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)
query = '''
SELECT id, SUM(CAST((JulianDay(end_time)-JulianDay(start_time))*24 AS real)) AS total_hours
FROM (
    SELECT s1.id,
           s1.start_time,
           MIN(t1.end_time) AS end_time
    FROM df s1
    INNER JOIN df t1 ON s1.start_time <= t1.end_time
        AND s1.id = t1.id
        AND NOT EXISTS(SELECT * FROM df t2
                       WHERE t1.end_time >= t2.start_time AND t1.end_time < t2.end_time AND t2.id = t1.id)
    WHERE NOT EXISTS(SELECT * FROM df s2
                     WHERE s1.start_time > s2.start_time AND s1.start_time <= s2.end_time AND s2.id = t1.id)
    GROUP BY s1.start_time, s1.id
    ORDER BY s1.id, s1.start_time
) x
GROUP BY id
'''
df = pd.read_sql_query(query, conn)
print(df)
#    id  total_hours
# 0   1     2.833333
# 1   2     5.000000
But I would like to know whether there is a better or more elegant way to solve this without using SQL.
[Comments]:
- "Calculate the total hours for each id" => use df.groupby('id'), then compute your aggregation with .apply() or a similar method. Another hint: your SQL already contains GROUP BY s1.start_time, s1.id
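One possible way to follow that hint in pandas, offered as a sketch rather than a definitive answer: sort by id and start time, use a per-id running maximum of `end_time` to detect where a new non-overlapping block begins, then sum the block lengths per id. The names `total_hours_per_id`, `intervals`, and `block` are hypothetical and were chosen here for illustration. The sketch assumes `start_time` and `end_time` are datetime64 columns and that `end_time >= start_time` in every row.

```python
import pandas as pd

def total_hours_per_id(frame):
    """Hours covered per id, counting overlaps once (hypothetical helper)."""
    s = frame.sort_values(['id', 'start_time']).reset_index(drop=True)
    # Latest end_time seen among the earlier rows of the same id.
    prev_end = s.groupby('id')['end_time'].transform(lambda x: x.cummax().shift())
    # A new block starts at the first row of an id, or when a row
    # begins after everything seen so far has already ended.
    new_block = prev_end.isna() | (s['start_time'] > prev_end)
    s['block'] = new_block.cumsum()
    blocks = s.groupby(['id', 'block']).agg(
        start=('start_time', 'min'),
        end=('end_time', 'max'),
    )
    hours = (blocks['end'] - blocks['start']).dt.total_seconds() / 3600
    return hours.groupby(level='id').sum().rename('total_hours').reset_index()

# Same data as in the question, built from ISO strings for brevity.
intervals = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'start_time': pd.to_datetime(
        ['2020-06-01 13:30', '2020-06-01 14:30',
         '2020-06-01 15:30', '2020-06-01 16:30'] * 2),
    'end_time': pd.to_datetime(
        ['2020-06-01 14:45', '2020-06-01 15:00',
         '2020-06-01 16:50', '2020-06-01 16:30',
         '2020-06-01 15:45', '2020-06-01 17:00',
         '2020-06-01 17:50', '2020-06-01 18:30']),
})

result = total_hours_per_id(intervals)
print(result)  # id 1 -> about 2.83 hours, id 2 -> 5.0 hours
```

The running maximum matters because an earlier interval can outlast later ones: comparing each row only with the immediately preceding `end_time` would split a block too early when a long interval contains a shorter one.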