I suggest:
import calendar  # only used to convert month numbers back to month names
import pandas as pd
# Create a new DataFrame counting people for each hour
# note the renaming of 'Name' to 'count'
hourly_data = df.groupby(['month', 'day', 'hour'], as_index=False).count().rename(columns= {'Name': 'count'})
# Create a datetime column
# use pd.to_datetime on a string for the date and hour
# note the insertion of 1/ a year and 2/ minutes, neither of which is in the data
hourly_data['DH'] = pd.to_datetime(hourly_data['day'].map(str) + " " + hourly_data['month'].map(str) + " " + '2021' + " " + hourly_data['hour'].map(str) + ":00")
# Keep only the needed columns
hourly_data = hourly_data[['DH', 'count']]
# Create the missing rows by setting an index with an hourly frequency
# note sorting to avoid some errors
hourly_data = hourly_data.set_index('DH').sort_index().asfreq('h')
# Fill the missing values for created rows with 0
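# assign the result back: inplace=True on a column selection is deprecated
# and does not update the frame under Copy-on-Write
hourly_data['count'] = hourly_data['count'].fillna(0)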
# Now we can revert count to int
hourly_data['count'] = hourly_data['count'].astype(int)
# Create columns from the index
hourly_data['year'], hourly_data['month'], hourly_data['day'], hourly_data['hour'] = hourly_data.index.year, hourly_data.index.month, hourly_data.index.day, hourly_data.index.hour
# and convert month to its literal name
hourly_data['month'] = hourly_data['month'].apply(lambda i:calendar.month_name[i])
# Reorder columns
hourly_data = hourly_data[['year','month', 'day', 'hour', 'count']]
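The gap-filling step above can be seen in isolation. The following is a minimal sketch on a made-up three-row frame (the name `toy` and its values are illustrative assumptions, not part of the data in this answer); it shows that `asfreq('h')` inserts the missing hours as NaN rows, which is why the column has to be filled before it can be cast back to int.

```python
import pandas as pd

# Toy counts for three observed hours, with a gap at 07:00 (made-up values)
toy = pd.DataFrame(
    {"count": [1, 2, 1]},
    index=pd.to_datetime(
        ["2021-10-31 05:00", "2021-10-31 06:00", "2021-10-31 08:00"]
    ),
)

# asfreq('h') adds one row per missing hour; the new rows hold NaN,
# so the column is float until the gaps are filled
filled = toy.asfreq("h")
n_inserted = int(filled["count"].isna().sum())

# Fill the gaps with 0, then the column can go back to int
filled["count"] = filled["count"].fillna(0).astype(int)
print(n_inserted)
print(filled)
```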
Assuming the following data:
import io
s = """
Name month day hour
Albert October 31 5
John October 31 6
Jane October 31 6
Albert October 31 8
Jane October 31 23
Albert November 1 5
John November 1 6
Jane November 1 6
Albert November 1 8
Albert November 1 9
John November 1 10
Jane November 1 23
"""
df = pd.read_csv(io.StringIO(s), sep=r'\s+')
hourly_data is then:
year month day hour count
DH
2021-10-31 05:00:00 2021 October 31 5 1
2021-10-31 06:00:00 2021 October 31 6 2
2021-10-31 07:00:00 2021 October 31 7 0
2021-10-31 08:00:00 2021 October 31 8 1
2021-10-31 09:00:00 2021 October 31 9 0
2021-10-31 10:00:00 2021 October 31 10 0
2021-10-31 11:00:00 2021 October 31 11 0
2021-10-31 12:00:00 2021 October 31 12 0
2021-10-31 13:00:00 2021 October 31 13 0
2021-10-31 14:00:00 2021 October 31 14 0
2021-10-31 15:00:00 2021 October 31 15 0
2021-10-31 16:00:00 2021 October 31 16 0
2021-10-31 17:00:00 2021 October 31 17 0
2021-10-31 18:00:00 2021 October 31 18 0
2021-10-31 19:00:00 2021 October 31 19 0
2021-10-31 20:00:00 2021 October 31 20 0
2021-10-31 21:00:00 2021 October 31 21 0
2021-10-31 22:00:00 2021 October 31 22 0
2021-10-31 23:00:00 2021 October 31 23 1
2021-11-01 00:00:00 2021 November 1 0 0
2021-11-01 01:00:00 2021 November 1 1 0
2021-11-01 02:00:00 2021 November 1 2 0
2021-11-01 03:00:00 2021 November 1 3 0
2021-11-01 04:00:00 2021 November 1 4 0
2021-11-01 05:00:00 2021 November 1 5 1
2021-11-01 06:00:00 2021 November 1 6 2
2021-11-01 07:00:00 2021 November 1 7 0
2021-11-01 08:00:00 2021 November 1 8 1
2021-11-01 09:00:00 2021 November 1 9 1
2021-11-01 10:00:00 2021 November 1 10 1
2021-11-01 11:00:00 2021 November 1 11 0
2021-11-01 12:00:00 2021 November 1 12 0
2021-11-01 13:00:00 2021 November 1 13 0
2021-11-01 14:00:00 2021 November 1 14 0
2021-11-01 15:00:00 2021 November 1 15 0
2021-11-01 16:00:00 2021 November 1 16 0
2021-11-01 17:00:00 2021 November 1 17 0
2021-11-01 18:00:00 2021 November 1 18 0
2021-11-01 19:00:00 2021 November 1 19 0
2021-11-01 20:00:00 2021 November 1 20 0
2021-11-01 21:00:00 2021 November 1 21 0
2021-11-01 22:00:00 2021 November 1 22 0
2021-11-01 23:00:00 2021 November 1 23 1
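As a cross-check on the table above, the same zero-filled hourly counts can probably be obtained with `resample` instead of `groupby` plus `asfreq`. This is a different technique from the one in this answer, sketched here on the first four rows of the sample data only; the year 2021 is the same assumption as above, the names `stamps` and `counts` are illustrative, and `%B` expects English month names.

```python
import io
import pandas as pd

# First four rows of the sample data
s = """Name month day hour
Albert October 31 5
John October 31 6
Jane October 31 6
Albert October 31 8
"""
df = pd.read_csv(io.StringIO(s), sep=r"\s+")

# Build one timestamp per row; the year 2021 is an assumption, as above
stamps = pd.to_datetime(
    df["day"].astype(str) + " " + df["month"] + " 2021 "
    + df["hour"].astype(str).str.zfill(2) + ":00",
    format="%d %B %Y %H:%M",
)

# resample('h') creates every hourly bin between the first and last timestamp,
# and count() returns 0 for bins that contain no rows
counts = df.set_index(stamps)["Name"].resample("h").count()
print(counts)
```

This skips the separate fill and cast steps, at the cost of not keeping the intermediate grouped frame.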
The type of hourly_data, as reported by hourly_data.info():
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43 entries, 2021-10-31 05:00:00 to 2021-11-01 23:00:00
Freq: H
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 43 non-null int64
1 month 43 non-null object
2 day 43 non-null int64
3 hour 43 non-null int64
4 count 43 non-null int64
dtypes: int64(4), object(1)
memory usage: 2.0+ KB
Note above the index type, DatetimeIndex, and its frequency, Freq: H.
The index is kept because it is useful for some downstream processing.
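Two examples of such downstream processing are selecting a whole day by its date string and aggregating to daily totals. The sketch below uses a small made-up stand-in for hourly_data (the name `demo` and its values are assumptions for illustration).

```python
import pandas as pd

# Small stand-in for hourly_data: hourly counts on a DatetimeIndex (made-up values)
idx = pd.date_range("2021-10-31 22:00", periods=4, freq="h")
demo = pd.DataFrame({"count": [0, 1, 0, 2]}, index=idx)

# Partial-string indexing: select every row of one calendar day
nov_first = demo.loc["2021-11-01"]

# Resampling: aggregate the hourly counts into daily totals
daily = demo["count"].resample("D").sum()
print(nov_first)
print(daily)
```

Neither operation would be available directly if the timestamps were kept only as separate year, month, day and hour columns.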