First dataset (note: the later parts relate to the updates)
The data is quite limited, possibly because it was simplified, so I will make some assumptions and write everything as generically as possible, so that you can quickly customize it to your needs.
Assumptions:
- You want to group the data by hour window ('hour_code') (so the grouping column is parameterized as group_divide_set_by_column)
- For each hour window ('hour_code'), you want to cluster by location using the K-means algorithm
This lets you investigate the vehicle clusters of each hour window separately, and see which cluster areas are more active and demand attention.
Caveats:
- A location column, although mentioned, is missing, and it is required for the K-means algorithm (I used HostName_key, but it is only a dummy so the code can run; it is not necessarily meaningful).
- In general, the K-means algorithm is meant for spaces with a Euclidean distance (mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means).
There are various sources with k-means Python examples that can help with further customization.
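To illustrate the Euclidean aspect, here is a minimal sketch of clustering on real coordinates, assuming hypothetical longitude/latitude columns that are not present in the dataset above:

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical coordinates; 'longitude'/'latitude' are assumed columns,
# not columns from the original data
coords = pd.DataFrame({
    'longitude': [144.95, 144.96, 145.10, 145.11, 144.70],
    'latitude':  [-37.81, -37.82, -37.85, -37.86, -37.90],
})
model = KMeans(n_clusters=2, random_state=9).fit(coords)
# Each row is assigned the label of its nearest centroid
coords['cluster'] = model.labels_
print(coords)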
Code:
- Let's define a function that, given a dataframe, divides it into groups by a given column, group_divide_set_by_column. This will allow us to group by 'hour_code' and then cluster by location.
def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_colum_types(group_df)
    return lst_df_by_groups
- Function #1 uses another function to convert hours into hour codes, matching your wording: the meaning of the time period, e.g. 1 is before 6 am, 2 - from 6 to 9, 3 - from 9 to 11, 4 - from 11 to 14, and so on.
def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
        1-5   => 1
        6-8   => 2
        9-11  => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(get_hour_code)
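As a quick sanity check of the mapping (a standalone sketch; get_hour_code is nested inside divide_df_by_hours above, so assume here it has been lifted to module level):

for h in (3, 7, 10, 13, 16, 19, 22):
    print(h, '->', get_hour_code(h))
# Prints: 3 -> 1, 7 -> 2, 10 -> 3, 13 -> 4, 16 -> 5, 19 -> 6, 22 -> 7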
- In addition, function #1 uses the set_colum_types function to convert the columns to their matching types:
def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
- A dedicated timeit decorator is used to measure the time of each clustering run, which reduces boilerplate code.
Full code:
import functools
import pandas as pd
from timeit import default_timer as timer
from sklearn.cluster import KMeans

def timeit(func):
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        startTime = timer()
        func(*args, **kwargs)
        elapsedTime = timer() - startTime
        print('function [{}] finished in {} ms'.format(
            func.__name__, int(elapsedTime * 1000)))
    return newfunc

def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)

@timeit
def create_cluster(df, clusters_number):
    # Create K-Means model
    model = KMeans(n_clusters=clusters_number, max_iter=600, random_state=9)
    # Fetch location
    # NOTE: Should be a *real* location, used another column as dummy
    location_df = df[['HostName_key']]
    kmeans = model.fit(location_df)
    # Divide to clusters
    df[f'kmeans_{clusters_number}'] = kmeans.labels_

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
        1-5   => 1
        6-8   => 2
        9-11  => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(get_hour_code)

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_colum_types(group_df)
    return lst_df_by_groups

# Load data
df = pd.read_csv('data.csv')
# Print data
print(df)
# Create clusters
lst_df_by_groups = create_clusters_by_group(df)
# For each hour-code group dataframe
for group_df_name, group_df in lst_df_by_groups.items():
    print(f'Group {group_df_name} dataframe:')
    print(group_df)
Sample output:
Startdtm HostName_key ... timedelta vehiclespersec
0 2020-01-15 08:22:39 0 ... 29400.0 0.326633
1 2020-01-13 08:22:07 2 ... 28981.0 0.354474
2 2020-01-23 07:16:55 3 ... 26149.0 0.197904
3 2020-01-15 07:00:06 4 ... 2783.0 0.301114
4 2020-01-15 08:16:01 1 ... 15915.0 0.366949
5 2020-01-16 08:22:39 2 ... 29400.0 0.326633
6 2020-01-14 08:22:07 2 ... 28981.0 0.354479
7 2020-01-25 07:16:55 4 ... 26149.0 0.197904
8 2020-01-17 07:00:06 1 ... 2783.0 0.301114
9 2020-01-18 08:16:01 1 ... 15915.0 0.366949
[10 rows x 7 columns]
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
Group hour_code_0 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
0 2020-01-15 08:22:39+11:00 0 ... 1 1
1 2020-01-13 08:22:07+11:00 2 ... 0 0
2 2020-01-23 07:16:55+11:00 3 ... 0 2
[3 rows x 10 columns]
Group hour_code_1 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
3 2020-01-15 07:00:06+11:00 4 ... 1 1
4 2020-01-15 08:16:01+11:00 1 ... 0 0
5 2020-01-16 08:22:39+11:00 2 ... 0 2
[3 rows x 10 columns]
Group hour_code_2 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
6 2020-01-14 08:22:07+11:00 2 ... 1 2
7 2020-01-25 07:16:55+11:00 4 ... 0 0
8 2020-01-17 07:00:06+11:00 1 ... 1 1
9 2020-01-18 08:16:01+11:00 1 ... 1 1
[4 rows x 10 columns]
Update: second dataset
So this time things will be different, since the updated goal is to understand how many vehicles pass each place and how fast they travel. Once again, everything is written as generically as possible, to make it easy to adapt.
- First, we divide the dataset into groups by the dataset's location, inferred from the hostname (parameterized as dividing_colum).
def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for g, d in df.groupby(dividing_colum)}
    return df_by_groups
- Now we arrange the data for each hostname (dividing_colum) group:
def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group
2.1. We group by 15-minute intervals; after dividing each hostname area's data into time intervals, we aggregate the vehicle count into the volume column and the mean speed into the average_speed column.
def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed', 'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df
The end result of stage 2 is that each hostname's data is divided into 15-minute time windows, and for each window we know how many vehicles passed and what their average speed was. With that, we have achieved the goal:

Another greedy question is - how do I plug speed into this measurement? i.e., high volume but low speed can also satisfy the demand.

As before, everything can be customized via [TIME_INTERVAL_COLUMN_NAME, DATE_COLUMN_NAME, INTERVAL_WINDOW].
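One possible way to fold speed into the measurement (my own sketch, not part of the code above) is a congestion score that grows with volume and with how far the average speed falls below a free-flow reference; the free_flow_speed parameter here is a hypothetical assumption:

# Hypothetical helper: combine volume and average speed into a single score;
# 'free_flow_speed' is an assumed reference value, not taken from the data
def add_congestion_score(intervaled_df, free_flow_speed=60):
    slowdown = (free_flow_speed - intervaled_df['average_speed']).clip(lower=0)
    intervaled_df['congestion_score'] = intervaled_df['volume'] * (1 + slowdown / free_flow_speed)
    return intervaled_df

With this, an interval with high volume and low speed scores higher than one with the same volume at free-flow speed.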
Full code:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'
DATE_COLUMN_NAME = 'DateTimeStamp'
INTERVAL_WINDOW = '15Min'

def round_time(df):
    # Setting DATE_COLUMN_NAME to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed', 'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for g, d in df.groupby(dividing_colum)}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')
# Print data
print(df)
# Divide by column
df_by_groups = divide_df_by_column(df)
# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)
# For each hostname group dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Sample output:
We now get valuable results, measuring the volume (number of vehicles) and the average speed for each hostname area.
DateTimeStamp VS_ID VS_Summary_Id Hostname Vehicle_speed Lane Length
0 11/01/2019 8:22 1 1 place_uno 65 2 71
1 11/01/2019 8:23 2 1 place_uno 59 1 375
2 11/01/2019 8:25 3 1 place_uno 59 1 389
3 11/01/2019 8:26 4 1 place_duo 59 1 832
4 11/01/2019 8:40 5 1 place_duo 52 1 409
Group Hostname_place_duo dataframe:
average_speed volume
DateTimeStamp
2019-11-01 08:15:00 59 1
2019-11-01 08:30:00 52 1
Group Hostname_place_uno dataframe:
average_speed volume
DateTimeStamp
2019-11-01 08:15:00 61 3
Appendix
A round_time function was also created, which allows rounding by time interval without grouping:
def round_time(df):
    # Setting DATE_COLUMN_NAME to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)
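A minimal usage sketch (assuming the same data2.csv columns as above): after round_time, each row carries its rounded interval in the time_interval column, which can be grouped on directly instead of using pd.Grouper:

df = pd.read_csv('data2.csv')
round_time(df)
# Group on the rounded-time column directly
print(df.groupby(TIME_INTERVAL_COLUMN_NAME)
        .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
        .rename(columns={'Vehicle_speed': 'average_speed', 'Hostname': 'volume'}))

Note that dt.round rounds to the nearest interval boundary, while pd.Grouper bins by interval start, so borderline rows can land in different windows.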
Third update
So this time we want to reduce the number of rows in the results.
- We change how the data is grouped: not only by time interval, but also by day of week. The result lets us investigate the traffic behavior for each day of the week, at 15-minute resolution.
The group_by_interval function now changes to group on the concise interval, and is therefore renamed group_by_concised_interval.
We call the combination [day-in-week, hour-minute] the 'concise interval'; it, too, can be configured, via CONCISE_INTERVAL_FORMAT.
def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding time
    round_time(df)
    # Adding concised interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed', 'Hostname': 'volume'}))
    return intervaled_df
1.1. Using the round_time method, group_by_concised_interval first rounds the time to the given 15-minute interval (configurable via INTERVAL_WINDOW).
1.2. After the time interval has been created for each date, we apply the add_consice_interval_columns function, which, given the rounded timestamps, extracts the concise form.
def add_consice_interval_columns(df):
    # Adding a column for the time interval at day-in-week and hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))
The full code is:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'
TIME_INTERVAL_CONCISE_COLUMN_NAME = 'time_interval_concise'
DATE_COLUMN_NAME = 'DateTimeStamp'
INTERVAL_WINDOW = '15Min'
CONCISE_INTERVAL_FORMAT = '%A %H:%M'

def round_time(df):
    # Setting DATE_COLUMN_NAME to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def add_consice_interval_columns(df):
    # Adding a column for the time interval at day-in-week and hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))

def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding time
    round_time(df)
    # Adding concised interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed', 'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_concised_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for g, d in df.groupby(dividing_colum)}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')
# Print data
print(df)
# Divide by column
df_by_groups = divide_df_by_column(df)
# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)
# For each hostname group dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Output:
Group Hostname_place_duo dataframe:
average_speed volume
time_interval_concise
Friday 08:30 59 1
Friday 08:45 52 1
Group Hostname_place_uno dataframe:
average_speed volume
time_interval_concise
Friday 08:15 65 1
Friday 08:30 59 2
So now we can easily compute how the traffic behaves on each day of the week, across all available time intervals.
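For example (a small follow-up sketch, reusing df_by_intervaled_group from the code above), to surface the busiest concise intervals of each hostname group:

# Show the highest-volume concise intervals for each hostname group
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    busiest = intervaled_group_df.sort_values('volume', ascending=False).head(3)
    print(f'Busiest intervals for {group_df_name}:')
    print(busiest)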