【发布时间】:2020-12-28 05:32:50
【问题描述】:
基本问题
我有一个大约 100 亿行的数据集。我正在寻找在四个不同时间窗口(3 天、7 天、14 天、21 天)内计算滚动/窗口聚合/指标(总和、平均值、最小值、最大值、标准差)的最高效方法。
Spark/AWS EMR 规范
火花版本:2.4.4
ec2 实例类型:r5.24xlarge
核心 ec2 实例数:10
num pyspark 分区:600
概述
我阅读了一堆 SO 帖子,这些帖子讨论了计算滚动统计数据的机制或如何使 Window 函数更快。但是,没有一篇文章以解决我的问题的方式结合了这两个概念。我在下面展示了一些可以满足我需求的选项,但我需要它们在我的真实数据集上运行得更快,所以我正在寻找更快/更好的建议。
我的数据集结构如下,但有大约 100 亿行:
+--------------------------+----+-----+
|date |name|value|
+--------------------------+----+-----+
|2020-12-20 17:45:19.536796|1 |5 |
|2020-12-21 17:45:19.53683 |1 |105 |
|2020-12-22 17:45:19.536846|1 |205 |
|2020-12-23 17:45:19.536861|1 |305 |
|2020-12-24 17:45:19.536875|1 |405 |
|2020-12-25 17:45:19.536891|1 |505 |
|2020-12-26 17:45:19.536906|1 |605 |
|2020-12-20 17:45:19.536796|2 |10 |
|2020-12-21 17:45:19.53683 |2 |110 |
|2020-12-22 17:45:19.536846|2 |210 |
|2020-12-23 17:45:19.536861|2 |310 |
|2020-12-24 17:45:19.536875|2 |410 |
|2020-12-25 17:45:19.536891|2 |510 |
|2020-12-26 17:45:19.536906|2 |610 |
|2020-12-20 17:45:19.536796|3 |15 |
|2020-12-21 17:45:19.53683 |3 |115 |
|2020-12-22 17:45:19.536846|3 |215 |
我需要我的数据集如下所示。注意:显示了 7 天窗口的窗口统计信息,但我还需要另外三个窗口。
+--------------------------+----+-----+----+-----+---+---+------------------+
|date |name|value|sum |mean |min|max|stddev |
+--------------------------+----+-----+----+-----+---+---+------------------+
|2020-12-20 17:45:19.536796|1 |5 |5 |5.0 |5 |5 |NaN |
|2020-12-21 17:45:19.53683 |1 |105 |110 |55.0 |5 |105|70.71067811865476 |
|2020-12-22 17:45:19.536846|1 |205 |315 |105.0|5 |205|100.0 |
|2020-12-23 17:45:19.536861|1 |305 |620 |155.0|5 |305|129.09944487358058|
|2020-12-24 17:45:19.536875|1 |405 |1025|205.0|5 |405|158.11388300841898|
|2020-12-25 17:45:19.536891|1 |505 |1530|255.0|5 |505|187.08286933869707|
|2020-12-26 17:45:19.536906|1 |605 |2135|305.0|5 |605|216.02468994692867|
|2020-12-20 17:45:19.536796|2 |10 |10 |10.0 |10 |10 |NaN |
|2020-12-21 17:45:19.53683 |2 |110 |120 |60.0 |10 |110|70.71067811865476 |
|2020-12-22 17:45:19.536846|2 |210 |330 |110.0|10 |210|100.0 |
|2020-12-23 17:45:19.536861|2 |310 |640 |160.0|10 |310|129.09944487358058|
|2020-12-24 17:45:19.536875|2 |410 |1050|210.0|10 |410|158.11388300841898|
|2020-12-25 17:45:19.536891|2 |510 |1560|260.0|10 |510|187.08286933869707|
|2020-12-26 17:45:19.536906|2 |610 |2170|310.0|10 |610|216.02468994692867|
|2020-12-20 17:45:19.536796|3 |15 |15 |15.0 |15 |15 |NaN |
|2020-12-21 17:45:19.53683 |3 |115 |130 |65.0 |15 |115|70.71067811865476 |
|2020-12-22 17:45:19.536846|3 |215 |345 |115.0|15 |215|100.0 |
详情
为了便于阅读,我将在这些示例中只做一个窗口。我尝试过的事情:
- 基本
Window().over()语法 - 将窗口值转换为数组列并使用高阶函数
- Spark SQL
设置
import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import FloatType
import pandas as pd
import numpy as np
spark = SparkSession.builder.appName('example').getOrCreate()
# create spark dataframe
n = 7
names = [1, 2, 3]
date_list = [datetime.datetime.today() - datetime.timedelta(days=(n-x)) for x in range(n)]
values = [x*100 for x in range(n)]
rows = []
for name in names:
for d, v in zip(date_list, values):
rows.append(
{
"name": name,
"date": d,
"value": v+(5*name)
}
)
df = spark.createDataFrame(data=rows)
# setup window
window_days = 7
window = (
Window
.partitionBy(F.col("name"))
.orderBy(F.col("date").cast("timestamp").cast("long"))
.rangeBetween(-window_days * 60 * 60 * 24 + 1, Window.currentRow)
)
1。基本
这会创建多个窗口规范,如 here 所示,因此以串行方式执行,并且在大型数据集上运行非常缓慢
status_quo = (df
.withColumn("sum",F.sum(F.col("value")).over(window))
.withColumn("mean",F.avg(F.col("value")).over(window))
.withColumn("min",F.min(F.col("value")).over(window))
.withColumn("max",F.max(F.col("value")).over(window))
.withColumn("stddev",F.stddev(F.col("value")).over(window))
)
status_quo.show()
status_quo.explain()
2。数组列 --> 高阶函数
每this answer 似乎创建的窗口规格更少,但aggregate() 函数语法对我来说毫无意义,我不知道如何使用高阶函数编写stddev,并且性能似乎没有在小型测试中要好得多
@F.udf(returnType=FloatType())
def array_stddev(row_value):
"""
temporary function since I don't know how to write higher order standard deviation
"""
return np.std(row_value, dtype=float).tolist()
# 1. collect window into array column
# 2. use higher order (array) functions to calculate aggregations over array (window values)
# Question: how to write standard deviation in aggregate()
hof_example = (
df
.withColumn("value_array", F.collect_list(F.col("value")).over(window))
.withColumn("sum_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
.withColumn("mean_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
.withColumn("max_example", F.array_max(F.col("value_array")))
.withColumn("min_example", F.array_min(F.col("value_array")))
.withColumn("std_example", array_stddev(F.col("value_array")))
)
3。火花 SQL
这似乎是简单测试中最快的。唯一(次要)问题是我的代码库的其余部分使用 DataFrame API。在小型测试中似乎更快,但未在完整数据集上进行测试。
df.createOrReplaceTempView("df")
sql_example = spark.sql(
"""
SELECT
*
, sum(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS sum
, mean(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean
, min(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS min
, max(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS max
, stddev(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS stddev
FROM df"""
)
【问题讨论】:
标签: apache-spark pyspark apache-spark-sql window-functions rolling-computation