Sparksql 使用 where 子句获取示例行答案

【问题标题】：Sparksql get sample rows with where clauseSparksql 使用 where 子句获取示例行
【发布时间】：2021-10-12 00:12:18
【问题描述】：

是否可以使用 where 子句获取样本 n 行查询？

我尝试使用下面的 tablesample 函数，但最终只获得了第一个分区“2021-09-14”中的记录。 P

select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample（100行）

【问题讨论】：

标签： apache-spark apache-spark-sql

【解决方案1】：

您可以利用单调递增 ID - here 或 Rand 生成一个附加列，该列可用于对您的数据集进行排序以生成必要的采样字段

这两个功能可以结合使用，也可以单独使用

您还可以使用LIMIT 子句对您所需的N 记录进行采样

注意 - orderBy 将是一项昂贵的操作

数据准备

input_str = """
1   2/12/2019   114 2
2   3/5/2019    116 1
3   3/3/2019    120 6
4   3/4/2019    321 10
6   6/5/2019    116 1
7   6/3/2019    116 1
8   10/1/2019   120 3
9   10/1/2019   120 3
10  10/1/2020   120 3
11  10/1/2020   120 3
12  10/1/2020   120 3
13  10/1/2022   120 3
14  10/1/2021   120 3
15  10/6/2019   120 3
""".split()

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id  ship_date   customer_id quantity".split()))
            
n = len(input_values)

input_list = [tuple(input_values[i:i+4]) for i in range(0,n,4)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))

sparkDF.show()

+-----------+----------+-----------+--------+
|shipment_id| ship_date|customer_id|quantity|
+-----------+----------+-----------+--------+
|          1|2019-12-02|        114|       2|
|          2|2019-05-03|        116|       1|
|          3|2019-03-03|        120|       6|
|          4|2019-04-03|        321|      10|
|          6|2019-05-06|        116|       1|
|          7|2019-03-06|        116|       1|
|          8|2019-01-10|        120|       3|
|          9|2019-01-10|        120|       3|
|         10|2020-01-10|        120|       3|
|         11|2020-01-10|        120|       3|
|         12|2020-01-10|        120|       3|
|         13|2022-01-10|        120|       3|
|         14|2021-01-10|        120|       3|
|         15|2019-06-10|        120|       3|
+-----------+----------+-----------+--------+

Order By - 单调递增的 ID 和 Rand

sparkDF.createOrReplaceTempView("shipment_table")

sql.sql("""
SELECT
 *
FROM (
    SELECT 
        *
        ,monotonically_increasing_id() as increasing_id
        ,RAND(10) as random_order
    FROM shipment_table
    WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
    ORDER BY monotonically_increasing_id() DESC ,RAND(10) DESC
    LIMIT 5
)
""").show()

+-----------+----------+-----------+--------+-------------+-------------------+
|shipment_id| ship_date|customer_id|quantity|increasing_id|       random_order|
+-----------+----------+-----------+--------+-------------+-------------------+
|         15|2019-06-10|        120|       3|   8589934593|0.11682250456449328|
|          9|2019-01-10|        120|       3|   8589934592|0.03422639313807285|
|          8|2019-01-10|        120|       3|            6| 0.8078688178371882|
|          7|2019-03-06|        116|       1|            5|0.36664222617947817|
|          6|2019-05-06|        116|       1|            4|    0.2093704977577|
+-----------+----------+-----------+--------+-------------+-------------------+

【讨论】：

可以不使用order by吗？我可以用这个spark.apache.org/docs/latest/…
如果我们按 rand() 排序，为什么需要 monotonically_increasing_id()？
添加一个额外的字段来统一采样数据集

【解决方案2】：

如果您使用的是Dataset，则可以使用documenation 中所述的内置功能：

sample(withReplacement: Boolean, fraction: Double): Dataset[T]

Returns a new Dataset by sampling a fraction of rows, using a random seed.

withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].

Since

    1.6.0
Note

    This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.

要使用它，您需要根据您正在寻找的任何标准过滤您的数据集，然后对结果进行采样。如果您需要准确行数而不是分数，您可以在调用sample 之后使用limit(n)，其中n 是要返回的行数。

【讨论】：

我用的是SparkSql，等效函数是spark.apache.org/docs/latest/…