How do I output bucketed parquet files in Spark? Answers

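【Question title】: How do I output bucketed parquet files in Spark?
【Posted】: 2019-10-29 07:09:21
【Question】:

Background

I have 8k parquet files representing a table that I want to bucket by a particular column, creating a new set of 8k parquet files. I want to do this so that joins from other data sets on the bucketed column won't require a re-shuffle. The documentation I'm working from is here: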

https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#bucketing-sorting-and-partitioning
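To make the motivation concrete, here is a minimal sketch of the idea behind bucketing: a row's bucket is derived from the bucket column's value alone, so two tables bucketed the same way place matching keys in the same bucket number. This is only an illustration. `BucketSketch`, `bucketFor`, and the sample key are hypothetical, and `String.hashCode` stands in for Spark's own hash function, so the bucket numbers will not match what Spark actually writes.

```java
// Illustrative only: Spark uses its own hash function for bucketing, so the
// bucket numbers computed here will not match the files Spark produces.
final class BucketSketch {

    // Maps a key to a bucket index in [0, numBuckets) with a non-negative modulus.
    static int bucketFor(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        int numBuckets = 8000;       // bucket count from the question
        String key = "customer-42";  // hypothetical value of myBucketCol

        // Each table computes the bucket from the key alone, so the same key
        // lands in the same bucket number on both sides of a join.
        int bucketInTableA = bucketFor(key, numBuckets);
        int bucketInTableB = bucketFor(key, numBuckets);
        System.out.println(bucketInTableA == bucketInTableB); // prints "true"
    }
}
```

Because the assignment depends only on the key and the bucket count, both sides of the join have to use the same column and the same number of buckets for this to hold.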

Question

What's the easiest way to output parquet files that are bucketed? I want to do something like this:

df.write()
    .bucketBy(8000, "myBucketCol")
    .sortBy("myBucketCol")
    .format("parquet")
    .save("path/to/outputDir");

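But according to the documentation linked above:

Bucketing and sorting are applicable only to persistent tables

I'm guessing I need to use saveAsTable instead of save. However, saveAsTable doesn't take a path. Do I need to create a table before calling saveAsTable? Is it in that table creation statement that I declare where the parquet files should be written? If so, how do I do that?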

【Comments】:

Tags: apache-spark apache-spark-sql parquet

【Answer 1】:

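// Requires: import org.apache.spark.sql.SaveMode;
// Recreate the table with an explicit location and the desired bucketing.
spark.sql("drop table if exists myTable");
spark.sql("create table myTable ("
    + "myBucketCol string, otherCol string ) "
    + "using parquet location '" + outputPath + "' "
    + "clustered by (myBucketCol) sorted by (myBucketCol) into 8000 buckets"
);
// Append into that table, repeating the same bucketing and sort columns.
enlDf.write()
    .bucketBy(8000, "myBucketCol")
    .sortBy("myBucketCol")
    .format("parquet")
    .mode(SaveMode.Append)
    .saveAsTable("myTable");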
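The concatenated SQL string in this answer is easy to get wrong (a missing space or quote breaks the statement), so one option is to assemble it in a small helper. The sketch below is not a Spark API: `BucketedTableDdl` and `createTable` are hypothetical names, `"path/to/outputDir"` stands in for `outputPath`, and the inputs are assumed to be trusted because nothing is escaped.

```java
import java.util.Locale;

// Hypothetical helper (not part of Spark): builds the CREATE TABLE statement
// from this answer out of its parts. Inputs are not escaped, so they are
// assumed to come from trusted code rather than user input.
final class BucketedTableDdl {

    static String createTable(String table, String columns, String location,
                              String bucketCol, int numBuckets) {
        // Locale.ROOT keeps the bucket count formatted with plain ASCII digits.
        return String.format(Locale.ROOT,
            "create table %s (%s) using parquet location '%s' "
                + "clustered by (%s) sorted by (%s) into %d buckets",
            table, columns, location, bucketCol, bucketCol, numBuckets);
    }

    public static void main(String[] args) {
        // "path/to/outputDir" is a placeholder for outputPath in the answer.
        System.out.println(createTable(
            "myTable", "myBucketCol string, otherCol string",
            "path/to/outputDir", "myBucketCol", 8000));
    }
}
```

The resulting string could then be passed to `spark.sql(...)` in place of the hand-concatenated literal.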

【Comments】:

This looks odd. Does bucketing really have to go through a SQL string?

【Answer 2】:

You can use the path option:

df.write()
    .bucketBy(8000, "myBucketCol")
    .sortBy("myBucketCol")
    .format("parquet")
    .option("path", "path/to/outputDir")
    .saveAsTable("whatever");

【Comments】: