How can I set the number of partitions when using the BigQuery Connector in Apache Spark?

【Question title】: How can I set the number of partitions when using the BigQuery Connector in Apache Spark?
【Posted】: 2018-02-25 22:47:49
【Question】:

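I have been reading the documentation for Google Cloud Dataproc and for Apache Spark in general, but I can't figure out how to manually set the number of partitions when using the BigQuery connector.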

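The RDD is created with newAPIHadoopRDD, and I strongly suspect the partition count can be set through the configuration passed to this function. But I can't work out what the possible values for that configuration are. Neither the Spark documentation nor Google's documentation seems to specify, or link to, the Hadoop job configuration specification.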

Is there a way to set the number of partitions when this RDD is created, or do I just have to repartition it in the next step?

【Comments on the question】:

Tags: apache-spark pyspark google-bigquery google-cloud-dataproc

【Answer 1】:

You need to repartition in your Spark code, for example:

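import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val REPARTITION_VALUE = 24
// conf is the Hadoop Configuration already set up for the BigQuery connector
val rdd = sc.newAPIHadoopRDD(conf, classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
// f is a placeholder for your own transformation functions
rdd.map(x => f(x))
  .repartition(REPARTITION_VALUE)
  .groupBy(_._1)
  .map(tup2 => f(tup2._1, tup2._2.toSeq))
  .repartition(REPARTITION_VALUE)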

and so on...
When you work with RDDs, you have to manage the partitioning yourself.
Solution: the best option is to use a Dataset or DataFrame.
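
To see what repartitioning does to an RDD without needing a BigQuery table, here is a minimal sketch. It assumes a local Spark installation; the object name PartitionSketch, the helper partitionCounts, and the sc.parallelize stand-in for the RDD returned by newAPIHadoopRDD are all made up for illustration and are not part of the connector. It only shows how getNumPartitions, repartition and coalesce behave.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Returns the partition counts (initial, after repartition, after coalesce).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Hypothetical stand-in for the BigQuery-backed RDD: 4 initial partitions.
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    // repartition always shuffles and can raise or lower the partition count.
    val widened = rdd.repartition(24)
    // coalesce avoids a shuffle by default, so it can only lower the count.
    val narrowed = widened.coalesce(8)
    (rdd.getNumPartitions, widened.getNumPartitions, narrowed.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc))
    finally sc.stop()
  }
}
```

Checking rdd.getNumPartitions right after newAPIHadoopRDD in the same way tells you how many partitions the connector produced, which helps decide whether a repartition is worth its shuffle cost.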

【Comments】: