How does Spark compute the number of partitions? (Answers)

【Question title】: How does Spark compute number of partitions?
【Posted】: 2016-07-22 06:27:36
【Question description】:

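My /accounts/* directory contains 7 files, each smaller than the block size.

I would like to know how Spark computes the number of partitions. The second argument to the textFile method is a hint to Spark about the number of partitions, but is there any logic that decides the actual number based on it?

With an input of 10 I get 15 partitions; with an input of 20 I get 25 partitions.

How is this calculated?

Regards!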

scala> var accounts= sc.textFile("/accounts/*",3)

scala> accounts.toDebugString
15/10/12 02:41:45 INFO mapred.FileInputFormat: Total input paths to  process : 7
res0: String = 
(7) /accounts/* MapPartitionsRDD[1] at textFile at <console>:21 []
 |  /accounts/* HadoopRDD[0] at textFile at <console>:21 []

scala> var accounts= sc.textFile("/accounts/*",10)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String = 
 (15) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
 |   /accounts/* HadoopRDD[2] at textFile at <console>:21 []

scala> var accounts= sc.textFile("/accounts/*",20)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String = 
 (23) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
 |   /accounts/* HadoopRDD[2] at textFile at <console>:21 []

【Comments】:

Why do you want to know?
Good question, Daniel. I am just curious about the behavior.

Tags: apache-spark

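【Solution 1】:

Spark does not compute the number of partitions. It simply passes the hint on to the Hadoop library. What does Hadoop do with it? That depends. Look at the documentation (or, more likely, the code) of the getSplits method of the specific InputFormat.

For example, the code for TextInputFormat lives in FileInputFormat.getSplits. It is fairly complicated and depends on several configuration parameters.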
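To make the idea concrete, here is a rough sketch of the split arithmetic as I understand it from the old-API (org.apache.hadoop.mapred) FileInputFormat.getSplits. It is a simplification, not the real Hadoop code: the names SplitSketch, splitsForFile and countSplits are made up for illustration, a single block size is assumed for all files (Hadoop reads it per file), every file is assumed splittable, and minSize stands in for the configured minimum split size.

```scala
// Simplified, hypothetical sketch of the split-count arithmetic.
// This is NOT the Hadoop source; names and structure are illustrative.
object SplitSketch {
  // The last split of a file may be up to 10% larger than splitSize.
  val SplitSlop = 1.1

  /** Number of splits produced for one splittable file of `length` bytes. */
  def splitsForFile(length: Long, splitSize: Long): Int = {
    if (length == 0L) {
      1 // an empty file still contributes one (empty) split
    } else {
      var remaining = length
      var count = 0
      // Cut full-size splits while more than 1.1 x splitSize remains.
      while (remaining.toDouble / splitSize > SplitSlop) {
        count += 1
        remaining -= splitSize
      }
      // Whatever is left becomes the final split.
      if (remaining != 0L) count += 1
      count
    }
  }

  /** Total splits for a set of files, given the numSplits hint. */
  def countSplits(fileLengths: Seq[Long],
                  blockSize: Long,
                  numSplits: Int,
                  minSize: Long = 1L): Int = {
    require(minSize >= 1L, "minSize must be at least 1")
    val totalSize = fileLengths.sum
    // The hint only sets a target size per split.
    val goalSize = totalSize / math.max(numSplits, 1)
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    fileLengths.map(splitsForFile(_, splitSize)).sum
  }
}
```

For seven hypothetical files of 100 bytes each and a 128 MB block size, SplitSketch.countSplits(Seq.fill(7)(100L), 128L * 1024 * 1024, 3) returns 7, while hints of 10 and 20 return 14 and 21. The hint shrinks the target split size, and each file is then cut independently, so the result is at least one split per file and usually not equal to the hint itself. The exact counts in the question depend on the real file sizes, which are not shown.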

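【Discussion】:

【Solution 2】:

In general, when reading data from HDFS, Spark creates one partition per HDFS block.

More details on how partitions flow through the pipeline, and what you can tune, are available here.

【Discussion】: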