Spark cannot compile newAPIHadoopRDD with mongo-hadoop-connector's BSONFileInputFormat

【Question title】: Spark cannot compile newAPIHadoopRDD with mongo-hadoop-connector's BSONFileInputFormat
【Posted】: 2016-08-03 10:43:26
【Question description】:

I am using the mongo-hadoop client (r1.5.2) in Spark to read data from both MongoDB and BSON files, following this link: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage. So far I can read from MongoDB without any problems. The BSON configuration, however, does not even compile. Please help.

My code in Scala:

dataConfig.set("mapred.input.dir", "path.bson")

val documents = sc.newAPIHadoopRDD(
  dataConfig,                    // Hadoop Configuration
  classOf[BSONFileInputFormat],  // InputFormat class
  classOf[Object],               // key class
  classOf[BSONObject])           // value class

Error:

Error:(56, 24) inferred type arguments [Object,org.bson.BSONObject,com.mongodb.hadoop.mapred.BSONFileInputFormat] do not conform to method newAPIHadoopRDD's type parameter bounds [K,V,F <: org.apache.hadoop.mapreduce.InputFormat[K,V]]
    val documents = sc.newAPIHadoopRDD(
                       ^

【Comments】:

Try using BSONFileInputFormat instead of MongoInputFormat. Also, please specify which version of the mongo-hadoop connector you are using.

Tags: mongodb scala hadoop apache-spark

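【Solution 1】:

I found a solution! The problem seems to be caused by the generics of InputFormat.

newAPIHadoopRDD requires the input format to satisfy

F <: org.apache.hadoop.mapreduce.InputFormat[K,V]

Although BSONFileInputFormat extends FileInputFormat[K,V], which in turn extends InputFormat[K,V], it does not fix the K and V type parameters to Object and BSONObject. (In fact, the declaration of BSONFileInputFormat does not mention K and V at all. Can that class really compile? Presumably yes, because Java permits extending a raw type, but the Scala compiler then has no concrete K and V to check against the bound.)

Anyway, the solution is to cast the BSONFileInputFormat class object to a subclass of InputFormat that has K and V defined: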

val documents = sc.newAPIHadoopRDD(
  dataConfig,
  classOf[BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]),
  classOf[Object],
  classOf[BSONObject])

Now it works fine :)
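To see why asSubclass satisfies the compiler, here is a minimal sketch that needs neither Spark nor Hadoop. Format, TextFormat, describe and typedToken are hypothetical stand-ins made up for illustration (they are not Spark, Hadoop or mongo-hadoop names); only Class.asSubclass is the real JDK method. It assumes Scala 2, as used by Spark at the time, and it imitates the missing K,V information with a Class[_] token rather than a real raw Java supertype.

```scala
import scala.util.Try

// Hypothetical stand-ins for Hadoop's InputFormat hierarchy.
abstract class Format[K, V]
class TextFormat extends Format[String, Integer]

object AsSubclassSketch {
  // Same bound shape as newAPIHadoopRDD: F must be a Format[K, V] for the
  // K and V fixed by the other two class arguments.
  def describe[K, V, F <: Format[K, V]](
      fClass: Class[F], kClass: Class[K], vClass: Class[V]): String =
    s"${fClass.getName}[${kClass.getName}, ${vClass.getName}]"

  // asSubclass verifies the subclass relationship at runtime and returns a
  // token statically typed as Class[_ <: Format[String, Integer]].
  def typedToken(untyped: Class[_]): Class[_ <: Format[String, Integer]] =
    untyped.asSubclass(classOf[Format[String, Integer]])

  def main(args: Array[String]): Unit = {
    // A class token whose static type says nothing about K and V.
    val untyped: Class[_] = classOf[TextFormat]

    // describe(untyped, classOf[String], classOf[Integer]) would be rejected
    // by the compiler: F <: Format[String, Integer] cannot be proven.

    val typed = typedToken(untyped)
    println(describe(typed, classOf[String], classOf[Integer]))

    // A class that is not a Format at all fails the runtime check with a
    // ClassCastException, which Try captures here.
    println(Try(typedToken(classOf[String])).isFailure)
  }
}
```

Note that asSubclass is checked at runtime, so the cast only works if the class really extends the target type. The error message in the question names com.mongodb.hadoop.mapred.BSONFileInputFormat, which as far as I can tell is the old-API (org.apache.hadoop.mapred) variant. If that is right, the fix above presumably only succeeds when the new-API com.mongodb.hadoop.BSONFileInputFormat is the one imported.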

【Discussion】: