Store Spark df to HBase

【Question title】: Store Spark df to HBase
【Posted】: 2018-05-10 11:53:24
【Question description】:

I am trying to store a Spark Dataset to HBase efficiently. We first tried something like this in Java, using a lambda:

sparkDF.foreach(l->this.hBaseConnector.persistMappingToHBase(l,"name_of_hBaseTable") );

The persistMappingToHBase function uses the HBase Java client (Put) to store the data in HBase.

I get an exception: Exception in thread "main"  org.apache.spark.SparkException: Task not serializable
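The exception is consistent with the lambda capturing `this`: Spark ships closures to executors with Java serialization, so everything the lambda references must be serializable. The following is a minimal JDK-only sketch of that mechanism, not Spark code. `SerializableAction`, `Connector`, and `Driver` are hypothetical stand-ins for Spark's function interface, the `HBaseConnector` from the question, and the class that owns it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ClosureSerializationDemo {
    // Hypothetical stand-in for Spark's serializable function interfaces.
    interface SerializableAction<T> extends Serializable {
        void call(T row);
    }

    // Hypothetical stand-in for HBaseConnector: holds live resources, not Serializable.
    static class Connector {
        void persist(String row, String table) {
            // A real connector would issue a Put here.
        }
    }

    // Mirrors the failing pattern: the lambda reads this.connector, so it captures `this`.
    static class Driver {
        private final Connector connector = new Connector();

        SerializableAction<String> capturingAction() {
            return row -> this.connector.persist(row, "name_of_hBaseTable");
        }
    }

    // Mirrors the pattern that avoids the capture: the connector is created inside the lambda.
    static SerializableAction<String> selfContainedAction() {
        return row -> new Connector().persist(row, "name_of_hBaseTable");
    }

    // Tries to Java-serialize the closure, as a cluster framework would before shipping it.
    static boolean isSerializable(Object closure) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(closure);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("captures this:  " + isSerializable(new Driver().capturingAction()));
        System.out.println("self-contained: " + isSerializable(selfContainedAction()));
    }
}
```

The first closure drags the non-serializable `Driver` (and its connector) along, while the second captures nothing. Creating the connector on the executor side, inside the function, sidesteps the problem.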

Then we tried this:

sparkDF.foreachPartition(partition -> {
    final HBaseConnector hBaseConnector = new HBaseConnector();
    hBaseConnector.connect(hbaseProps);
    while (partition.hasNext()) {
        hBaseConnector.persistMappingToHBase(partition.next());
    }
    hBaseConnector.closeConnection();
});

This seems to work, but it also seems very inefficient, I guess because we create and close a connection for every row of the data frame.

What is a good way to store a Spark DS to HBase? I have seen a connector developed by IBM, but I have never used it.
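Note that the foreachPartition snippet opens one connection per partition, not per row, so the remaining cost is more likely the row-by-row writes. A common remedy is to buffer rows and send them in batches. The sketch below shows only that control flow with JDK classes. `CountingConnector`, `Stats`, and `putAll` are hypothetical stand-ins. With the real HBase client the batch call would be something like `Table.put(List<Put>)` or a `BufferedMutator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PartitionWriteDemo {
    // Collects counters so the effect of batching is visible.
    static class Stats {
        int connectionsOpened;
        int batchesSent;
        int rowsWritten;
    }

    // Hypothetical stand-in for an HBase connection; it only counts calls.
    static class CountingConnector implements AutoCloseable {
        private final Stats stats;

        CountingConnector(Stats stats) {
            this.stats = stats;
            stats.connectionsOpened++;
        }

        // Stand-in for a batched write of several rows in one call.
        void putAll(List<String> rows) {
            stats.batchesSent++;
            stats.rowsWritten += rows.size();
        }

        @Override
        public void close() {
            // A real connector would release its resources here.
        }
    }

    // Same shape as a foreachPartition body: one connector per partition,
    // rows buffered and flushed batchSize at a time.
    static void writePartition(Iterator<String> partition, int batchSize, Stats stats) {
        try (CountingConnector connector = new CountingConnector(stats)) {
            List<String> buffer = new ArrayList<>(batchSize);
            while (partition.hasNext()) {
                buffer.add(partition.next());
                if (buffer.size() == batchSize) {
                    connector.putAll(buffer);
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                connector.putAll(buffer); // flush the final partial batch
            }
        }
    }

    public static void main(String[] args) {
        Stats stats = new Stats();
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("r1", "r2", "r3", "r4", "r5"),
            Arrays.asList("r6", "r7", "r8"));
        for (List<String> partition : partitions) {
            writePartition(partition.iterator(), 2, stats);
        }
        System.out.println(stats.connectionsOpened + " connections, "
            + stats.batchesSent + " batches, " + stats.rowsWritten + " rows");
    }
}
```

The try-with-resources block also closes the connector when a write throws, which the original loop would not do.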

【Comments】:

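At Splice Machine we have a very fast connector that writes data frames natively to HBase. I believe we are still the fastest HBase writer... youtube.com/watch?v=cgIz-cjehJ0&t=3s
Thanks, I will give it a try if time permits.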

Tags: java sql apache-spark hbase

【Solution 1】:

The following (Scala) can be used to save the content to HBase:

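import java.util.UUID
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConfig = HBaseConfiguration.create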
hbaseConfig.set("hbase.zookeeper.quorum", "xx.xxx.xxx.xxx")
hbaseConfig.set("hbase.zookeeper.property.clientPort", "2181")

val job = Job.getInstance(hbaseConfig)
job.setOutputFormatClass(classOf[TableOutputFormat[_]])
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "test_table")

//  saveAsNewAPIHadoopDataset is defined on pair RDDs, so go through the underlying RDD
val result = sparkDF.rdd.map(row => {
    //  Using UUID as my rowkey, you can use your own rowkey
    val put = new Put(Bytes.toBytes(UUID.randomUUID().toString))

    //  setting the value of each row to Put object
    ....
    ....

    new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(), put)
});

//  save result to hbase table
result.saveAsNewAPIHadoopDataset(job.getConfiguration)
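One thing to keep in mind with a random UUID row key: if a task is retried or the job is re-run, the same input row gets a new key and is stored a second time. If the rows have fields that identify them, a key derived from those fields makes the write repeatable. The helper below is a hypothetical sketch using only the JDK. `rowKey` and the sample field values are illustrative and not part of the answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class RowKeyDemo {
    // Hypothetical helper: derives a stable, name-based UUID from the fields
    // that identify a row, so the same input always maps to the same key.
    static String rowKey(String... identifyingFields) {
        // Join with a NUL separator so ("ab", "c") and ("a", "bc") do not collide.
        String joined = String.join("\0", identifyingFields);
        return UUID.nameUUIDFromBytes(joined.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String first = rowKey("customer-17", "2018-05-10");
        String again = rowKey("customer-17", "2018-05-10");
        String other = rowKey("customer-18", "2018-05-10");
        System.out.println("same input, same key:  " + first.equals(again));
        System.out.println("different input differs: " + !first.equals(other));
    }
}
```

The resulting string can be passed to `Bytes.toBytes(...)` in place of `UUID.randomUUID().toString`. Whether a hashed key suits the table depends on how it will be scanned, since hashing scatters related rows.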

I have the following dependencies in my build.sbt file:

libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.3.0"

【Comments】: