How to connect Spark with Hive using PySpark? Answers

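【Question title】: How to connect spark with hive using pyspark?
【Posted】: 2019-08-15 18:23:40
【Question】:

I am trying to read a Hive table remotely using PySpark. It fails with an error stating that it is unable to connect to the Hive Metastore client.

I have read multiple answers on SO and other sources. They were mostly about configuration, but none of them addressed why I am unable to connect remotely. I read the documentation and observed that Spark can be connected to Hive without changing any configuration files. Note: I have port-forwarded the machine where Hive is running to localhost:10000. I even connected to the same instance using Presto and was able to run queries on Hive.

The code is: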

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')

I expected the output to be a confirmation that the table was saved, but instead I am facing this error.

The error, in summary:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

I ran this command:

ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 ubuntu@I.J.K.l

When I check ports 10000 and 9083 with these commands:

aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
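
The same reachability check can be done from Python, for example at the top of the script that later starts Spark. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, not part of PySpark. Keep in mind that with `ssh -L` a successful connect only shows that the local end of the tunnel accepts TCP connections, not that the service behind it is answering.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For the tunnel in this question, `port_is_open("localhost", 9083)` and `port_is_open("localhost", 10000)` correspond to the two `nc -zv` calls.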

When I run the script, I get the following error:

Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
    ... 45 more
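
`java.net.UnknownHostException` means the JVM on the local machine could not resolve the name `ip-172-16-1-101.ap-south-1.compute.internal`. Port forwarding tunnels TCP ports, but it does not make hostnames from the remote network resolvable locally. Where the name comes from is not visible in the trace; it may be metadata returned by the remote side, which is an assumption here. A quick standard-library check of whether a name resolves from the machine running the script could look like this (`can_resolve` is a hypothetical helper name):

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved from this machine.
        return False
```

If `can_resolve("ip-172-16-1-101.ap-south-1.compute.internal")` returns False locally, the Java side will hit the same resolution failure.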

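【Question comments】:

stackoverflow.com/questions/36051091/… you might get some ideas here
No, it didn't help. It shows "Could not connect to metastore server".

Tags: python-3.x hive pyspark pyspark-sql thrift-protocol

【Solution 1】:

The key is to set the Hive configuration while the Spark session itself is being created.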

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
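                # Pass key/value only: when conf= is also given, the builder takes its options from that SparkConf and ignores the key/value pair
                .config("hive.metastore.uris", "thrift://localhost:9083")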
                .enableHiveSupport()
                .getOrCreate()
                )

Note that no change to the Spark configuration files is needed; even serverless services such as AWS Glue can connect this way.

Full code:

from pyspark.sql import SparkSession

# Java equivalent, for reference:
# SparkSession ss = SparkSession
#     .builder()
#     .appName(" Hive example")
#     .config("hive.metastore.uris", "thrift://localhost:9083")
#     .enableHiveSupport()
#     .getOrCreate();

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                # Pass key/value only: a conf= argument would make the builder ignore this pair
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
# Write into Hive
#df.write.saveAsTable('example')

# Read the table back from Hive
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()  # show() prints the rows itself and returns None
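
The builder call above can be wrapped in a small helper so that the metastore host and port are not hard-coded. This is only a sketch: `metastore_uri` and `build_hive_session` are hypothetical names, and the default port 9083 is taken from the URI used in this answer. PySpark is imported inside the function so that the module still loads on machines where it is not installed.

```python
def metastore_uri(host: str, port: int = 9083) -> str:
    """Build the thrift URI expected by hive.metastore.uris."""
    return f"thrift://{host}:{port}"


def build_hive_session(uri: str, app_name: str = "hive-example"):
    """Create a Hive-enabled SparkSession that points at the given metastore."""
    # Lazy import: only needed when a session is actually requested.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Key/value form only; set on the builder before getOrCreate().
        .config("hive.metastore.uris", uri)
        .enableHiveSupport()
        .getOrCreate()
    )
```

With the tunnel from the question in place, `spark = build_hive_session(metastore_uri("localhost"))` followed by `spark.sql("SHOW TABLES").show()` should list the remote tables if the connection works.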

【Comments】:

I am trying to do the same thing, but with Hive 3.0, and it does not show any Hive tables; it only connects to the Spark catalog. Is there a reason for that?