[Posted at]: 2021-04-30 01:31:34
[Problem description]:
We have ETL jobs in Python (Luigi). They all connect to the Hive Metastore to fetch partition information.
Code:
from hive_metastore import ThriftHiveMetastore
client = ThriftHiveMetastore.Client(protocol)
partitions = client.get_partition_names('sales', 'salesdetail', -1)
Here -1 is max_parts (the maximum number of partitions to return).
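The snippet above omits how `protocol` is built. A minimal sketch of the usual Thrift wiring for a metastore client, with an explicit socket timeout; the host, port, and timeout value here are assumptions, not taken from the question:

```python
# Hypothetical connection setup for the ThriftHiveMetastore client.
# 'metastore-host' and port 9083 (the common metastore default) are assumptions.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_metastore import ThriftHiveMetastore

sock = TSocket.TSocket('metastore-host', 9083)
sock.setTimeout(120 * 1000)  # milliseconds; generous to ride out slow responses
transport = TTransport.TBufferedTransport(sock)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHiveMetastore.Client(protocol)

transport.open()
try:
    # -1 = max_parts, i.e. no cap on the number of partition names returned
    partitions = client.get_partition_names('sales', 'salesdetail', -1)
finally:
    transport.close()
```

Raising the socket timeout on the client side is independent of the metastore's own 15-minute server timeout mentioned below.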
It randomly times out like this:
File "/opt/conda/envs/etl/lib/python2.7/site-packages/luigi/contrib/hive.py", line 210, in _existing_partitions
partition_strings = client.get_partition_names(database, table, -1)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1703, in get_partition_names
return self.recv_get_partition_names()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1716, in recv_get_partition_names
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
buff = self.trans.readAll(4)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz - have)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 159, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 105, in read
buff = self.handle.recv(sz)
timeout: timed out
This error occurs only occasionally.
The Hive Metastore has a 15-minute timeout.
When I run get_partition_names on its own, it returns data within a few seconds.
Even with socket.timeout set to 1 or 2 seconds, the query completes.
There is no record of any socket connection-close message in the Hive Metastore log (cat /var/log/hive/..log.out).
The tables it usually times out on have a large number of partitions (~10K+). But as mentioned, they only time out randomly; when this part of the code is tested in isolation, they return partition metadata quickly.
Any ideas why it times out randomly, how to capture these timeout errors in the Metastore logs, or how to fix them?
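Since the same call succeeds within seconds when rerun, one pragmatic mitigation (a workaround sketch, not a root-cause fix) is to retry the metastore call on socket.timeout with a short backoff. The helper name `call_with_retries` is hypothetical:

```python
import socket
import time


def call_with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on socket.timeout, retry with a simple linear backoff.

    Re-raises the timeout if every attempt fails, so genuine outages
    still surface as errors in the Luigi task.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except socket.timeout:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)


# Usage with the metastore call from the question:
# partitions = call_with_retries(
#     lambda: client.get_partition_names('sales', 'salesdetail', -1)
# )
```

This does not explain the random timeouts, but it keeps intermittent ones from failing the whole job.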
[Question discussion]:
Tags: python hadoop hive thrift hive-metastore