【Question Title】: Using Impala JDBC in Scala Spark application
【Posted】: 2021-02-03 15:53:29
【Question Description】:

I am trying to use Cloudera's Impala JDBC 2.6.17.1020 connector driver with Spark so that I can access tables in both Kudu and Hive.

When the query is simple, it works fine and I get the expected output.

import org.apache.spark.sql.DataFrame

// Wrap the query as a derived table so it can be passed to spark.read.jdbc
val simpleQuery = """ SELECT *
                       FROM HIVE_TABLE
                       WHERE ABC = '123'
                   """
val simpleQueryDF: DataFrame = spark.read.jdbc(this.impalaJDBCString, "(" + simpleQuery + ") as temp", this.impalaConnectionProps)
simpleQueryDF.show()

But when the query contains nested queries and joins, the column headers come back repeated as rows, like this:

+----------+--------+------------------------+--------------+ ...
|as_of_date|dq_index|impacted_data_lake_table|pk_column_name| ...
+----------+--------+------------------------+--------------+ ...
|as_of_date|dq_index|    impacted_data_lak...|pk_column_name| ...
|as_of_date|dq_index|    impacted_data_lak...|pk_column_name| ...
|as_of_date|dq_index|    impacted_data_lak...|pk_column_name| ...
+----------+--------+------------------------+--------------+ ...

I don't know why this is happening. Can someone tell me what to do?

The complex query is as follows:

SELECT    '2020-10-22' AS AS_OF_DATE,
          DATA_QUERY.DQ_INDEX,
          DATA_QUERY.IMPACTED_DATA_LAKE_TABLE,
          DATA_QUERY.PK_COLUMN_NAME,
          DATA_QUERY.PK_OF_BAD_RECORD,
          DATA_QUERY.DQ_ASSESSMENT_DIMENSION,
          DATA_QUERY.SOURCE_SYSTEM,
          'ABC' AS BUSINESS_SUBJECT_AREA,
          DATA_QUERY.SUB_SUBJECT_AREA_CD,
          NVL(SBJ_AREA_DESC_TABLE.SUB_SUBJECT_AREA, 'Not Recognized') AS SUB_SUBJECT_AREA
FROM (
        SELECT  'DQ_INDEX_1' AS DQ_INDEX,
                'dq_data_governance.table_1 fin_year
                 dq_data_governance.table_2
                 dq_data_governance.table_3
                 dq_data_governance.table_4 ' AS IMPACTED_DATA_LAKE_TABLE,
                NAMES AS PK_COLUMN_NAME,
                CAST(VALS AS STRING) AS PK_OF_BAD_RECORD,
                'KLM' AS DQ_ASSESSMENT_DIMENSION,
                'NOQ' AS SOURCE_SYSTEM,
                'TTV' AS SUB_SUBJECT_AREA_CD
        FROM (
                SELECT 'table_4.SEGMENT4|table_4.SEGMENT8|table_1.FIN_YEAR' AS NAMES,
                        concat(GCC.SEGMENT4,'|',GCC.SEGMENT8,'|',CAST(FIN_YEAR as STRING)) AS VALS
                FROM dq_data_governance.table_1 FIN_YEAR,
                     dq_data_governance.table_2 gjh,
                     dq_data_governance.table_3 gjl,
                     dq_data_governance.table_4 GCC
                WHERE GJH.JE_HEADER_ID = GJL.JE_HEADER_ID
                  AND GJL.CODE_COMBINATION_ID = GCC.CODE_COMBINATION_ID
                  AND ACTUAL_FLAG = '5'
                  AND GCC.SEGMENT9 = '9'
                  AND EFFECTIVE_DATE BETWEEN FIN_YEAR.from_date AND FIN_YEAR.to_date 
                GROUP BY GCC.SEGMENT4,GCC.SEGMENT8,FIN_YEAR
                HAVING sum((nvl((GJL.ACCOUNTED_DR), 0) - nvl((GJL.ACCOUNTED_CR), 0))) <-0.1 
        ) AS T_DQ_INDEX_1
        
        UNION ALL 
        
        SELECT  'DQ_INDEX_2' AS DQ_INDEX,
                'dq_data_governance.table_5' AS IMPACTED_DATA_LAKE_TABLE,
                NAMES AS PK_COLUMN_NAME,
                CAST(VALS AS STRING) AS PK_OF_BAD_RECORD,
                'BKL' AS DQ_ASSESSMENT_DIMENSION,
                'CCG' AS SOURCE_SYSTEM,
                'LMA' AS SUB_SUBJECT_AREA_CD
        FROM (
                SELECT 'table_5.contractid|table_5.name' AS NAMES,
                        concat(CAST(c.contractid as STRING),'|',c.name) AS VALS
                FROM dq_data_governance.table_5 c
                INNER JOIN dq_data_governance.table_6 ps ON ps.processinstanceid=c.processinstanceid
                WHERE c.amountinriyal<1 and ps.approvalstatusid=3
        ) AS T_DQ_INDEX_2
        
        UNION ALL
        
        -- ...
        -- <MORE QUERIES CAN BE ADDED HERE>
        -- ...
        
) AS DATA_QUERY
LEFT JOIN dq_data_governance.dq_sub_subject_area as SBJ_AREA_DESC_TABLE ON DATA_QUERY.SUB_SUBJECT_AREA_CD = SBJ_AREA_DESC_TABLE.SUB_SUBJECT_AREA_CD
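
One commonly reported cause of column names coming back as row values when Spark reads Hive or Impala over JDBC is identifier quoting: Spark's default JDBC dialect wraps column names in double quotes, which Impala can parse as string literals. Whether that is what happens here is not confirmed; the following is only a minimal sketch of registering a custom dialect that quotes identifiers with backticks, to rule that out.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect: quote identifiers with backticks instead of Spark's default
// double quotes, which Impala may otherwise treat as string literals.
object ImpalaDialect extends JdbcDialect {
  // Only apply this dialect to Impala JDBC URLs
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:impala")

  // Backtick-quote column names so Impala reads them as identifiers
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register the dialect before calling spark.read.jdbc(...)
JdbcDialects.registerDialect(ImpalaDialect)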


【Question Comments】:

  • Kudu can be read with Spark, and Hive support is built in. Why do you need JDBC?
  • I need to be able to run queries that involve both Kudu and Hive tables, and I don't know up front whether a given table in the query is a Hive or a Kudu table.
  • Couldn't you query each into its own DataFrame and then reference/join both with Spark SQL? (See the sketch after this list.)
  • The Impala editor doesn't use JDBC, though. I believe it uses the impyla library, since HUE is written in Python.
  • Since I don't have an Impala environment to compare against, I'm not sure what else to debug. But for anyone else trying to spot the problem, it might be worth adding the actual table schemas and query you are using.
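
A minimal sketch of the approach suggested in the comment above, assuming the kudu-spark connector jar is on the classpath; the Kudu master address, table names and join columns below are placeholders, not values from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kudu-hive-join")
  .enableHiveSupport()   // lets spark.table() resolve Hive metastore tables
  .getOrCreate()

// Kudu table via the kudu-spark connector (no JDBC involved)
val kuduDF = spark.read
  .format("kudu")
  .option("kudu.master", "kudu-master-host:7051")     // placeholder master address
  .option("kudu.table", "impala::some_db.kudu_table") // placeholder table name
  .load()
kuduDF.createOrReplaceTempView("kudu_side")

// Hive table straight from the metastore
spark.table("dq_data_governance.table_5").createOrReplaceTempView("hive_side")

// Join both sources in a single Spark SQL statement (join keys are placeholders)
val joined = spark.sql(
  """SELECT k.*, h.name
    |FROM kudu_side k
    |JOIN hive_side h ON k.contractid = h.contractid""".stripMargin)
joined.show()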

Tags: scala apache-spark hive apache-spark-sql impala


【Solution 1】:

Try the Amazon distribution of the Impala driver here.

When using this driver, also re-check that double data types match between the table and the DataFrame.

I hope it works for you.
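
For the data-type check mentioned above, one quick comparison (the table name below is a placeholder) is to print the schema Spark infers over JDBC and compare it with the table's DESCRIBE output in Impala:

// Compare Spark's inferred JDBC schema with the Impala table definition,
// paying particular attention to DOUBLE vs DECIMAL/FLOAT columns.
val checkDF = spark.read.jdbc(this.impalaJDBCString, "your_db.your_table", this.impalaConnectionProps)
checkDF.printSchema()   // compare with `DESCRIBE your_db.your_table` in impala-shell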

【Comments】:

  • I just tried it and the result is the same. By the way, I am on a Cloudera environment, not AWS EMR.
  • This Simba library is compatible with non-EMR environments too. I don't know which protocol you are using, but you could try connecting to Impala with the "jdbc:hive2" protocol instead of "jdbc:impala", keeping the rest of the URL the same.
  • I am using the Impala JDBC URL jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;... Replacing impala with hive2 fails to connect. (See the sketch after this list.)
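
A minimal sketch of what the comments above describe: keeping the URL shape the same but switching the scheme, and loading a driver class that accepts that scheme. The host, port and class names here are illustrative and depend on the driver jars actually on the classpath.

import java.util.Properties

// Impala JDBC driver (Cloudera/Simba 2.6.x packaging; the class name may differ per jar)
val impalaProps = new Properties()
impalaProps.setProperty("driver", "com.cloudera.impala.jdbc.Driver")
val impalaUrl = "jdbc:impala://host:21050/default;Property1=Value"

// Same host/port over the HiveServer2 protocol, but this needs the Hive JDBC driver,
// not the Impala one; otherwise the connection fails as described above.
val hive2Props = new Properties()
hive2Props.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
val hive2Url = "jdbc:hive2://host:21050/default"

val df = spark.read.jdbc(impalaUrl, "(SELECT 1 AS one) AS t", impalaProps)
df.show()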