[Posted]: 2021-02-03 15:53:29
[Problem description]:
I am trying to use Cloudera's Impala JDBC 2.6.17.1020 connector driver with Spark, so that I can access tables in both Kudu and Hive.
When the query is simple, it works fine and I get the expected output.
val simpleQuery = """SELECT *
                     FROM HIVE_TABLE
                     WHERE ABC = '123'"""
val simpleQueryDF: DataFrame = spark.read.jdbc(this.impalaJDBCString, "(" + simpleQuery + ") as temp", this.impalaConnectionProps)
simpleQueryDF.show()
But when the query contains nested queries and joins, I get the header repeated as rows, as shown below:
+----------+--------+------------------------+--------------+ ...
|as_of_date|dq_index|impacted_data_lake_table|pk_column_name| ...
+----------+--------+------------------------+--------------+ ...
|as_of_date|dq_index| impacted_data_lak...|pk_column_name| ...
|as_of_date|dq_index| impacted_data_lak...|pk_column_name| ...
|as_of_date|dq_index| impacted_data_lak...|pk_column_name| ...
+----------+--------+------------------------+--------------+ ...
I have no idea why this happens. Can someone tell me what to do?
The complex query is as follows:
SELECT '2020-10-22' AS AS_OF_DATE,
DATA_QUERY.DQ_INDEX,
DATA_QUERY.IMPACTED_DATA_LAKE_TABLE,
DATA_QUERY.PK_COLUMN_NAME,
DATA_QUERY.PK_OF_BAD_RECORD,
DATA_QUERY.DQ_ASSESSMENT_DIMENSION,
DATA_QUERY.SOURCE_SYSTEM,
'ABC' AS BUSINESS_SUBJECT_AREA,
DATA_QUERY.SUB_SUBJECT_AREA_CD,
NVL(SBJ_AREA_DESC_TABLE.SUB_SUBJECT_AREA, 'Not Recognized') AS SUB_SUBJECT_AREA
FROM (
SELECT 'DQ_INDEX_1' AS DQ_INDEX,
'dq_data_governance.table_1 fin_year
dq_data_governance.table_2
dq_data_governance.table_3
dq_data_governance.table_4 ' AS IMPACTED_DATA_LAKE_TABLE,
NAMES AS PK_COLUMN_NAME,
CAST(VALS AS STRING) AS PK_OF_BAD_RECORD,
'KLM' AS DQ_ASSESSMENT_DIMENSION,
'NOQ' AS SOURCE_SYSTEM,
'TTV' AS SUB_SUBJECT_AREA_CD
FROM (
SELECT 'table_4.SEGMENT4|table_4.SEGMENT8|table_1.FIN_YEAR' AS NAMES,
concat(GCC.SEGMENT4,'|',GCC.SEGMENT8,'|',CAST(FIN_YEAR as STRING)) AS VALS
FROM dq_data_governance.table_1 FIN_YEAR,
dq_data_governance.table_2 gjh,
dq_data_governance.table_3 gjl,
dq_data_governance.table_4 GCC
WHERE GJH.JE_HEADER_ID = GJL.JE_HEADER_ID
AND GJL.CODE_COMBINATION_ID = GCC.CODE_COMBINATION_ID
AND ACTUAL_FLAG = '5'
AND GCC.SEGMENT9 = '9'
AND EFFECTIVE_DATE BETWEEN FIN_YEAR.from_date AND FIN_YEAR.to_date
GROUP BY GCC.SEGMENT4,GCC.SEGMENT8,FIN_YEAR
HAVING sum((nvl((GJL.ACCOUNTED_DR), 0) - nvl((GJL.ACCOUNTED_CR), 0))) < -0.1
) AS T_DQ_INDEX_1
UNION ALL
SELECT 'DQ_INDEX_2' AS DQ_INDEX,
'dq_data_governance.table_5' AS IMPACTED_DATA_LAKE_TABLE,
NAMES AS PK_COLUMN_NAME,
CAST(VALS AS STRING) AS PK_OF_BAD_RECORD,
'BKL' AS DQ_ASSESSMENT_DIMENSION,
'CCG' AS SOURCE_SYSTEM,
'LMA' AS SUB_SUBJECT_AREA_CD
FROM (
SELECT 'table_5.contractid|table_5.name' AS NAMES,
concat(CAST(c.contractid as STRING),'|',c.name) AS VALS
FROM dq_data_governance.table_5 c
INNER JOIN dq_data_governance.table_6 ps ON ps.processinstanceid=c.processinstanceid
WHERE c.amountinriyal<1 and ps.approvalstatusid=3
) AS T_DQ_INDEX_2
UNION ALL
-- ...
-- <MORE QUERIES CAN BE ADDED HERE>
-- ...
) AS DATA_QUERY
LEFT JOIN dq_data_governance.dq_sub_subject_area as SBJ_AREA_DESC_TABLE ON DATA_QUERY.SUB_SUBJECT_AREA_CD = SBJ_AREA_DESC_TABLE.SUB_SUBJECT_AREA_CD
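A commonly reported cause of this symptom (an assumption, not confirmed in this thread) is that Spark's default JDBC dialect wraps column names in double quotes when it builds the SELECT it sends through the driver, and Impala interprets double-quoted text as string literals, so every row comes back as the literal column names. A possible workaround is to register a custom `JdbcDialect` that quotes identifiers with backticks instead; a minimal sketch, with the `jdbc:impala:` URL prefix assumed from the Cloudera driver:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with backticks so Impala does not treat them as string literals.
object ImpalaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:impala:")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register the dialect before calling spark.read.jdbc(...)
JdbcDialects.registerDialect(ImpalaDialect)
```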
[Discussion]:
-
Kudu can be read with Spark. Hive support is built in. Why do you need JDBC?
-
I need to be able to execute queries that involve both Kudu and Hive tables, without knowing in advance which type (Hive or Kudu) each table in the query is.
-
Couldn't you query into individual DataFrames, then reference/join the two with Spark SQL?
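The approach suggested in this comment could look roughly like the following sketch; the Kudu master address and table names are placeholders, and the kudu-spark connector is assumed to be on the classpath:

```scala
// Read a Kudu table via the kudu-spark connector (master address is a placeholder).
val kuduDF = spark.read
  .format("kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "impala::dq_data_governance.table_5")
  .load()

// Hive tables are available directly through the Spark catalog.
val hiveDF = spark.table("dq_data_governance.table_6")

// Join the two with Spark SQL instead of pushing one query through JDBC.
kuduDF.createOrReplaceTempView("t5")
hiveDF.createOrReplaceTempView("t6")
val joined = spark.sql(
  "SELECT t5.contractid, t5.name FROM t5 JOIN t6 ON t5.processinstanceid = t6.processinstanceid")
```

This only works, of course, if the Hive-vs-Kudu classification of each table can be resolved up front, which is what the asker says they do not know.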
-
The Impala editor doesn't use JDBC, though. I believe it uses the impyla library, since HUE is written in Python.
-
Since I don't have an Impala environment to compare against, I'm not sure what else to debug. But for anyone else looking into the problem, it may be worth adding the actual table schemas and the query you are using.
Tags: scala apache-spark hive apache-spark-sql impala