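【Question title】: How to specify join types in AWS Glue?
【Posted】: 2025-12-21 01:35:07
【Question】:

I am using AWS Glue to join two tables. By default, it performs an INNER JOIN. I want to do a LEFT OUTER JOIN. I checked the AWS Glue documentation, but I could not find a way to pass the join type to the Join.apply() method. Is there a way to achieve this in AWS Glue?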

## @type: Join
## @args: [keys1 = id, keys2 = "user_id"]
## @return: cUser
## @inputs: [frame1 = cUser0, frame2 = cUserLogins]
#cUser = Join.apply(frame1 = cUser0, frame2 = cUserLogins, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")


## @type: Join
## @args: [keys1 = id, keys2 = user_id]
## @return: datasource0
## @inputs: [frame1 = cUser, frame2 = cKKR]
datasource0 = Join.apply(frame1 = cUser0, frame2 = cKKR, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")

## @type: Join
## @args: [keys1 = branch_id, keys2 = user_id]
## @return: datasource1
## @inputs: [frame1 = datasource0, frame2 = cBranch]
datasource1 = Join.apply(frame1 = datasource0, frame2 = cBranch, keys1 = "branch_id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")

【Comments】:

    Tags: pyspark etl aws-glue


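    【Answer 1】:

    If you import DynamicFrame (from awsglue.dynamicframe import DynamicFrame), you can convert both frames to Spark DataFrames, join them with the join type you need, and wrap the result back into a DynamicFrame: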

    df0, df1 = datasource0.toDF(), datasource1.toDF()  # DynamicFrame.join takes no join type, so join as Spark DataFrames
    dataSource2 = DynamicFrame.fromDF(df0.join(df1, df0['user_id'] == df1['user_id'], "left"), glueContext, "dataSource2")
    

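    【Comments】:

      【Answer 2】:

      Currently, the AWS Glue Join transform does not support LEFT and RIGHT joins. However, we can still achieve them by converting each DynamicFrame to a DataFrame and using the DataFrame's join method.

      Here is an example: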

      # Load both catalog tables as DynamicFrames and convert each to a Spark DataFrame
      cUser0 = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_users", transformation_ctx = "cUser")
      cUser0DF = cUser0.toDF()

      cKKR = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_karyakartas", redshift_tmp_dir = args["TempDir"], transformation_ctx = "cKKR")
      cKKRDF = cKKR.toDF()

      # DataFrame.join accepts the join type through the how argument
      dataSource0 = cUser0DF.join(cKKRDF, cUser0DF.id == cKKRDF.user_id, how = 'left_outer')
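
The convert, join, convert-back pattern used in both answers can be wrapped in a small helper so the join type becomes an ordinary argument. This is only a sketch: the function name join_dynamic_frames and its parameters are made up for illustration and are not part of the Glue API. It relies only on DynamicFrame.toDF(), DynamicFrame.fromDF() and Spark's DataFrame.join, and it assumes it runs inside a Glue job where awsglue is importable.

```python
def join_dynamic_frames(frame1, frame2, key1, key2, glue_context,
                        how="left_outer", name="joined"):
    """Join two Glue DynamicFrames with an explicit Spark join type.

    Hypothetical helper, not a Glue API. `how` is passed straight to
    Spark, e.g. "inner", "left_outer", "right_outer" or "full_outer".
    """
    # Imported here because awsglue only exists inside the Glue runtime.
    from awsglue.dynamicframe import DynamicFrame

    # Drop down to Spark DataFrames, where the join type can be chosen.
    df1 = frame1.toDF()
    df2 = frame2.toDF()

    # Equi-join on frame1.key1 == frame2.key2 with the requested type.
    joined = df1.join(df2, df1[key1] == df2[key2], how)

    # Wrap the result so later Glue transforms and sinks can consume it.
    return DynamicFrame.fromDF(joined, glue_context, name)
```

With the frames from the question, the call would look something like join_dynamic_frames(cUser0, cKKR, "id", "user_id", glueContext). Both key columns survive an expression-based join, so a follow-up drop or rename may be needed if the two frames share column names.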
      

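      【Comments】:

      • @Vikash, how can we join two tables using only selected fields? My main table has more than 1000 fields.
      • Have you tried the SelectFields transform on the DynamicFrame?
      • I don't know about it since I'm new to Glue. Is there any sample code I can refer to? Below is the query I'm trying to run through Glue.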
      • SELECT v.col1, v.col2 , s.col3 FROM (SELECT col1,col2 FROM t1 WHERE col1 > 0 ) v LEFT JOIN (SELECT col1, col3 FROM t2 WHERE col1 > 0 GROUP BY col1 ) s ON v.col1 = s.col1
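
One way the query in the last comment might be expressed with the Spark DataFrame API is sketched below. The function name left_join_selected is hypothetical, and t1_df and t2_df stand for the Spark DataFrames obtained by calling toDF() on the two DynamicFrames. One assumption is flagged in the code: the inner query groups by col1 while selecting a non-aggregated col3, which Spark SQL rejects, so the sketch treats the intent as "one row per col1" and uses dropDuplicates for that. Selecting only the needed columns before the join is what keeps the 1000+ other fields out of the result; the SelectFields transform suggested above could do the same pruning on the DynamicFrame side.

```python
def left_join_selected(t1_df, t2_df):
    """Left-join two Spark DataFrames, keeping only the needed columns.

    Hypothetical helper illustrating the commenter's query; the column
    names col1, col2 and col3 come from that query.
    """
    # v: SELECT col1, col2 FROM t1 WHERE col1 > 0
    v = t1_df.filter("col1 > 0").select("col1", "col2")

    # s: SELECT col1, col3 FROM t2 WHERE col1 > 0 GROUP BY col1
    # Assumption: GROUP BY with a non-aggregated col3 means "one row per
    # col1", so keep an arbitrary row for each col1 value.
    s = (
        t2_df.filter("col1 > 0")
        .select("col1", "col3")
        .dropDuplicates(["col1"])
    )

    # v LEFT JOIN s ON v.col1 = s.col1; joining on the column name keeps
    # a single col1 in the output, giving col1, col2, col3.
    return v.join(s, on="col1", how="left")
```

If a specific col3 per col1 is required (for example the maximum), the dropDuplicates step would need to be replaced by a groupBy with the matching aggregate.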