【Posted】: 2018-08-31 18:13:46
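【Question】:
We have data (tables) stored as CSV files in an S3 bucket. We need to apply a join transformation and store the result back in S3. The join itself succeeds, but the columns in the file written to S3 are shuffled: the original column order is not preserved. The output file also contains extra quotation marks (") and dots (.).
When only a mapping is applied, the column order does not change and the output is correct.
The script can be in either Python or Scala.
Script: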
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "table1", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "reftable", transformation_ctx = "datasource1")
datasource2 = datasource1.join(["aaaaaaaaaid"], ["aaaaaaaaaid"], datasource0, transformation_ctx = "join")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = datasource2, connection_type = "s3", connection_options = {"path": "s3://testing/Output"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
Any help would be appreciated!
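One possible way to restore a predictable column order after the join is to convert the joined DynamicFrame to a Spark DataFrame, select the columns in the desired order, and convert it back before writing. The sketch below is only an illustration under stated assumptions: `ordered_columns` and `reorder` are hypothetical helper names that are not part of the original script, and skipping names that start with a dot assumes the stray dots in the output come from duplicate key columns that the join names with a leading `.`.

```python
from typing import Iterable, List


def ordered_columns(left: Iterable[str], right: Iterable[str]) -> List[str]:
    """Return the left columns followed by right columns not already listed.

    Names starting with "." are skipped, on the ASSUMPTION that they are the
    duplicate key columns produced by the join.
    """
    result: List[str] = []
    for name in list(left) + list(right):
        if name.startswith(".") or name in result:
            continue
        result.append(name)
    return result


def reorder(dyf, columns, glue_context, name="reordered"):
    """Rebuild a DynamicFrame with its columns in the given order (hypothetical helper)."""
    # Imported lazily so ordered_columns can also be used outside a Glue job.
    from awsglue.dynamicframe import DynamicFrame

    # DataFrame.select keeps the order of the names passed to it.
    df = dyf.toDF().select(*columns)
    return DynamicFrame.fromDF(df, glue_context, name)
```

In the script above this might be used right before the write step, for example `cols = ordered_columns(datasource0.toDF().columns, datasource1.toDF().columns)` followed by `datasource2 = reorder(datasource2, cols, glueContext)`. For the extra quotation marks, the Glue CSV writer's `format_options = {"quoteChar": -1}` setting may be worth trying, since it is documented as turning quoting off; whether it resolves this particular output has not been verified here.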
【Discussion】:
Tags: python scala dataframe join aws-glue