【Question Title】: EMR PySpark write to Redshift: java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only
【Posted】: 2021-08-14 04:00:41
【Question Description】:

I'm getting an error when trying to write data to Redshift with PySpark on an EMR cluster.

df.write.format("jdbc") \
   .option("url", "jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db") \
   .option("driver", "com.amazon.redshift.jdbc42.Driver") \
   .option("dbtable", "public.table") \
   .option("user", user_redshift) \
   .option("password", password_redshift) \
   .mode("overwrite") \
   .save()

The error I get is:

py4j.protocol.Py4JJavaError: An error occurred while calling o143.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, , executor 1): 
java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only;
    at com.amazon.redshift.client.messages.inbound.ErrorResponse.toErrorException(Unknown Source)
    at com.amazon.redshift.client.PGMessagingContext.handleErrorResponse(Unknown Source)
    at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
    at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
    at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
    at com.amazon.redshift.client.PGMessagingContext.getParameterDescription(Unknown Source)
    at com.amazon.redshift.client.PGClient.prepareStatement(Unknown Source)
    at com.amazon.redshift.dataengine.PGQueryExecutor.<init>(Unknown Source)
    at com.amazon.redshift.dataengine.PGDataEngine.prepare(Unknown Source)
    at com.amazon.jdbc.common.SPreparedStatement.<init>(Unknown Source)
    ...

Any help is appreciated. Thanks!

【Question Comments】:

    Tags: apache-spark pyspark amazon-redshift amazon-emr spark-redshift


    【Solution 1】:

    We faced the same issue on our EMR PySpark cluster, with EMR "ReleaseLabel": "emr-5.33.0" and Spark version 2.4.7.

    We resolved it with the following changes:

    1. Used the Redshift JDBC jar redshift-jdbc42-2.0.0.7.jar from https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-previous-driver-version-20.html
    2. Changed the JDBC URL to the following:
      jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db?user=username&password=password;ReadOnly=false
      

    Then you can try running it with spark-submit: spark-submit --jars s3://jars/redshift-jdbc42-2.0.0.7.jar s3://scripts/scriptname.py, where scriptname.py contains:

    df.write\
        .format('jdbc')\
        .option("driver", "com.amazon.redshift.jdbc42.Driver")\
        .option("url", jdbcUrl)\
        .option("dbtable", "schema.table")\
        .option("aws_iam_role", "XXXX") \
        .option("tempdir", f"s3://XXXXXX") \
        .mode('append')\
        .save()
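    The jdbcUrl variable used above is not defined in the answer. A minimal sketch of how it might be assembled, following the URL shown in step 2 (the host, database, and credentials here are placeholders, not real values):

```python
# Placeholder connection details -- substitute your own cluster endpoint,
# database name, and credentials.
host = "clustername.yyyyy.us-east-1.redshift.amazonaws.com"
port = 5439
database = "db"
user = "username"
password = "password"

# Appending ReadOnly=false tells the driver not to open a read-only
# session, which is what produces the
# "[Amazon](500310) Invalid operation: The session is read-only" error.
jdbcUrl = (
    f"jdbc:redshift://{host}:{port}/{database}"
    f"?user={user}&password={password};ReadOnly=false"
)
```

    Since the URL embeds the credentials, the separate "user"/"password" options from the question are no longer needed in the write call.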
    
    

    【Comments】:
