Can we use JDBC to write data from PostgreSQL to Spark?

【Question title】: Can we use JDBC to write data from postgresql to Spark?
【Posted】: 2019-12-06 16:08:47
【Question】:

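I am trying to load a table I have in PostgreSQL into Spark. I have successfully read the table from PostgreSQL into Spark using JDBC. I have code written in R that I want to run on the table, but I cannot access the data from R.

I connect using the following code: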

    val pgDF_table = spark.read
      .format("jdbc")
      .option("driver", "org.postgresql.Driver")
      .option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
      .option("dbtable", "survey_results")
      .option("user", "prashant")
      .option("password", "pandey")
      .load()
    pgDF_table.show()

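Is there a similar option for spark.write?

【Comments】:

You may find spark.rstudio.com helpful. It lets you define Spark jobs directly from R rather than Scala, and pull data from Spark into R for further processing.
But my data is in PostgreSQL, so I don't see how that relates to my case.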

Tags: r postgresql apache-spark jdbc apache-spark-sql

【Solution 1】:

In SparkR, you can read data from JDBC with the following code:

read.jdbc(url, tableName, partitionColumn = NULL, lowerBound = NULL,
  upperBound = NULL, numPartitions = 0L, predicates = list(), ...)

Arguments

`url`: JDBC database url of the form jdbc:subprotocol:subname.

`tableName`: the name of the table in the external database.

`partitionColumn`: the name of a column of integral type that will be used for partitioning.

`lowerBound`: the minimum value of `partitionColumn` used to decide partition stride.

`upperBound`: the maximum value of `partitionColumn` used to decide partition stride.

`numPartitions`: the number of partitions. This, along with `lowerBound` (inclusive) and `upperBound` (exclusive), forms partition strides for the generated WHERE clause expressions used to split the column `partitionColumn` evenly. This defaults to SparkContext.defaultParallelism when unset.

`predicates`: a list of conditions in the WHERE clause; each one defines one partition.
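To make the role of `lowerBound`, `upperBound` and `numPartitions` more concrete, here is a minimal sketch in Scala (the language used in the question) of how a numeric range could be turned into one WHERE clause per partition. It is a simplified illustration of the idea and not Spark's own code; the object name `PartitionStrideSketch` and the helper `whereClauses` are made up for this example.

```scala
// Simplified illustration only: this is NOT Spark's implementation.
// PartitionStrideSketch and whereClauses are hypothetical names.
object PartitionStrideSketch {
  def whereClauses(column: String,
                   lowerBound: Long,
                   upperBound: Long,
                   numPartitions: Int): Seq[String] = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    require(lowerBound < upperBound, "lowerBound must be below upperBound")

    // Never create more partitions than there are distinct values in the range.
    val parts = math.min(numPartitions.toLong, upperBound - lowerBound).toInt

    if (parts == 1) {
      // A single partition reads the whole table, so the filter is a no-op.
      Seq("1 = 1")
    } else {
      val stride = (upperBound - lowerBound) / parts
      // Cut points between consecutive partitions.
      val cuts = (1 until parts).map(i => lowerBound + i * stride)

      // The first partition is open at the bottom and also takes NULLs.
      val first = s"$column < ${cuts.head} OR $column IS NULL"
      // Middle partitions cover [lo, hi).
      val middle = cuts.sliding(2).collect {
        case Seq(lo, hi) => s"$column >= $lo AND $column < $hi"
      }.toSeq
      // The last partition is open at the top.
      val last = s"$column >= ${cuts.last}"

      (first +: middle) :+ last
    }
  }
}
```

For example, `PartitionStrideSketch.whereClauses("id", 0L, 100L, 4)` yields four clauses with cut points at 25, 50 and 75. Because the first and last clauses are open-ended, the bounds in this sketch only decide the stride; rows outside the range are not filtered out.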

Data can be written to JDBC using the following code:

write.jdbc(x, url, tableName, mode = "error", ...)

Arguments

`x`: a SparkDataFrame.

`url`: JDBC database url of the form jdbc:subprotocol:subname.

`tableName`: the name of the table in the external database.

`mode`: one of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default).

`...`: additional JDBC database connection properties.

The JDBC driver must be on the Spark classpath.
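For readers working in Scala, as in the question, the write side might look roughly like the sketch below. It assumes Spark SQL and the PostgreSQL JDBC driver are already on the classpath (for instance by passing the driver jar with `--jars` when starting `spark-shell`). The object name `JdbcWriteSketch` and both helper names are made up for this example.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper object; requires Spark SQL on the classpath.
object JdbcWriteSketch {
  // Collect the connection settings that the JDBC data source expects.
  def connectionProperties(user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    props.setProperty("driver", "org.postgresql.Driver")
    props
  }

  // Append the rows of df to an existing table, or create it if missing.
  def appendToPostgres(df: DataFrame, url: String, table: String, props: Properties): Unit =
    df.write.mode(SaveMode.Append).jdbc(url, table, props)
}
```

With the DataFrame from the question, a call could look like `JdbcWriteSketch.appendToPostgres(pgDF_table, "jdbc:postgresql://10.128.0.4:5432/sparkDB", "survey_results_copy", props)`, where `survey_results_copy` is a placeholder table name. `SaveMode.Append` corresponds to `mode = "append"` in `write.jdbc`.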

【Comments】:

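Using jdbc gives me the following error: Error in jdbc : java.sql.SQLException: No suitable driver. ERROR RBackendHandler: jdbc on 16 failed java.lang.reflect.InvocationTargetException
Add the PostgreSQL JDBC driver to the classpath. dataxone.com/import-export-postgresql-data-sparkr-dataframe
After successfully adding it to the path, I get the following error when reading the file: `ERROR: operator does not exist: character varying = integer Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.`
So, any idea how to explicitly define the type cast in the predicates?
@VasudhaJain: I guess you are trying to compare an integer with a varchar. Check this answer: stackoverflow.com/a/25358092/5019163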
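As a rough illustration of what an explicit cast in the predicates could look like, the Scala sketch below builds one predicate string per value in two ways: by quoting the literal so that it is compared as text, or by casting the text column to an integer. The object name `CastPredicateSketch`, both helper names and the column name `group_code` are hypothetical and not taken from the thread.

```scala
// Hypothetical helpers that only build predicate strings; no database access.
object CastPredicateSketch {
  // Option 1: keep the column as text and quote each literal.
  def quotedPredicates(column: String, values: Seq[Int]): Array[String] =
    values.map(v => s"$column = '$v'").toArray

  // Option 2: cast the text column to integer before comparing.
  // PostgreSQL raises an error at query time if a row holds non-numeric text.
  def castPredicates(column: String, values: Seq[Int]): Array[String] =
    values.map(v => s"CAST($column AS integer) = $v").toArray
}
```

Either array could then be passed as the `predicates` argument of a JDBC read, where each entry defines one partition. In SparkR the same strings would go into `predicates = list(...)`.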