I ran into exactly the same problem on an AWS EMR cluster (emr-5.31.0).
Setting spark.driver.extraClassPath and spark.executor.extraClassPath in SparkSession.builder.config(), or in spark-defaults.conf, or pointing spark-submit --jars at the location of ojdbc6.jar did not work.
I finally got it working by passing the Maven coordinates to spark.jars.packages, and then I also had to set spark.driver.extraClassPath and spark.executor.extraClassPath to $HOME/.ivy2/jars/*:
import os
from pyspark.sql import SparkSession

# Maven coordinates that Spark resolves and downloads at startup
spark_packages_list = [
    'io.delta:delta-core_2.11:0.6.1',
    'com.oracle.database.jdbc:ojdbc6:11.2.0.4',
]
spark_packages = ",".join(spark_packages_list)

home = os.getenv("HOME")

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", spark_packages)
    # The resolved jars land in ~/.ivy2/jars, so put that directory
    # on both the driver and executor classpaths
    .config("spark.driver.extraClassPath", f"{home}/.ivy2/jars/*")
    .config("spark.executor.extraClassPath", f"{home}/.ivy2/jars/*")
    .getOrCreate()
)
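For reference, the same configuration can be sketched as a spark-submit invocation instead of builder config calls; the script name below is a placeholder:

```shell
spark-submit \
  --packages io.delta:delta-core_2.11:0.6.1,com.oracle.database.jdbc:ojdbc6:11.2.0.4 \
  --conf spark.driver.extraClassPath="$HOME/.ivy2/jars/*" \
  --conf spark.executor.extraClassPath="$HOME/.ivy2/jars/*" \
  your_script.py  # placeholder for your application
```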
Then the following worked (change the parameters accordingly):
host = "111.111.111.111"
port = "1234"
schema = "YourSchema"
URL = f"jdbc:oracle:thin:@{host}:{port}/{schema}"

# strip() removes any trailing newline left in the credential files
with open(f"{home}/username.file", "r") as f:
    username = f.read().strip()
with open(f"{home}/password.file", "r") as f:
    password = f.read().strip()

query = "SELECT * FROM YourTable"

df = (spark.read.format("jdbc")
    .option("url", URL)
    .option("query", query)
    .option("user", username)
    .option("password", password)
    .load()
)
df.printSchema()
df.show()
Or:
properties = {
    "user": username,
    "password": password,
}

df = spark.read.jdbc(
    url=URL,
    table="YourTable",
    properties=properties,
)
df.printSchema()
df.show()