【Posted】: 2019-02-13 19:48:20
【Problem Description】:
Spark 1.6, retrieving data from my Vertica database in order to process it. The following query runs fine on the Vertica db itself, but does not work from PySpark. Spark DataFrames do support predicate pushdown with JDBC sources, but the term predicate is meant in the strict SQL sense: it only covers the WHERE clause, and it appears to be limited to logical conjunctions (no IN and no OR, I'm afraid) and simple predicates. The code below fails with this error: java.lang.RuntimeException: Option 'dbtable' not specified
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("hivereader")
        .setMaster("yarn-client")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.shuffle.service.enabled", "false")
        .set("spark.io.compression.codec", "snappy")
        .set("spark.rdd.compress", "true")
        .set("spark.executor.cores", 7)
        .set("spark.sql.inMemoryStorage.compressed", "true")
        .set("spark.sql.shuffle.partitions", 2000)
        .set("spark.sql.tungsten.enabled", "true")
        .set("spark.port.maxRetries", 200))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = "*******"
properties = {"user": "*****", "password": "*******", "driver": "com.vertica.jdbc.Driver"}

df = sqlContext.read.format("JDBC").options(
    url=url,
    query="SELECT date(time_stamp) AS DATE, subscriber AS IMSI, server_hostname AS WEBSITE, bytes_in AS DOWNLINK, bytes_out AS UPLINK, connections_out AS CONNECTION FROM traffic.stats WHERE date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'",
    **properties
).load()
df.show()
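The error arises because the Spark 1.6 JDBC source has no `query` option; the only way to name the data is `dbtable`, which also accepts a parenthesized subquery with an alias in place of a table name. A minimal sketch of that rewrite, assuming the same `url` and `properties` as above (the `as_dbtable` helper name is made up here, and the `.read` call is shown commented out since it needs a live cluster and database):

```python
# Workaround sketch for Spark 1.6: the JDBC source only understands the
# `dbtable` option, but `dbtable` may be any "(<select>) AS <alias>" string.
def as_dbtable(query, alias="tmp"):
    # Wrap a SELECT so the JDBC source treats it like a table name;
    # the whole subquery is then executed on the Vertica side.
    return "({0}) AS {1}".format(query, alias)

query = ("SELECT date(time_stamp) AS DATE, subscriber AS IMSI, "
         "server_hostname AS WEBSITE, bytes_in AS DOWNLINK, "
         "bytes_out AS UPLINK, connections_out AS CONNECTION "
         "FROM traffic.stats "
         "WHERE date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'")

# df = sqlContext.read.format("jdbc").options(
#     url=url,                    # same JDBC url as in the question
#     dbtable=as_dbtable(query),  # subquery instead of a table name
#     **properties
# ).load()
# df.show()
```

Since the full query runs inside the subquery on Vertica, this also sidesteps the predicate-pushdown limitations mentioned above: the WHERE clause is evaluated by the database, not by Spark.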
【Discussion】:
Tags: sql apache-spark pyspark vertica