Write a PySpark dataframe into a partitioned Hive table: answers

【Question title】: Write PySpark dataframe into Partitioned Hive table
【Posted】: 2020-11-27 14:29:43
【Question】:

I am learning Spark. I have a dataframe ts with the following structure.

ts.show()
+--------------------+--------------------+
|                 UTC|                 PST|
+--------------------+--------------------+
|2020-11-04 02:24:...|2020-11-03 18:24:...|
+--------------------+--------------------+

I need to insert ts into a partitioned Hive table with the following structure:

spark.sql(""" create table db.ts_part
(
UTC timestamp,
PST timestamp
)
PARTITIONED BY(  bkup_dt DATE )
STORED AS ORC""")

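How can I pass the system run date dynamically in the insert statement, so that the table is partitioned on bkup_dt by that date?

I tried the code below, but it did not work.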

ts.write.partitionBy(current_date()).insertInto("db.ts_part",overwrite=False)

How can I do this? Can anyone help?

【Comments】:

What doesn't work? What is the unexpected behaviour or error?
I get this error: TypeError: Column is not iterable

Tags: apache-spark pyspark hive

【Solution 1】:

Try creating a new column with current_date(), then write it into the partitioned Hive table.

Example:

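from pyspark.sql.functions import current_date

# insertInto() takes the partitioning from the table definition and raises
# an AnalysisException when combined with partitionBy(), so it is omitted here.
# Columns are matched by position: bkup_dt must be the last column.
df.\
withColumn("bkup_dt",current_date()).\
write.\
insertInto("db.ts_part",overwrite=False)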

UPDATE:

Try creating a temp view and then running an insert statement.

df.createOrReplaceTempView("tmp")

spark.sql("insert into table <table_name> partition (bkup_dt) select *,current_date bkup_dt from tmp")
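
The temp-view route can be wrapped in a small helper. This is only a sketch under assumptions: the name insert_with_run_date is made up for illustration, spark is assumed to be an existing Hive-enabled SparkSession, and the two hive.exec.dynamic.partition settings are included on the assumption that the session still runs in Hive's default strict mode, which rejects an insert where every partition column is dynamic.

```python
def insert_with_run_date(spark, df, table, view="tmp"):
    """Append df to a Hive table partitioned by bkup_dt, using today's date.

    Hypothetical helper: `spark` is assumed to be a Hive-enabled SparkSession
    and `table` an existing table whose only partition column is bkup_dt.
    """
    # Allow inserts in which every partition column is dynamic.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    # Expose the dataframe to SQL under a temporary name.
    df.createOrReplaceTempView(view)

    # The partition column has to come last in the SELECT list.
    statement = (
        f"INSERT INTO TABLE {table} PARTITION (bkup_dt) "
        f"SELECT *, current_date() AS bkup_dt FROM {view}"
    )
    spark.sql(statement)
    return statement
```

With the dataframe from the question, the call would look like insert_with_run_date(spark, ts, "db.ts_part").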

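【Comments】:

Thanks. I know this will work, but this is a huge dataframe. Is there another way to pass it dynamically instead of adding a column to the dataframe?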
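
One possible way to avoid the extra column is a static partition insert, where the run date is written into the PARTITION clause as a literal and the SELECT returns only the data columns. The sketch below is not from the thread: the function names are hypothetical, spark is assumed to be a Hive-enabled SparkSession, and the date is computed on the driver with Python's date.today(), which may differ from Spark's current_date() if the driver and the session use different time zones.

```python
from datetime import date


def static_partition_insert(table, view, columns, run_date=None):
    """Build an INSERT whose bkup_dt partition value is a fixed literal."""
    run_date = run_date or date.today()  # run date is computed on the driver
    column_list = ", ".join(columns)
    return (
        f"INSERT INTO TABLE {table} "
        f"PARTITION (bkup_dt = '{run_date.isoformat()}') "
        f"SELECT {column_list} FROM {view}"
    )


def insert_for_run_date(spark, df, table="db.ts_part", view="tmp"):
    """Hypothetical wrapper; `spark` is assumed to be a Hive-enabled SparkSession."""
    # Register the dataframe as-is; no bkup_dt column is added to it.
    df.createOrReplaceTempView(view)
    spark.sql(static_partition_insert(table, view, df.columns))
```

For the dataframe in the question, insert_for_run_date(spark, ts) would issue a statement of the form INSERT INTO TABLE db.ts_part PARTITION (bkup_dt = '<today>') SELECT UTC, PST FROM tmp. Because the partition value is static, this form does not rely on the dynamic partition settings.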