【Posted】: 2019-07-12 13:45:12
【Question】:
I have two DataFrames, df1 and df2, shown below:
df1.show()
+---+--------+-----+----+--------+
|c1 | c2 | c3 | c4 | c5 |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 0 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
|c1 | c2 | c3 | c4 | c5 |
+---+--------+-----+----+--------+
| A| abc | a | b | ? |
| C| ghi | a | c | ? |
+---+--------+-----+----+--------+
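If a row of df1 is referenced in df2, I want to update column c5 in df1 and set it to 1. Each record is identified by the columns c1 and c2.
The desired output is below; note that c5 of the first record has been updated to 1: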
+---+--------+-----+----+--------+
|c1 | c2 | c3 | c4 | c5 |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 1 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
【Discussion】:
Tags: python apache-spark pyspark