Spark: add two columns and fill them with data computed from other columns

【Question title】: Spark: add two columns and fill them with data computed from other columns
【Posted】: 2016-12-04 19:21:50
【Question description】:

Using pyspark 2.0.1

I have this DataFrame:

+-----------+----------+
| Longitude | Latitude |
+-----------+----------+
|  1        |  3       |
|  2        |  1       |
|  2        |  3       |
+-----------+----------+

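I want to efficiently add two columns named City and Province. For each row, the column values (Longitude and Latitude) should be passed as input to a Python function I have already written, which returns the city and the province. The output should look like this: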

    +-----------+----------+--------+----------+
    | Longitude | Latitude | City   | Province |
    +-----------+----------+--------+----------+
    |  1        |  3       | London | London   |
    |  2        |  1       | Paris  | Paris    |
    |  2        |  3       | Dubai  | Dubai    |
    +-----------+----------+--------+----------+

【Question discussion】:

Tags: python python-3.x pyspark

【Solution 1】:

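from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Replace the bodies with your own lookup logic; each must return a string.
def city(lat, long):
    ...  # your code

def province(lat, long):
    ...  # your code

# Wrap the plain Python functions as UDFs that return strings.
cityUdf = udf(city, StringType())
provinceUdf = udf(province, StringType())

# Add one column per UDF, passing the existing columns as arguments.
df2 = df.withColumn("City", cityUdf(df["Latitude"], df["Longitude"]))
df3 = df2.withColumn("Province", provinceUdf(df2["Latitude"], df2["Longitude"]))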
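
The question describes a single function that returns both the city and the province, so one possible variation is a single UDF that returns a struct, which is then split into two columns. The following is only a sketch under assumptions: `locate`, `_SAMPLE_LOCATIONS` and `add_location_columns` are hypothetical names, and the lookup table simply mirrors the sample output from the question in place of the asker's real function. The intent is to run the lookup once per row rather than once per column, although how often Spark actually evaluates the UDF depends on the query plan.

```python
# Hypothetical stand-in for the asker's lookup function. The table mirrors
# the sample output in the question, keyed by (longitude, latitude).
_SAMPLE_LOCATIONS = {
    (1, 3): ("London", "London"),
    (2, 1): ("Paris", "Paris"),
    (2, 3): ("Dubai", "Dubai"),
}


def locate(lat, long):
    """Return a (city, province) tuple, or (None, None) if unknown."""
    return _SAMPLE_LOCATIONS.get((long, lat), (None, None))


def add_location_columns(df):
    """Return df with City and Province columns filled by a single UDF."""
    # Imported here so that locate() can be tried out without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType, StructField, StructType

    # The UDF returns a struct with two nullable string fields.
    location_type = StructType([
        StructField("City", StringType(), True),
        StructField("Province", StringType(), True),
    ])
    locate_udf = udf(locate, location_type)

    # Compute the struct once, then pull its fields out as top-level columns.
    with_location = df.withColumn(
        "location", locate_udf(df["Latitude"], df["Longitude"])
    )
    return with_location.select(
        "Longitude", "Latitude", "location.City", "location.Province"
    )
```

With the DataFrame from the question, `add_location_columns(df).show()` should print the four columns in the order requested. Coordinates missing from the lookup table come back as null in both new columns.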

【Discussion】: