【发布时间】:2022-01-13 13:51:58
【问题描述】:
下面是我的 DF:
deviceDict = {'TABLET' : 'MOBILE', 'PHONE':'MOBILE', 'PC':'Desktop', 'CEDEX' : '', 'ST' : 'SAINT', 'AV' : 'AVENUE', 'BD': 'BOULEVARD'}
df = spark.createDataFrame([('TABLET', 'DAF ST PAQ BD'), ('PHONE', 'AVOTHA'), ('PC', 'STPA CEDEX'), ('OTHER', 'AV DAF'), (None, None)], ["device_type", 'City'])
df.show()
输出:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| TABLET|DAF ST PAQ BD|
| PHONE| AVOTHA|
| PC| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
目的是替换键/值,解决方案来自Pyspark: Replacing value in a column by searching a dictionary
tests = df.na.replace(deviceDict, 1)
结果:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| MOBILE|DAF ST PAQ BD|
| MOBILE| AVOTHA|
| Desktop| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
它适用于 device_type,但我无法更改 city(即使使用子集)
预期输出:
+-----------+------------------------+
|device_type| City|
+-----------+------------------------+
| MOBILE| DAF SAINT PAQ BOULEVARD|
| MOBILE| AVOTHA|
| Desktop| STPA|
| OTHER| AVENUE DAF|
| null| null|
+-----------+------------------------+
【问题讨论】:
标签: python dataframe apache-spark pyspark apache-spark-sql