[Posted]: 2020-12-11 15:27:23
[Question]:
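I have a PySpark DataFrame with a column whose values are JSON strings. How can I count the entries in the list inside the dictionary that match a specific value, and report that count as a new column? I would like to do this with a Python function and a PySpark UDF.
For example, here is the DataFrame, df: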
+--------------------------------------------------------------------------+
|col                                                                       |
+--------------------------------------------------------------------------+
|{"field":{"list":[{"item":1,"upgrade":false},{"item":2,"upgrade":true}]}} |
|{"field":{"list":[{"item":1,"upgrade":false},{"item":2,"upgrade":false}]}}|
+--------------------------------------------------------------------------+
Here is what I am trying to do:
import json
from pyspark.sql import functions as F
from pyspark.sql import types as t

def upgrade_false(doc):
    string = str(doc)
    return string.count('"upgrade":false')
df2 = df.withColumn('upgrade_false', (F.udf(lambda j: upgrade_false(json.loads(j)), t.StringType()))('col'))
But it does not work. Can someone explain what might be going wrong?
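One likely cause can be reproduced in plain Python, without Spark. The sketch below assumes the column holds raw JSON text like the first sample row. Once `json.loads` has parsed the text, `str()` returns the Python representation of the dict (single quotes, `False` instead of `false`, a space after each colon), so the JSON-style substring can no longer be found. The variable names are illustrative only.

```python
import json

# First sample row from the question, as raw JSON text.
raw = '{"field":{"list":[{"item":1,"upgrade":false},{"item":2,"upgrade":true}]}}'

# str() of the parsed dict is a Python repr, not JSON:
# it contains 'upgrade': False rather than "upgrade":false.
parsed_repr = str(json.loads(raw))

# Searching the repr for the JSON-style substring finds nothing.
count_after_parsing = parsed_repr.count('"upgrade":false')

# Searching the raw text works, but only if the JSON is formatted
# exactly like this (no whitespace after the colon).
count_on_raw = raw.count('"upgrade":false')

print(parsed_repr)
print(count_after_parsing, count_on_raw)  # 0 1
```

The UDF in the question is also declared with `t.StringType()` although the desired column is a count, so an integer return type is probably the closer match.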
The desired result looks like this:
+--------------------------------------------------------------------------+-------------+
|col                                                                       |upgrade_false|
+--------------------------------------------------------------------------+-------------+
|{"field":{"list":[{"item":1,"upgrade":false},{"item":2,"upgrade":true}]}} |1            |
|{"field":{"list":[{"item":1,"upgrade":false},{"item":2,"upgrade":false}]}}|2            |
+--------------------------------------------------------------------------+-------------+
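A minimal sketch of one way to produce a column like the one above with a Python function and a UDF. It assumes every row follows the `field` -> `list` layout of the sample data and that a missing `upgrade` key should not be counted. The names `count_upgrade_false` and `with_upgrade_false` are hypothetical helpers, not part of PySpark. The function walks the parsed structure instead of searching a string, so it does not depend on how the JSON is formatted.

```python
import json


def count_upgrade_false(raw):
    """Count entries in field.list whose "upgrade" value is JSON false."""
    if raw is None:
        return None  # keep nulls as nulls
    doc = json.loads(raw)
    items = doc.get("field", {}).get("list", [])
    # json.loads maps JSON false to the Python bool False.
    return sum(1 for item in items if item.get("upgrade") is False)


def with_upgrade_false(df, col="col"):
    """Return df with an integer upgrade_false column (hypothetical helper)."""
    # Imported here so the counting function can be tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    counter = F.udf(count_upgrade_false, T.IntegerType())
    return df.withColumn("upgrade_false", counter(F.col(col)))
```

With the DataFrame from the question, `df2 = with_upgrade_false(df)` would be the call. Declaring `IntegerType` keeps the new column numeric rather than a string.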
[Discussion]:
Tags: python apache-spark dictionary pyspark apache-spark-sql