pyspark Column is not iterable using withColumn (answers)

【Question title】: pyspark Column is not iterable using withColumn
【Posted】: 2019-10-30 22:07:29
【Question】:

Why do I get a "Column is not iterable" error when using pyspark?

cost_allocation_df = cost_allocation_df.withColumn(
    'resource_tags_user_engagement',          
     f.when(
         (f.col('line_item_usage_account_id') == '123456789101', '1098765432101') &
         (f.col('resource_tags_user_engagement') == '' ) |
         (f.col('resource_tags_user_engagement').isNull()) |
         (f.col('resource_tags_user_engagement').rlike('^[a-zA-Z]')),
    '10546656565').otherwise(f.col('resource_tags_user_engagement'))
)

【Comments on the question】:

Your first expression contains three values, (f.col('line_item_usage_account_id') == '123456789101', '1098765432101'), but a comparison operator can only take one value on each side (two in total).
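The point in this comment can be reproduced in plain Python, without Spark. The sketch below is only an illustration: it uses an ordinary string (the hypothetical variable `account_id`) in place of a Column, and shows that the parenthesized "comparison, literal" pair is a tuple rather than a comparison against two values.

```python
# Plain-Python illustration; `account_id` is a hypothetical stand-in for the column value.
account_id = "1098765432101"

# Parentheses around "comparison, literal" build a 2-tuple. Only the first
# literal takes part in the comparison; the second is just carried along.
as_written = (account_id == "123456789101", "1098765432101")

# "Equals either value" written as two comparisons joined together...
either = (account_id == "123456789101") or (account_id == "1098765432101")

# ...or as a membership test.
member = account_id in ("123456789101", "1098765432101")

print(type(as_written).__name__, as_written)
print(either, member)
```

With PySpark columns the analogous spellings would be two parenthesized comparisons joined with `|`, or `Column.isin(...)`.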
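You can return only one value from a when clause and one value from otherwise, unless you nest them. You can write F.when(condition, return value).otherwise(return value), or F.when().when().when().otherwise(), or F.when(condition, F.when(condition, return value).otherwise(return value)).otherwise(return value). Please ignore typos and syntax.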

Tags: apache-spark pyspark apache-spark-sql

【Solution 1】:

You are comparing a column directly with a value, which will not work. You have to use lit() to create a column for that value.

Try changing your code to:

from pyspark.sql import functions as f

# & binds tighter than |: ((either account ID) AND tag == '') OR tag is null OR tag starts with a letter
cost_allocation_df = cost_allocation_df.withColumn(
    'resource_tags_user_engagement',
    f.when(
        ((f.col('line_item_usage_account_id') == f.lit('123456789101')) |
         (f.col('line_item_usage_account_id') == f.lit('1098765432101'))) &
        (f.col('resource_tags_user_engagement') == f.lit('')) |
        (f.col('resource_tags_user_engagement').isNull()) |
        (f.col('resource_tags_user_engagement').rlike('^[a-zA-Z]')),
        '10546656565'
    ).otherwise(f.col('resource_tags_user_engagement'))
)

【Comments】:

"You have to use lit() to create a column for that value": this is not true. f.col('line_item_usage_account_id') == '123456789101' works fine.