【Question Title】: Pyspark count for each distinct value in column for multiple columns
【Posted】: 2021-08-28 17:38:46
【Question Description】:

I have the following dataframe:

+---+-----+-------+---+
| id|state|country|zip|
+---+-----+-------+---+
|  1| AAA |    USA|123|
|  2| XXX |    CHN|234|
|  3| AAA |    USA|123|
|  4| PPP |    USA|222|
|  5| PPP |    USA|222|
|  5| XXX |    CHN|234|
+---+-----+-------+---+

I would like to create a flat dataframe containing, for each column, an array that counts each distinct value in that column, like this:

+------------------------------+--------------------+------------------------------+
|state                         |country             |zip                           |
+------------------------------+--------------------+------------------------------+
|[[AAA, 2], [PPP, 2], [XXX, 2]]|[[USA, 4], [CHN, 2]]|[[123, 2], [234, 2], [222, 2]]|
+------------------------------+--------------------+------------------------------+

The original table has more than 600 columns, but my goal is to do this only for the columns that contain fewer than 100 unique values in total.
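A minimal sketch of how these low-cardinality columns could be pre-selected (assuming an approximate distinct count is acceptable; the threshold variable and the exclusion of id are illustrative choices, not part of the question):

from pyspark.sql import functions as F

# approximate every column's distinct count in a single pass, then keep
# only the low-cardinality columns; approx_count_distinct trades a small
# error margin for speed, which matters with 600+ columns
threshold = 100
distinct_counts = df.select(
    [F.approx_count_distinct(c).alias(c) for c in df.columns if c != "id"]
).first().asDict()
low_cardinality_cols = [c for c, n in distinct_counts.items() if n < threshold]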

【Question Discussion】:

    Tags: python apache-spark pyspark


    【Solution 1】:

    You can count the values of each column separately and then join the results:

    from pyspark.sql import functions as F
    import functools

    df = ...

    # get all column names and remove the id column from this list
    cols = df.schema.fieldNames()
    cols.remove("id")

    # for each column, count the occurrences of each value and collect
    # the (value, count) pairs into a single array
    dfs = []
    for col in cols:
        dfs.append(df.groupBy(col).count().agg(F.collect_list(F.array(col, "count")).alias(col)))

    # combine the single-row results for each column into one dataset
    result = functools.reduce(lambda l, r: l.crossJoin(r), dfs)
    result.show(truncate=False)
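    Two details worth noting about this code: each per-column dataframe produced by agg() contains exactly one row, so the crossJoin chain is cheap despite its name; and F.array(col, "count") mixes a string column with a bigint count, which Spark resolves by coercing the count to string, so each pair ends up as an array<string>.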
    

    Output:

    +------------------------------+--------------------+------------------------------+
    |state                         |country             |zip                           |
    +------------------------------+--------------------+------------------------------+
    |[[PPP, 2], [XXX, 2], [AAA, 2]]|[[USA, 4], [CHN, 2]]|[[222, 2], [234, 2], [123, 2]]|
    +------------------------------+--------------------+------------------------------+
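    If running one aggregation job per column is too slow at 600+ columns, a single-pass variant is possible. The sketch below (my own variation, not part of the answer above) unpivots the columns with stack() and counts everything in one aggregation, producing one row per column instead of one wide row:

    from pyspark.sql import functions as F

    # unpivot the chosen columns into (col_name, value) rows, then count
    # every (column, value) pair in a single groupBy pass
    cols = ["state", "country", "zip"]
    stack_expr = "stack({}, {}) as (col_name, value)".format(
        len(cols),
        ", ".join("'{0}', cast({0} as string)".format(c) for c in cols),
    )
    long_counts = (
        df.selectExpr(stack_expr)
        .groupBy("col_name", "value")
        .count()
        .groupBy("col_name")
        .agg(F.collect_list(F.array("value", F.col("count").cast("string"))).alias("counts"))
    )
    long_counts.show(truncate=False)

    A pivot or a small collect on the driver can reshape this long result back into the wide layout if needed.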
    

    【Discussion】:
