【Title】: PySpark Dataframe Iterate Array Columns
【Posted】: 2022-06-29 00:35:14
【Question】:

In PySpark, I have a dataframe in which several columns contain arrays that I am trying to parse. The last two rows in the dataframe contain multiple values per array, and I want to split those out into separate rows.

+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| WB-API-CNTY | WB-API-UNIQUE | WB-OIL-CODE | WB-OIL-LSE-NBR     | WB-OIL-DIST  | WB-GAS-CODE | WB-GAS-RRC-ID        | WB-GAS-DIS   |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 449         | 80212         | []          | []                 | []           | []          | []                   | []           |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 449         | 80214         | ["O"]       | ["05361"]          | ["06"]       | ["O"]       | ["060536"]           | ["00"]       |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 449         | 80222         | ["O", "O"]  | ["01718", "05492"] | ["06", "06"] | ["O", "O"]  | ["060171", "060549"] | ["00", "00"] |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+
| 451         | 00005         | ["G", "O"]  | ["5568", "04351"]  | ["10", "09"] | ["G", "O"]  | ["105568", "090435"] | ["09", "00"] |
+-------------+---------------+-------------+--------------------+--------------+-------------+----------------------+--------------+

Desired result:

+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| WB-API-CNTY | WB-API-UNIQUE | WB-OIL-CODE | WB-OIL-LSE-NBR | WB-OIL-DIST | WB-GAS-CODE | WB-GAS-RRC-ID | WB-GAS-DIS |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449         | 80212         |             |                |             |             |               |            |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449         | 80214         | O           | 05361          | 06          | O           | 060536        | 00         |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449         | 80222         | O           | 01718          | 06          | O           | 060171        | 00         |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 449         | 80222         | O           | 05492          | 06          | O           | 060549        | 00         |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 451         | 00005         | G           | 5568           | 10          | G           | 105568        | 09         |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+
| 451         | 00005         | O           | 04351          | 09          | O           | 090435        | 00         |
+-------------+---------------+-------------+----------------+-------------+-------------+---------------+------------+

【Comments】:

Tags: arrays dataframe apache-spark pyspark apache-spark-sql


【Solution 1】:
from pyspark.sql import functions as F

array_cols = ['WB-OIL-CODE', 'WB-OIL-LSE-NBR', 'WB-OIL-DIST', 'WB-GAS-CODE', 'WB-GAS-RRC-ID', 'WB-GAS-DIS']
other_cols = [c for c in df.columns if c not in array_cols]

# Zip the arrays element-wise into one array of structs, then explode it.
# explode_outer (rather than explode) keeps rows whose arrays are empty,
# such as the 449/80212 row, emitting nulls for the array columns.
zipped = F.arrays_zip(*array_cols)
df = df.select(
    *other_cols,
    F.explode_outer(zipped).alias('zipped')
).select(
    *other_cols,
    # getField sidesteps parsing problems caused by the hyphens
    # in the column names (a plain 'zipped.WB-OIL-CODE' string
    # would need backtick quoting)
    *[F.col('zipped').getField(c).alias(c) for c in array_cols]
)

【Comments】:
