[Posted]: 2019-07-21 12:22:38
[Question]:
Here is a sample DataFrame snippet:
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_lid |trace |message |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1103960793391132675|47c10fda9b40407c998c154dc71a9e8c|[app.py:208] Prediction label: {"id": 617, "name": "CENSORED"}, score=0.3874854505062103 |
|1103960793391132676|47c10fda9b40407c998c154dc71a9e8c|[app.py:224] Similarity values: [0.6530804801919593, 0.6359653379418201] |
|1103960793391132677|47c10fda9b40407c998c154dc71a9e8c|[app.py:317] Predict=s3://CENSORED/scan_4745/scan4745_t1_r0_c9_2019-07-15-10-32-43.jpg trait_id=112 result=InferenceResult(predictions=[Prediction(label_id='230', label_name='H3', probability=0.0), Prediction(label_id='231', label_name='Other', probability=1.0)], selected=Prediction(label_id='231', label_name='Other', probability=1.0)). Took 1.3637824058532715 seconds |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have millions of log-like rows with this structure, and they can all be grouped by the trace, which is unique per session.
I want to collapse each such set of rows into a single row, essentially mapping over them. For this example, I would extract "id": 617 from the first row, the values 0.6530804801919593, 0.6359653379418201 from the second row, and the selected Prediction(label_id='231', label_name='Other', probability=1.0) value from the third row.
I would then build a new table with the columns:
| trace | id | similarity | selected |
with the values:
| 47c10fda9b40407c998c154dc71a9e8c | 617 | 0.6530804801919593, 0.6359653379418201 | 231 |
How should I implement this grouped mapping transformation over multiple rows in pyspark?
[Comments]:
Tags: dataframe apache-spark pyspark apache-spark-sql