[Posted on]: 2022-01-12 00:08:34
[Problem description]:
This is my PySpark dataframe:
+--------------------------------------------------+----------+
|date |date_count|
+--------------------------------------------------+----------+
|[20210629, 20210629] |495 |
|[20210619, 20210619, 20210619] |1781 |
|[20210611] |3675263 |
|[20210611, 20210611, 20210611, 20210611, 20210611]|3 |
+--------------------------------------------------+----------+
To give you some context, it comes from an aggregation like this:
from pyspark.sql.functions import max as pyspark_max, min as pyspark_min, sum as pyspark_sum, avg, count
# note: header/inferSchema/delimiter are CSV reader options and have no effect on parquet
timeseries_monthly = spark.read.options(header='True', inferSchema='True', delimiter=',').parquet("url...")
date = timeseries_monthly.select(timeseries_monthly["gps.date"])
date.groupBy('date').agg(count('date').alias('date_count')).show(4, truncate=False)
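Note that `gps.date` here is an array-typed column, so `groupBy('date')` groups by the *whole array*: two arrays holding the same value but differing in length are distinct keys. A minimal pure-Python analogy of that grouping behavior (the data is illustrative, not from the real parquet):

```python
from collections import Counter

# Tuples stand in for the array values of the 'date' column.
# Same date value, different lengths -> distinct grouping keys,
# which is why 20210611 shows up in two separate rows above.
rows = [
    (20210611,),       # a one-element array
    (20210611,) * 5,   # a five-element array of the same value
]
counts = Counter(rows)
print(len(counts))  # the two arrays are counted as separate groups
```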
This is my expected output:
+----------+----------+
|date |date_count|
+----------+----------+
|20210629 |495 |
|20210619 |1781 |
|20210611 |3675263 |
|20210611 |3 |
+----------+----------+
[Discussion]:
Tags: pandas dataframe apache-spark pyspark