[Posted at]: 2020-01-23 10:37:45
[Question]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# `spark` is predefined in the pyspark shell; create it explicitly otherwise
spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("type", StringType(), True),
                     StructField("average", IntegerType(), True)])
values = [('A', 19), ('B', 17), ('C', 10)]
df = spark.createDataFrame(values, schema)
parts = df.rdd.getNumPartitions()
print(parts)
The output is 44.
How does Spark create 44 partitions for a DataFrame with only 3 records?
import pyspark.sql.functions as F
df.withColumn('p_id', F.spark_partition_id()).show()
Output:
+----+-------+----+
|type|average|p_id|
+----+-------+----+
| A| 19| 14|
| B| 17| 29|
| C| 10| 43|
+----+-------+----+
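(For context: the partition count comes from spark.default.parallelism, which by default equals the total number of cores available to the application, and createDataFrame slices the local list evenly across that many partitions, leaving most of them empty. A minimal pure-Python sketch of that slicing arithmetic, assuming a parallelism of 44 and using a hypothetical helper name slice_partitions, reproduces the p_id values shown above:)

```python
# Sketch of how a local collection is sliced into partitions,
# mirroring the index arithmetic pyspark uses when parallelizing
# a list into num_slices partitions. This is an illustration, not
# Spark's actual code path.

def slice_partitions(data, num_slices):
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

values = [('A', 19), ('B', 17), ('C', 10)]
parts = slice_partitions(values, 44)

# Only 3 of the 44 slices are non-empty; the rest hold no rows.
non_empty = [i for i, p in enumerate(parts) if p]
print(non_empty)  # [14, 29, 43] -- matching the p_id column above
```

Under this slicing, record j lands in the slice whose index range covers it, which is why the three rows appear in partitions 14, 29, and 43 rather than 0, 1, and 2.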
[Discussion]:
Tags: apache-spark pyspark