[Posted]: 2021-11-12 05:26:48
[Problem Description]:
I am running Spark in standalone mode.
rdd
{
"name": "adam",
"gender": "male",
"new_column": "white,black,yellow"
}
new rdd
{
"name": "adam",
"gender": "male",
"new_column": "white"
}
{
"name": "adam",
"gender": "male",
"new_column": "black"
}
{
"name": "adam",
"gender": "male",
"new_column": "yellow"
}
The code is as follows:
from pyspark.sql import functions as F
df = spark.read.json("test.json")
df2 = df.withColumn("new_column", F.explode(F.split("new_column", ",")))
df2.write.format('json').save('final')
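For reference, the split-then-explode transformation above turns one input row into one output row per comma-separated value. A plain-Python sketch of the same semantics (the helper `explode_column` is hypothetical, for illustration only; in Spark this work is distributed across executors) is:

```python
# Plain-Python illustration of the F.explode(F.split(...)) semantics.
# `explode_column` is a hypothetical helper, not part of any Spark API.
def explode_column(row, column, sep=","):
    """Yield one copy of `row` per delimited value in `column`."""
    for value in row[column].split(sep):
        yield {**row, column: value}

row = {"name": "adam", "gender": "male", "new_column": "white,black,yellow"}
rows = list(explode_column(row, "new_column"))
# Produces three rows: one each for "white", "black", and "yellow".
```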
Schema
df.printSchema()
root
|-- name: string (nullable = true)
|-- gender: string (nullable = true)
|-- new_column: string (nullable = true)
No errors occur when running the file:
PYTHONSTARTUP=test.py pyspark
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/09/17 10:56:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/09/17 10:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/09/17 10:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Python version 3.6.9 (default, Jan 26 2021 15:33:00)
Spark context Web UI available at http://root1-System-Product-Name:4042
Spark context available as 'sc' (master = local[*], app id = local-1631872561867).
SparkSession available as 'spark'.
Can anyone explain this?
Also, what is the best way to save a PySpark dataframe for further use?
Thanks very much!
A few things to note:
I am using PySpark 3.1.1
Python version 3.6.9
[Comments]:
-
I am fairly sure your data has already been saved next to test.json. Keep in mind that the output is a directory containing one or more files.
Tags: python dataframe apache-spark pyspark