[Posted]: 2021-11-12 05:26:48
[Problem Description]:
I am running Spark in standalone mode.
rdd
{
"name": "adam",
"gender": "male",
"new_column": "white,black,yellow"
}
new rdd
{
"name": "adam",
"gender": "male",
"new_column": "white"
}
{
"name": "adam",
"gender": "male",
"new_column": "black"
}
{
"name": "adam",
"gender": "male",
"new_column": "yellow"
}
The code is as follows:
from pyspark.sql import functions as F
df = spark.read.json("test.json")
df2 = df.withColumn("new_column", F.explode(F.split("new_column", ",")))
df2.write.format('json').save('final')
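For reference, the split-then-explode transformation above turns one input row into one output row per comma-separated value. A plain-Python sketch of the same semantics (the helper `explode_column` is hypothetical, for illustration only; in Spark this work is distributed across executors) is:

```python
# Plain-Python illustration of the F.explode(F.split(...)) semantics.
# `explode_column` is a hypothetical helper, not part of any Spark API.
def explode_column(row, column, sep=","):
    """Yield one copy of `row` per delimited value in `column`."""
    for value in row[column].split(sep):
        yield {**row, column: value}

row = {"name": "adam", "gender": "male", "new_column": "white,black,yellow"}
rows = list(explode_column(row, "new_column"))
# Produces three rows: one each for "white", "black", and "yellow".
```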
Schema
df.printSchema()
root
|-- name: string (nullable = true)
|-- gender: string (nullable = true)
|-- new_column: string (nullable = true)
No errors occur when running the file:
PYTHONSTARTUP=test.py pyspark
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/09/17 10:56:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/09/17 10:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/09/17 10:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Python version 3.6.9 (default, Jan 26 2021 15:33:00)
Spark context Web UI available at http://root1-System-Product-Name:4042
Spark context available as 'sc' (master = local[*], app id = local-1631872561867).
SparkSession available as 'spark'.
Can anyone explain this?
Also, what is the best way to save a PySpark dataframe for further use?
Thanks very much!
A few things to note:
I am using PySpark 3.1.1
Python version 3.6.9
[Comments]:
-
I am fairly sure your data has already been saved next to test.json. Keep in mind that the output is a directory containing one or more files.
Tags: python dataframe apache-spark pyspark