【Posted】: 2018-08-27 23:01:25
【Problem Description】:
I need to read a file line by line, split each line into words, and perform operations on those words.
How can I do this?
I wrote the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

logFile = "/home/hadoop/spark-2.3.1-bin-hadoop2.7/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp1").getOrCreate()
logData = spark.read.text(logFile).cache()
logData.printSchema()
logDataLines = logData.collect()

# The line variable below seems to be of type Row. How do I perform similar
# operations on a Row, or how do I convert a Row to a string?
for line in logDataLines:
    words = line.select(explode(split(line, "\s+")))
    for word in words:
        print(word)
    print("----------------------------------")
【Discussion】:
- By using collect() you pull all of the data onto the driver node, i.e. if you do that there is no point in using Spark at all. This question shows how to split a DataFrame column and explode it: stackoverflow.com/questions/38210507/explode-in-pyspark
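For reference, a minimal sketch of the DataFrame-level approach the linked question describes, assuming the logData DataFrame from the code above (the alias "word" is only illustrative):

from pyspark.sql.functions import explode, split

# Split the single "value" column on whitespace and explode it into one word per row,
# keeping the work distributed instead of collecting everything to the driver.
words = logData.select(explode(split(logData.value, r"\s+")).alias("word"))
words.show(truncate=False)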
Tags: apache-spark pyspark pyspark-sql