Spark 无法正确读取文件答案

【问题标题】：Spark doesn't read the file properlySpark 无法正确读取文件
【发布时间】：2018-07-24 17:34:08
【问题描述】：

我运行 Flume 将 Twitter 数据摄取到 HDFS（JSON 格式）并运行 Spark 来读取该文件。

但不知何故，它没有返回正确的结果：文件的内容似乎没有更新。

这是我的 Flume 配置：

TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS

TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = some_keywords

TwitterAgent01.sinks.HDFS.channel = MemoryChannel01
TwitterAgent01.sinks.HDFS.type = hdfs
TwitterAgent01.sinks.HDFS.hdfs.path = hdfs://hadoop01:8020/warehouse/raw/twitter/provider/m=%Y%m/
TwitterAgent01.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent01.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent01.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent01.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent01.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent01.sinks.HDFS.hdfs.rollInterval = 86400

TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000

然后我用hdfs dfs -cat检查输出，它返回超过1000行，这意味着数据被成功插入。

但在 Spark 中并非如此

spark.read.json("/warehouse/raw/twitter/provider").filter("m=201802").show()

只有 6 行。

我错过了什么吗？

【问题讨论】：

标签： apache-spark hdfs flume

【解决方案1】：

我不完全确定您为什么将路径的后半部分指定为filter 的条件表达式。

我相信你可以正确阅读你的文件：

spark.read.json("/warehouse/raw/twitter/provider/m=201802").show()

【讨论】：

感谢@stefanobaghino，我也试过了...但是，spark 返回的行数不正确
@Yusata 你用的是什么版本的 Spark？
我使用 spark 2.2.0、flume 1.7.0 和 hadoop 2.7.3
你能检查一下 Flume 是如何格式化你的 JSON 输出的吗？您可能需要设置spark.read.option("multiLine", true).json(path)，具体取决于 Flume 是输出带有对象数组的实际 JSON 还是每行一个对象的文本文件，这是 Spark 默认情况下所期望的。