【Question Title】: Cannot read a file from HDFS using Spark
【Posted】: 2015-02-13 04:44:57
【Question】:

I have installed Cloudera CDH 5 using Cloudera Manager.

I can easily run

hadoop fs -ls /input/war-and-peace.txt
hadoop fs -cat /input/war-and-peace.txt

The commands above print the whole text file to the console.

Now I start the Spark shell and run

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
textFile.count

and I get this error:

Spark context available as sc.

scala> val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
2014-12-14 15:14:57,874 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(177621) called with curMem=0, maxMem=278302556
2014-12-14 15:14:57,877 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0 stored as values in memory (estimated size 173.5 KB, free 265.2 MB)
textFile: org.apache.spark.rdd.RDD[String] = hdfs://input/war-and-peace.txt MappedRDD[1] at textFile at <console>:12

scala> textFile.count
2014-12-14 15:15:21,791 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 0 time(s); maxRetries=45
2014-12-14 15:15:41,905 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 1 time(s); maxRetries=45
2014-12-14 15:16:01,925 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 2 time(s); maxRetries=45
2014-12-14 15:16:21,983 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 3 time(s); maxRetries=45
2014-12-14 15:16:42,001 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 4 time(s); maxRetries=45
2014-12-14 15:17:02,062 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 5 time(s); maxRetries=45
2014-12-14 15:17:22,082 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 6 time(s); maxRetries=45
2014-12-14 15:17:42,116 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 7 time(s); maxRetries=45
2014-12-14 15:18:02,138 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 8 time(s); maxRetries=45
2014-12-14 15:18:22,298 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 9 time(s); maxRetries=45
2014-12-14 15:18:42,319 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 10 time(s); maxRetries=45
2014-12-14 15:19:02,354 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 11 time(s); maxRetries=45
2014-12-14 15:19:22,373 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 12 time(s); maxRetries=45
2014-12-14 15:19:42,424 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 13 time(s); maxRetries=45
2014-12-14 15:20:02,446 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 14 time(s); maxRetries=45
2014-12-14 15:20:22,512 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 15 time(s); maxRetries=45
2014-12-14 15:20:42,515 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 16 time(s); maxRetries=45
2014-12-14 15:21:02,550 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 17 time(s); maxRetries=45
2014-12-14 15:21:22,558 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 18 time(s); maxRetries=45
2014-12-14 15:21:42,683 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 19 time(s); maxRetries=45
2014-12-14 15:22:02,702 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 20 time(s); maxRetries=45
2014-12-14 15:22:22,832 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 21 time(s); maxRetries=45
2014-12-14 15:22:42,852 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 22 time(s); maxRetries=45
2014-12-14 15:23:02,974 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 23 time(s); maxRetries=45
2014-12-14 15:23:22,995 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 24 time(s); maxRetries=45
2014-12-14 15:23:43,109 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 25 time(s); maxRetries=45
2014-12-14 15:24:03,128 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 26 time(s); maxRetries=45
2014-12-14 15:24:23,250 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 27 time(s); maxRetries=45
java.net.ConnectException: Call From dn1home/192.168.1.21 to input:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1415)

Why do I get this error? I am able to read the same file using the hadoop commands.

【Question Discussion】:

    Tags: hadoop apache-spark cloudera-cdh


    【Solution 1】:

    The fix:

    sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
    

    How did I find nn1home:8020?

    Just search for the file core-site.xml and look for the xml element fs.defaultFS.
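    As a quick programmatic way to pull out that value, here is a minimal sketch using Python's standard-library XML parser (the helper name and the sample fragment are mine, for illustration; a real core-site.xml uses the same <property>/<name>/<value> layout):

```python
import xml.etree.ElementTree as ET

def default_fs(core_site_xml: str) -> str:
    """Return the value of fs.defaultFS from the contents of core-site.xml."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    raise KeyError("fs.defaultFS not found in core-site.xml")

# A fragment shaped like a typical core-site.xml:
sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1home:8020</value>
  </property>
</configuration>"""

print(default_fs(sample))  # hdfs://nn1home:8020
```

    Whatever this prints is exactly the prefix to put in front of the file path in sc.textFile.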

    【Discussion】:

    • core-site.xml is always located in the conf directory of your local or cluster Spark installation.
    • For me the config file is at $HADOOP_HOME/etc/hadoop/core-site.xml
    • Without nn1home:8020 it should be sc.textFile("hdfs:////input/war-and-peace.txt")
    • This is for when you run the code on the hadoop cluster itself, not remotely. Right?
    • I am running on GCP Dataproc and there is no $HADOOP_HOME environment variable set. How do I find core-site.xml?
    【Solution 2】:

    If you want to use sc.textFile("hdfs://..."), you need to give the full (absolute) path, which in your example would be "nn1home:8020/.."

    If you want to keep it simple, then use sc.textFile("hdfs:/input/war-and-peace.txt")

    Only one /

    【Discussion】:

      【Solution 3】:

      This will work:

      val textFile = sc.textFile("hdfs://localhost:9000/user/input.txt")
      

      Here, you get localhost:9000 from the value of the fs.defaultFS parameter in the hadoop core-site.xml config file.

      【Discussion】:

        【Solution 4】:

        You are not passing a proper url string.

        • hdfs:// - the protocol type
        • localhost - the ip address (may be different for you, e.g. 127.56.78.4)
        • 54310 - the port number
        • /input/war-and-peace.txt - the full path of the file you want to load.

        Finally, the URL should look like this:

        hdfs://localhost:54310/input/war-and-peace.txt
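        To make those four parts explicit, here is a small illustrative helper (the function name is hypothetical, not part of the answer) that assembles and sanity-checks such a URL:

```python
from urllib.parse import urlparse

def hdfs_url(host: str, port: int, path: str) -> str:
    """Assemble an hdfs:// URL from a namenode host, port, and absolute file path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /input/war-and-peace.txt")
    return f"hdfs://{host}:{port}{path}"

url = hdfs_url("localhost", 54310, "/input/war-and-peace.txt")
print(url)  # hdfs://localhost:54310/input/war-and-peace.txt

# urlparse splits it back into the same pieces:
parts = urlparse(url)
print(parts.scheme, parts.hostname, parts.port, parts.path)
```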
        

        【Discussion】:

          【Solution 5】:

          If you start spark with HADOOP_HOME set in spark-env.sh, spark will know where to look for the hdfs configuration files.

          In that case spark already knows the location of your namenode/datanode, and you can access hdfs files simply as below:

          sc.textFile("/myhdfsdirectory/myfiletoprocess.txt")
          

          You can create your myhdfs directory as below:

          hdfs dfs -mkdir /myhdfsdirectory
          

          From your local file system you can move myfiletoprocess.txt to the hdfs directory with:

          hdfs dfs -copyFromLocal mylocalfile /myhdfsdirectory/myfiletoprocess.txt
          

          【Discussion】:

            【Solution 6】:

            I am also using CDH 5. For me the full path, i.e. "hdfs://nn1home:8020", did not work for some strange reason. Most examples show the path like that.

            I used a command like

            val textFile=sc.textFile("hdfs:/input1/Card_History2016_3rdFloor.csv")
            

            Output of the command above:

            textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:22
            
            textFile.count
            
            res1: Long = 58973  
            

            This works fine for me.

            【Discussion】:

              【Solution 7】:

              This worked for me:

                 logFile = "hdfs://localhost:9000/sampledata/sample.txt"
              

              【Discussion】:

                【Solution 8】:
                val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
                conf.set("fs.defaultFS", "hdfs://hostname:9000")
                val sc = new SparkContext(conf)
                val data = sc.textFile("hdfs://hostname:9000/hdfspath/")
                data.saveAsTextFile("C:\\dummy")
                

                The code above reads all hdfs files from the directory and saves them locally in the C:\dummy folder.

                【Discussion】:

                  【Solution 9】:

                  It could also be an issue with the file path or URL and the hdfs port.

                  Solution: First open the core-site.xml file under $HADOOP_HOME/etc/hadoop and check the value of the property fs.defaultFS. Say the value is hdfs://localhost:9000 and the file location in hdfs is /home/usr/abc/fileName.txt. Then the file URL is hdfs://localhost:9000/home/usr/abc/fileName.txt, and the following command reads the file from hdfs:

                  var result= scontext.textFile("hdfs://localhost:9000/home/usr/abc/fileName.txt", 2)
                  

                  【Discussion】:

                    【Solution 10】:

                    Get the fs.defaultFS URL from core-site.xml (/etc/hadoop/conf) and read the file as below. In my case, fs.defaultFS is hdfs://quickstart.cloudera:8020

                    txtfile = sc.textFile('hdfs://quickstart.cloudera:8020/user/cloudera/rddoutput')
                    txtfile.collect()

                    【Discussion】:
