Start Hadoop
root@master:/usr/local/hadoop-2.7.5/sbin# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop-2.7.5/logs/hadoop-root-namenode-master.out
slave02: starting datanode, logging to /usr/local/hadoop-2.7.5/logs/hadoop-root-datanode-slave02.out
slave01: starting datanode, logging to /usr/local/hadoop-2.7.5/logs/hadoop-root-datanode-slave01.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.7.5/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.7.5/logs/yarn-root-resourcemanager-master.out
slave02: starting nodemanager, logging to /usr/local/hadoop-2.7.5/logs/yarn-root-nodemanager-slave02.out
slave01: starting nodemanager, logging to /usr/local/hadoop-2.7.5/logs/yarn-root-nodemanager-slave01.out
root@master:/usr/local/hadoop-2.7.5/sbin#
Start Spark
root@master:/usr/local/spark/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
slave01: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave01.out
slave02: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave02.out
slave01: failed to launch org.apache.spark.deploy.worker.Worker:
slave01: full log in /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave01.out
slave02: failed to launch org.apache.spark.deploy.worker.Worker:
slave02: full log in /usr/local/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave02.out
root@master:/usr/local/spark/sbin#
Check the running processes
root@master:/usr/local/spark/sbin# jps
3042 Master
3124 Jps
2565 NameNode
565 ResourceManager
2758 SecondaryNameNode
root@slave01:/usr/bin# jps
1152 Jps
922 NodeManager
812 DataNode
1084 Worker
root@slave02:/usr/local/spark/python/lib# jps
993 Worker
721 DataNode
1061 Jps
831 NodeManager
Check the web UI
On the host machine (i.e., the VM where Docker is installed), open a browser and go to the master's IP at port 8080. At this point the host can reach the containers running inside Docker.
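If the master container's IP is not obvious, it can usually be looked up from the host with docker inspect (assuming the container is named master):

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' master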
Run a Python program
root@master:~/pysparkfile# python3 text.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/04/22 07:50:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Lines with a: 61, Lines with b: 27
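text.py itself is not listed here; a minimal sketch of what it presumably contains, matching the interactive session and output shown later:

# text.py (assumed content): count lines of Spark's README.md containing 'a' and 'b'
from pyspark import SparkContext

sc = SparkContext('local', 'test')                       # local mode, application name "test"
logData = sc.textFile("file:///usr/local/spark/README.md", 2).cache()
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))
sc.stop()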
Checking the web UI, nothing has changed.
Start pyspark
root@master:/usr/local/spark# pyspark
/usr/local/spark/bin/pyspark: line 53: python: command not found
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/04/22 07:53:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Python version 3.5.2 (default, Nov 23 2017 16:37:01)
SparkSession available as 'spark'.
>>>
Again, the web UI shows no change.
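This is expected: started with no --master argument (and assuming spark.master is not set in spark-defaults.conf), pyspark runs in local mode, so no application registers with the standalone master UI on port 8080. To make the shell show up there, it would have to be started against the standalone master, e.g.:

pyspark --master spark://master:7077   # 7077 is the default standalone master port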
Start pyspark in YARN mode
root@master:/usr/local/spark# pyspark --master yarn
/usr/local/spark/bin/pyspark: line 53: python: command not found
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/04/22 07:55:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 07:55:29 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/04/22 08:02:33 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
18/04/22 08:02:33 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
18/04/22 08:02:34 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/shell.py", line 42, in <module>
    SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
TypeError: 'JavaPackage' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/shell.py", line 49, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/usr/local/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/usr/local/spark/python/pyspark/context.py", line 294, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/usr/local/spark/python/pyspark/context.py", line 115, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/spark/python/pyspark/context.py", line 168, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/spark/python/pyspark/context.py", line 233, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
>>>
>>>
Run an application
>>> from pyspark import SparkContext
>>> sc = SparkContext('local', 'test')
18/04/22 08:10:14 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
>>> logFile = "file:///usr/local/spark/README.md"
>>> logData = sc.textFile(logFile, 2).cache()
>>> numAs = logData.filter(lambda line: 'a' in line).count()
>>> numBs = logData.filter(lambda line: 'b' in line).count()
>>> print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))
Lines with a: 61, Lines with b: 27
The final result is correct, but the web UI shows the application as FAILED, and many warnings and errors appeared during the run.
Errors encountered
18/04/22 07:55:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 07:55:29 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/04/22 08:02:33 ERROR spark.SparkContext: Error initializing SparkContext.
For the first warning
Check the native library dependencies (the first suspicion was that the library might be 32-bit):
root@master:/usr/local/spark# cd ../hadoop-2.7.5/lib
root@master:/usr/local/hadoop-2.7.5/lib# ll
total 16
drwxr-xr-x 3 20415 systemd-journal 4096 Dec 16 01:12 ./
drwxr-xr-x 1 20415 systemd-journal 4096 Apr 18 08:10 ../
drwxr-xr-x 2 20415 systemd-journal 4096 Dec 16 01:12 native/
root@master:/usr/local/hadoop-2.7.5/lib# cd native/
root@master:/usr/local/hadoop-2.7.5/lib/native# ldd libhadoop.so.1.0.0
        linux-vdso.so.1 => (0x00007ffc05203000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fd27ad52000)
        libjvm.so => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd27a988000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fd27b175000)
Check whether the container OS is 32-bit or 64-bit:
root@master:/usr/local/hadoop-2.7.5/lib/native# getconf LONG_BIT
64
Check the bitness of the Hadoop native library:
root@master:/usr/local/hadoop-2.7.5/lib/native# file libhadoop.so.1.0.0
libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=612c3e78dc66f30ab64ef032524d92022eebe34e, not stripped
The library is 64-bit, matching the 64-bit system, so the architecture is not the problem; the library path configuration is the next suspect.
The fix
Edit .bashrc
and add:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
Don't forget to run source ~/.bashrc afterwards.
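A quick sanity check that the new variables are picked up by the shell:

source ~/.bashrc
echo $HADOOP_OPTS    # should print the -Djava.library.path=... value configured above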
Edit hadoop-env.sh
and add:
export HADOOP_HOME=/usr/local/hadoop-2.7.5/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
Running again, the output shows:
18/04/22 08:52:27 DEBUG util.NativeCodeLoader: java.library.path=/usr/local/hadoop-2.7.5/bin/lib:/usr/local/hadoop-2.7.5/bin/lib/native
18/04/22 08:52:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Check whether the Hadoop native libraries load successfully:
root@master:/usr/local/spark/sbin# hadoop checknative
18/04/22 09:12:13 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
18/04/22 09:12:13 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
18/04/22 09:12:13 DEBUG util.NativeCodeLoader: java.library.path=/usr/local/hadoop-2.7.5/bin/lib/native
18/04/22 09:12:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/22 09:12:14 DEBUG util.Shell: setsid exited with exit code 0
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false
openssl: false
18/04/22 09:12:14 INFO util.ExitUtil: Exiting with status 1
root@master:/usr/local/spark/sbin#
The output shows java.library.path=/usr/local/hadoop-2.7.5/bin/lib/native, a path that obviously does not exist, so the configuration above was wrong. Reconfigure it.
In hadoop-env.sh, configure:
export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop-2.7.5/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
After this change the problem is resolved.
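To confirm, re-run the native-library check; with the corrected java.library.path, it should now report hadoop: true (the compression libraries depend on what was compiled in):

hadoop checknative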
root@master:/usr/local/spark/sbin# pyspark --master yarn-cluster
/usr/local/spark/bin/pyspark: line 53: python: command not found
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
Error: Cluster deploy mode is not applicable to Spark shells.
Run with --help for usage help or --verbose for debug output
Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "/usr/local/spark/python/pyspark/context.py", line 243, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/usr/local/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
>>>
Notes
After running export HADOOP_ROOT_LOGGER=DEBUG,console, starting Hadoop again prints a large amount of DEBUG output.
Running pyspark --master yarn --deploy-mode cluster produces the error:
Error: Cluster deploy mode is not applicable to Spark shells.
This is because the Spark shell is an interactive command line, so the driver must run locally (client mode) rather than on YARN.
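So the shell has to run in client deploy mode; cluster mode only applies to jobs submitted with spark-submit. For an interactive shell on YARN:

pyspark --master yarn --deploy-mode client   # equivalent to pyspark --master yarn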
Fix for the first warning (the native-hadoop message) when launching Spark
Add to spark-env.sh:
export LD_LIBRARY_PATH=/usr/local/hadoop-2.7.5/lib/native:$LD_LIBRARY_PATH   # note: no spaces around the equals sign
Run the Python program on YARN
spark-submit --master yarn /root/pysparkfile/text.py
Check the YARN cluster management web UI (the ResourceManager UI, at master:8088 by default).
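The application status can also be checked from the command line on the master, for example:

yarn application -list                        # currently running applications
yarn application -status <application_id>     # details for a specific application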